Compilation is the process of converting high-level source code (such as C, C++, or Java) into machine-readable instructions that a computer's processor can understand and execute. The source code is first converted into object code, which is a machine code format that isn't yet executable. The object code is then linked to produce an executable which will either be:
Lexical analysis is the first stage of the compilation process. It breaks the source code down into its fundamental components, called tokens, to prepare the input for the next stage of compilation.
The lexer, or lexical analyzer, is the component that performs lexical analysis. It reads the source code character by character and groups the characters into tokens. This involves identifying the keywords, operators, literals, identifiers, and punctuation that make up the code. As the lexer goes through the code, it discards whitespace and comments, which later stages do not need unless they serve a special purpose in the language (e.g. indentation in Python).
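The behaviour described above can be sketched as a small regular-expression-based lexer. This is only an illustration: the token categories, the keyword set, and the names `TOKEN_SPEC`, `KEYWORDS`, and `tokenize` are assumptions made for a tiny C-like language, not part of any particular compiler.

```python
import re

# Hypothetical token categories for a tiny C-like language (illustrative only).
# Order matters: COMMENT must come before OPERATOR so "//" is not read as "/".
TOKEN_SPEC = [
    ("COMMENT",    r"//[^\n]*"),
    ("NUMBER",     r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR",   r"[+\-*/=]"),
    ("PUNCT",      r"[;(){}]"),
    ("WHITESPACE", r"\s+"),
    ("MISMATCH",   r"."),
]
KEYWORDS = {"int", "return", "if", "else", "while"}  # assumed keyword set
MASTER_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)

def tokenize(source):
    """Return a list of (kind, text) tokens, discarding whitespace and comments."""
    tokens = []
    for match in MASTER_RE.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind in ("WHITESPACE", "COMMENT"):
            continue  # not needed by later stages in this toy language
        if kind == "MISMATCH":
            raise SyntaxError(
                f"unexpected character {text!r} at offset {match.start()}"
            )
        if kind == "IDENTIFIER" and text in KEYWORDS:
            kind = "KEYWORD"  # reserved words look like identifiers at first
        tokens.append((kind, text))
    return tokens
```

Under these assumptions, `tokenize("int x = 42; // answer")` yields a keyword, an identifier, an operator, a number, and a punctuation token, with the spaces and the trailing comment dropped.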
Identifiers found among the tokens are entered into a symbol table, which is used during the later stages of compilation because it provides an efficient way to store and look up information about identifiers in the program. It helps with:
Syntax analysis, also known as parsing, ensures that the sequence of tokens provided by the lexer adheres to the syntactical rules of the programming language the code is written in. During this stage, the syntax analyzer checks for and reports errors, and builds an abstract syntax tree (AST).
Each programming language has production rules (its grammar) which must be followed. If the tokens do not match the rules, or if there's an unexpected token, the syntax analyzer reports a syntax error. Once the syntax analyzer has verified that the sequence of tokens follows the rules, it generates an abstract syntax tree as its output.
The tree represents the structure of the program and serves as the input for the next stages. Each interior node in the tree represents a language construct (such as an expression, statement, or operator), and the leaves represent the actual values from the token stream (such as numbers and identifiers).
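As a rough illustration of how production rules turn a token sequence into a tree, here is a minimal recursive-descent parser for arithmetic expressions. The grammar, the nested-tuple AST shape, and the function name `parse` are all assumptions chosen for this sketch; real parsers are usually driven by a much larger grammar.

```python
import re

def parse(source):
    """Parse an arithmetic expression into a nested-tuple AST.

    Hypothetical grammar (an assumption for this sketch):
        expr   := term (('+' | '-') term)*
        term   := factor (('*' | '/') factor)*
        factor := NUMBER | IDENTIFIER | '(' expr ')'
    """
    # Simplified lexing step: characters outside the grammar are skipped.
    tokens = re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()]", source)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expr():
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())  # operators become interior nodes
        return node

    def term():
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():
        token = advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token == "(":
            node = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if token.isdigit():
            return ("num", int(token))  # leaf: numeric literal
        if token[0].isalpha() or token[0] == "_":
            return ("var", token)       # leaf: identifier
        raise SyntaxError(f"unexpected token {token!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

With this grammar, `parse("1 + 2 * x")` produces `("+", ("num", 1), ("*", ("num", 2), ("var", "x")))`, so multiplication binds more tightly than addition, while an input such as `"(1 + 2"` raises a `SyntaxError`.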
Syntax analysis focuses on whether the program follows the correct structure, whereas semantic analysis focuses on whether the program makes logical sense in terms of variable usage, data types and program logic. The symbol table is important for this stage, since it holds the information about each identifier that these checks consult.
Checks during this stage include:
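To make the role of the symbol table concrete, the following sketch runs two plausible semantic checks, use before declaration and type mismatch on assignment, over a toy program. The statement format and the function name `check` are invented for this example and do not reflect how any specific compiler represents programs.

```python
def check(statements):
    """Return a list of semantic error messages for a tiny statement list.

    Each statement is a hypothetical tuple invented for this sketch:
        ("declare", name, type_name)
        ("assign",  name, type_of_assigned_value)
    """
    symbol_table = {}  # maps identifier name -> declared type
    errors = []
    for action, name, type_name in statements:
        if action == "declare":
            if name in symbol_table:
                errors.append(f"'{name}' is already declared")
            else:
                symbol_table[name] = type_name
        elif action == "assign":
            if name not in symbol_table:
                errors.append(f"'{name}' is used before declaration")
            elif symbol_table[name] != type_name:
                errors.append(
                    f"cannot assign {type_name} to '{name}' "
                    f"of type {symbol_table[name]}"
                )
    return errors
```

For instance, declaring `x` as `int` and then assigning a `string` to it is syntactically fine but is reported here as a type error, and assigning to an undeclared `y` is reported as a use before declaration.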
Intermediate code generation is the stage in compilation where the high-level source code is translated into an intermediate form, often bytecode, which is easier to optimize and convert into final machine code.
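One way to picture an intermediate form is a flat list of stack-machine instructions, loosely in the spirit of bytecode. The sketch below assumes an expression tree written as nested tuples and a made-up instruction set (`PUSH`, `LOAD`, `ADD`, `SUB`, `MUL`, `DIV`); neither is taken from a real compiler or virtual machine.

```python
def to_stack_code(ast):
    """Translate a nested-tuple expression AST into stack-machine instructions.

    Assumed AST shape (hypothetical): ("num", value), ("var", name),
    or (operator, left_subtree, right_subtree) with operator in + - * /.
    """
    opcodes = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}
    code = []

    def emit(node):
        kind = node[0]
        if kind == "num":
            code.append(("PUSH", node[1]))   # push a constant
        elif kind == "var":
            code.append(("LOAD", node[1]))   # push a variable's value
        else:
            emit(node[1])                    # operands first (post-order)...
            emit(node[2])
            code.append((opcodes[kind], None))  # ...then the operation

    emit(ast)
    return code
```

For the tree of `a + b * 2`, this produces `LOAD a`, `LOAD b`, `PUSH 2`, `MUL`, `ADD`: a linear sequence that no longer depends on the source syntax and is straightforward to analyze or translate further.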
Code optimization analyzes the intermediate code with the aim of improving its performance and efficiency. Insignificant or redundant parts of the code are identified and removed. Repeated sections of code may be grouped and replaced with a more efficient piece of code. The goal is to produce code that performs the same tasks with reduced resource consumption, such as fewer instructions or better memory usage, without altering the program's intended behaviour.
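Constant folding is one simple example of such an optimization: subexpressions made up entirely of constants are computed at compile time instead of at run time. The sketch below applies it to a nested-tuple expression tree for brevity; the tree format and the names `FOLDABLE` and `fold_constants` are assumptions for this example, and production compilers typically perform this on their own intermediate representation.

```python
import operator

# Operators folded in this sketch; division is left out to sidestep
# divide-by-zero and integer-versus-float questions.
FOLDABLE = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold_constants(node):
    """Return an equivalent AST with constant subexpressions pre-computed.

    Assumed AST shape (hypothetical): ("num", value), ("var", name),
    or (operator, left_subtree, right_subtree).
    """
    kind = node[0]
    if kind in ("num", "var"):
        return node
    left = fold_constants(node[1])
    right = fold_constants(node[2])
    if kind in FOLDABLE and left[0] == "num" and right[0] == "num":
        # Both operands are known at compile time: replace the node
        # with its computed value.
        return ("num", FOLDABLE[kind](left[1], right[1]))
    return (kind, left, right)
```

Applied to the tree for `x + 2 * 3`, the result is the tree for `x + 6`: the program computes the same value, but one multiplication has been removed from the code that runs.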
The intermediate code is translated into executable machine code or assembly code for the target architecture. Code generation takes into account the specifics of the processor, such as instruction sets, memory addressing modes, and optimization for performance, like efficient use of registers and reducing unnecessary instructions.
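The mapping from intermediate code to target instructions can be sketched as follows. The example assumes an intermediate form made of `(opcode, operand)` tuples for a stack machine and emits assembly-like text for a made-up register machine; the mnemonics, the register names `r0`, `r1`, ..., and the function `generate_assembly` are invented for illustration and do not correspond to a real instruction set.

```python
def generate_assembly(stack_code):
    """Translate stack-machine instructions into assembly-like text.

    Input format (hypothetical): a list of (opcode, operand) tuples using
    PUSH <constant>, LOAD <variable>, and ADD / SUB / MUL / DIV.
    The target is an imaginary register machine with registers r0, r1, ...
    """
    binary_ops = {"ADD", "SUB", "MUL", "DIV"}
    lines = []
    depth = 0  # how many values are currently held in registers
    for opcode, operand in stack_code:
        if opcode == "PUSH":
            lines.append(f"MOV r{depth}, #{operand}")
            depth += 1
        elif opcode == "LOAD":
            lines.append(f"LOAD r{depth}, {operand}")
            depth += 1
        elif opcode in binary_ops:
            if depth < 2:
                raise ValueError(f"{opcode} needs two operands")
            left, right = depth - 2, depth - 1
            # Store the result in the left operand's register, freeing the other.
            lines.append(f"{opcode} r{left}, r{left}, r{right}")
            depth -= 1
        else:
            raise ValueError(f"unknown opcode {opcode!r}")
    return lines
```

Here each stack slot is mapped to a register, so the sequence `LOAD a`, `LOAD b`, `PUSH 2`, `MUL`, `ADD` becomes five register instructions ending with the result in `r0`. A real code generator would additionally have to cope with a limited number of registers and the addressing modes of the actual processor.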