[Translation] A brief and vigorous overview of the architecture of compilers


Most compilers have the following architecture:



In this article I am going to take this architecture apart in detail, element by element.
Think of it as a companion to the huge amount of existing resources on compilers: a self-contained overview that will let you understand the basics of designing and implementing a programming language.

The target audience is people whose understanding of how compilers work is extremely limited (at most, they know that a compiler is what compiles their code). However, I do expect the reader to understand data structures and algorithms.

This article is by no means about modern production compilers with millions of lines of code - no, it is a short "compilers for dummies" course that helps you understand what a compiler is.

Introduction


I am currently working on Krug, a systems language inspired by Rust and Go. In this article I will refer to Krug to illustrate my points. Krug is under development, but is already available at https://github.com/krug-lang in the caasper and krug repositories. The language is not quite typical of the usual compiler architecture, which is partly what inspired me to write this article - but more on that later.

I hasten to say that I am by no means a compiler expert! I do not have a doctorate, and I have had no formal training - I taught myself everything described here in my free time. I should also say that I am not describing the one and only correct approach to building a compiler; rather, I present basic methods suitable for building a small "toy" compiler.

Frontend


Let's return to the diagram above: the arrows on the left pointing into the frontend are the well-known and well-loved languages like C. The frontend looks roughly like this: lexical analysis -> parsing.

Lexical Analysis


When I started learning about compilers and language design, I was told that lexical analysis is the same thing as tokenization, and that description will do for us. The lexer takes input in the form of a string or a stream of characters, recognizes patterns in it, and cuts it into tokens.

In a compiler, the input is the program you wrote: it is read from a file into a string, and the lexer tokenizes its source code.

  enum TokenType {
    Identifier,
    Number,
  };

  struct Token {
    std::string lexeme;
    TokenType type;
    // ...
    // It's also handy to store other things in here,
    // like the position of the token (start to end, row:col).
  };

In this fragment, written in a C-like language, you can see a structure that holds the lexeme itself, as well as the TokenType used to classify that lexeme.

Note: this article is not a step-by-step guide to building a language with complete examples, but I will insert code fragments from time to time for better understanding.

Lexers are usually the simplest component of a compiler. The whole frontend, in fact, is fairly simple compared to the rest of the puzzle pieces, although a lot depends on how much work you push into it.

Take the following piece of C code:

  int main() {
    printf("Hello world!\n");
    return 0;
  }

Having read it from a file into a string and performed a linear scan over it, you can slice it into tokens. We identify tokens naturally: we see that int is a "word" and that the 0 in the return statement is a "number". The lexer goes through the same procedure we just did - later we will look at this process in a bit more detail. Take numbers, for example:

  0xdeadbeef - HexNumber (hexadecimal number)
 1231234234 - WholeNumber (integer)
 3.1412 - FloatingNumber (floating point number)
 55.5555 - FloatingNumber (floating point number)
 0b0001 - BinaryNumber (binary number)  

Defining what counts as a "word" can be tricky. Most languages define a word as a sequence of letters and digits, and an identifier usually has to begin with a letter or an underscore. For example:

  123foobar := 3
  person-age := 5
  fmt.Println(123foobar)

In Go this code is not valid, and it would be lexed into the following tokens:

  Number(123), Identifier(foobar), Symbol(:=), Number(3) ...

Most common identifiers look like this:

  foo_bar
 __uint8_t
 fooBar123  

Lexers have to deal with other issues as well: whitespace, multi-line and single-line comments, identifiers, numbers, number bases and number formatting (for example, 1_000_000), and encodings (for example, supporting UTF-8 rather than just ASCII).

And if you think you can just reach for regular expressions - better not. It is much easier to write the lexer from scratch, and I highly recommend reading this article by our king and god Rob Pike. The reasons regex is a poor fit here are described in plenty of other articles, so I will skip them. Besides, writing a lexer is much more fun than agonizing over long, wordy expressions pasted into regex101.com at 5:24 am. In my first language I used a split(str) function to tokenize - and it did not get me very far.
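
To make that linear scan concrete, here is a minimal lexer sketch in Go. It only knows about three token kinds (identifiers, numbers, one-character symbols), which are my own simplification; it skips everything a real lexer would handle, such as comments, floats, string literals and source positions:

  package main

  import (
    "fmt"
    "unicode"
  )

  type TokenType int

  const (
    Identifier TokenType = iota
    Number
    Symbol
  )

  func (t TokenType) String() string {
    return [...]string{"Identifier", "Number", "Symbol"}[t]
  }

  type Token struct {
    Type   TokenType
    Lexeme string
  }

  // Lex performs a single linear scan over the input and cuts it into tokens.
  // Whitespace is skipped; anything that is not a letter or a digit becomes a
  // one-character Symbol token.
  func Lex(input string) []Token {
    var tokens []Token
    runes := []rune(input)
    for i := 0; i < len(runes); {
      switch r := runes[i]; {
      case unicode.IsSpace(r):
        i++
      case unicode.IsDigit(r):
        start := i
        for i < len(runes) && unicode.IsDigit(runes[i]) {
          i++
        }
        tokens = append(tokens, Token{Number, string(runes[start:i])})
      case unicode.IsLetter(r) || r == '_':
        start := i
        for i < len(runes) && (unicode.IsLetter(runes[i]) || unicode.IsDigit(runes[i]) || runes[i] == '_') {
          i++
        }
        tokens = append(tokens, Token{Identifier, string(runes[start:i])})
      default:
        tokens = append(tokens, Token{Symbol, string(r)})
        i++
      }
    }
    return tokens
  }

  func main() {
    for _, tok := range Lex("int main() { return 0; }") {
      fmt.Printf("%v(%s) ", tok.Type, tok.Lexeme)
    }
    fmt.Println()
  }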

Parsing


Parsing is somewhat more complicated than lexical analysis. There are many parsers and parser generators out there - this is where the game gets serious.

Parsers in compilers usually take the tokens as input and build some kind of tree - an abstract syntax tree or a parse tree. The two are similar in nature but have some differences.

These stages can be represented as functions:

  fn lex(string input) []Token { ... }
  fn parse(tokens []Token) AST { ... }

  let input = "int main() { return 0; }";
  let tokens = lex(input);
  let parse_tree = parse(tokens); // ...

Typically, compilers are built from many small components that take some input, change it, or turn it into different output. This is one of the reasons functional languages are well suited to building compilers; other reasons include excellent pattern matching and fairly extensive standard libraries. Fun fact: the first implementation of the Rust compiler was written in OCaml.

I advise keeping these components as simple and self-contained as possible - modularity will make the whole process much easier. In my opinion, the same goes for many other aspects of software development.

Trees


Parse tree


What the heck is that? A parse tree is a dense tree that represents the parsed source program. It contains all (or most of) the information about the input program, usually mirroring exactly what is described in the grammar of your language. Each node of the tree is either a leaf or a non-leaf node, for example NumberConstant or StringConstant.

Abstract syntax tree


As the name implies, an AST is an abstract syntax tree. The parse tree contains a lot of (often redundant) information about your program; the AST does not need it. The AST drops the useless information about structure and grammar that does not affect the semantics of the program.

Suppose you have the expression ((5 + 5) - 3) + 2. In a parse tree you would store it in full, along with the parentheses, the operators and the values 5, 5, 3 and 2. With an AST, however, we only need the values, the operators and their order - the parentheses are implied by the shape of the tree.

The picture below shows a tree for the expression a + b/c.


An AST can be represented as follows:

  interface Expression { ... };

  struct UnaryExpression {
    Expression value;
    char op;
  };

  struct BinaryExpression {
    Expression lhand, rhand;
    string op; // a string, because the operator can be more than one char.
  };

  interface Node { ... };

  // or, for something like a variable:
  struct Variable : Node {
    Token identifier;
    Expression value;
  };

This representation is quite limited, but I hope you can see how your nodes might be structured. For parsing, you could use a procedure along these lines:

  Node parseNode() {
    Token current = consume();
    switch (current.lexeme) {
    case "var":
      return parseVariableNode();
    // ...
    }
    panic("unrecognized input!");
  }

  Node n = parseNode();
  if (n != null) {
    // append to some list of top-level nodes?
    // or append to a block of nodes!
  }

I hope this conveys the gist of how the remaining nodes would be parsed step by step, starting from the high-level language constructs. Exactly how a recursive descent parser is implemented is something you will have to study on your own.

Grammar


Parsing a stream of tokens into an AST can be difficult. Usually you should start from the grammar of your language. In essence, a grammar defines the structure of your language. There are several languages for defining languages, some of which can describe (and even parse) themselves.

One example of a language for defining languages is Extended Backus-Naur Form (EBNF), a variation of BNF with fewer angle brackets. Here is an EBNF example from the Wikipedia article:

  digit excluding zero = "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
  digit                = "0" | digit excluding zero ;

It defines two production rules, which state what pattern of terminals makes up a "non-terminal". Terminals are part of the alphabet: the token if, or the 0 and 1 in the example above, are terminals. Non-terminals are the opposite; they stand on the left-hand side of production rules and can be thought of as variables, or "named pointers" to groups of terminals and non-terminals.

Many languages have specifications that include their grammar, for example Go, Rust and D.

Recursive descent parsers


Recursive descent is the simplest of the many approaches to parsing.

Recursive descent parsers are top-down parsers built from mutually recursive procedures. A parser is much easier to write this way if your grammar contains no left recursion, and for most "toy" languages this technique is perfectly sufficient. GCC uses a hand-written recursive descent parser, although it used YACC earlier.
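
As an illustration, here is a minimal recursive descent parser sketch in Go for the classic expression grammar, with one procedure per rule (the repetition in the grammar is what avoids left recursion). To keep it short it evaluates on the fly instead of building AST nodes, and it does no real error handling; both are simplifications of this sketch, not how a production parser would look:

  package main

  import (
    "fmt"
    "strconv"
    "strings"
  )

  // Grammar, one procedure per rule:
  //   expr   = term   { ("+" | "-") term } ;
  //   term   = factor { ("*" | "/") factor } ;
  //   factor = number | "(" expr ")" ;
  type parser struct {
    tokens []string
    pos    int
  }

  func (p *parser) peek() string {
    if p.pos < len(p.tokens) {
      return p.tokens[p.pos]
    }
    return ""
  }

  func (p *parser) next() string {
    tok := p.peek()
    p.pos++
    return tok
  }

  func (p *parser) expr() int {
    value := p.term()
    for p.peek() == "+" || p.peek() == "-" {
      if p.next() == "+" {
        value += p.term()
      } else {
        value -= p.term()
      }
    }
    return value
  }

  func (p *parser) term() int {
    value := p.factor()
    for p.peek() == "*" || p.peek() == "/" {
      if p.next() == "*" {
        value *= p.factor()
      } else {
        value /= p.factor()
      }
    }
    return value
  }

  func (p *parser) factor() int {
    if p.peek() == "(" {
      p.next() // consume "("
      value := p.expr()
      p.next() // consume ")"
      return value
    }
    n, _ := strconv.Atoi(p.next())
    return n
  }

  func main() {
    // ((5 + 5) - 3) + 2, pre-tokenized for brevity
    p := &parser{tokens: strings.Fields("( 5 + 5 ) - 3 + 2")}
    fmt.Println(p.expr()) // 9
  }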

However, some languages are problematic to parse this way, especially C, where

  foo * bar  

can be interpreted as

  int foo = 3;
  int bar = 4;
  foo * bar; // unused expression

or like

  typedef struct {
 int b;
 } foo;
 foo * bar;
 bar.b = 3;  

The Clang implementation also uses a hand-written recursive descent parser:

Since it is plain C++ code, recursive descent makes it easy for newcomers to understand. It supports the ad-hoc rules and other strange things required by C/C++, and it makes it straightforward to implement diagnostics and error recovery.

Also pay attention to other approaches:

  • top-down: LL, recursive descent
  • bottom-up: LR, shift-reduce, recursive ascent

Parser Generators


Another good option. It has its drawbacks, of course - but that can be said about any choice we make when building software.

Parser generators work very briskly. Using one is easier than writing your own parser and still gives a decent result, though they are not the most user-friendly tools and do not always produce good error messages. You will also have to learn how to use the generator, and when bootstrapping your compiler you will probably have to bootstrap the parser as well.

One example of a parser generator is ANTLR; there are many others.

I think these tools suit people who do not want to spend time writing a frontend and would rather get straight to the middle and backend of their compiler or interpreter, whatever it may be.

Uses of parsing


In case it was not obvious: a compiler frontend (lexing and parsing) can be used to solve plenty of other problems:

  • syntax highlighting
  • HTML/CSS parsing in rendering engines
  • transpilers: TypeScript, CoffeeScript
  • linters
  • regular expressions
  • parsing data interchange formats
  • URL parsing
  • formatting tools like gofmt
  • SQL parsing, and much more.

Middle


Semantic analysis! Analyzing the semantics of a language is one of the hardest tasks in writing a compiler.

You need to make sure that all input programs are correct. Semantic analysis is not yet part of my Krug language, and without it the programmer would have to write correct code every single time. In reality that is impossible - we write, compile, sometimes run, fix errors, and go around that loop endlessly.

Besides, it is impossible to compile some programs at all without analyzing their semantics at the appropriate stage of compilation.

I once came across a chart showing how much of a compiler the frontend, the middle and the backend make up. Back then it looked something like this:

  F: 20%  M: 20%  B: 60%

Today it looks more like this:

  F: 5%  M: 60%  B: 35%

The frontend can largely be taken care of by a generator, and for context-free languages with no grammar ambiguities it can be put together quite quickly - recursive descent helps here too.

With LLVM, most of the optimization work can be offloaded onto the framework, which ships with a number of ready-made optimizations.

That leaves semantic analysis as the most important part of the compilation phase.

In Rust, for example, with its memory management model, the compiler acts as a big, powerful machine performing various kinds of static analysis on its input representations. Part of that job is converting the input into a form that is more convenient to analyze.

For this reason semantic analysis plays a major role in the compiler architecture, while the gruelling preparatory work - such as optimizing the generated assembly or reading the input into an AST - is largely done for you.

Semantic passes


During semantic analysis, most compilers perform a large number of "semantic passes" over the AST, or over some other abstract representation of the code. This article describes most of the passes made by the .NET C# compiler.

I will not go through every pass, especially since they differ from language to language, but below are a few of the passes in Krug.

Top-level declarations


The compiler goes through all of the "top-level" declarations in the modules and records their existence. It does not descend into blocks - it simply notes which structures, functions and so on are available in each module.

Name/Symbol Resolution


The compiler runs through all blocks of code in functions and so on and resolves them - that is, it resolves the symbols that need resolving. This is a common pass, and it is usually where an error like "No such symbol XYZ" comes from when compiling Go code.

Implementing this pass can be quite hard, especially if there are cycles in your dependency graph. Some languages forbid them: Go, for example, will report an error if one of your packages forms an import cycle, and so does my Krug language. Cyclic dependencies can be considered a side effect of bad architecture.

Cycles can be detected with a modified DFS over the dependency graph, or with Tarjan's algorithm (as is done in Krug), which also identifies the (multiple) cycles themselves.
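
Here is a rough sketch of the DFS variant in Go, a three-colour depth-first search over a made-up module graph. It only answers "is there a cycle?"; Tarjan's algorithm would additionally recover the strongly connected components, i.e. the cycles themselves:

  package main

  import "fmt"

  // deps maps a module to the modules it imports. A cycle in this graph
  // means the packages depend on each other, directly or indirectly.
  func hasCycle(deps map[string][]string) bool {
    const (
      unvisited = iota
      inProgress
      done
    )
    state := map[string]int{}

    var visit func(node string) bool
    visit = func(node string) bool {
      switch state[node] {
      case inProgress:
        return true // back edge: we came back to a node on the current path
      case done:
        return false
      }
      state[node] = inProgress
      for _, dep := range deps[node] {
        if visit(dep) {
          return true
        }
      }
      state[node] = done
      return false
    }

    for node := range deps {
      if visit(node) {
        return true
      }
    }
    return false
  }

  func main() {
    fmt.Println(hasCycle(map[string][]string{
      "main": {"net"}, "net": {"fmt"}, "fmt": {},
    })) // false
    fmt.Println(hasCycle(map[string][]string{
      "a": {"b"}, "b": {"a"},
    })) // true
  }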

Type Inference


The compiler goes through all the variables and infers their types. Type inference in Krug is very weak: it simply infers a variable's type from its value. It is in no way a fancy system like those found in functional languages such as Haskell.

Types can be inferred using a process called "unification", or "type unification". For simpler type systems a very simple implementation will do.

Types are implemented in Krug like this:

  interface Type {};

 struct IntegerType: Type {
  int width;
  bool signed;
 };

 struct FloatingType: Type {
  int width;
 };

 struct ArrayType: Type {
  Type base_type;
  uint64 length;
 };  

You might also have very simple type inference in which you assign types to expression nodes - for example, an IntegerConstantNode gets the type IntegerType(64) - and then a unify(t1, t2) function that picks the widest type, which you can use to infer the types of more complex expressions, say binary ones. Then it is simply a matter of giving the variable on the left the type of the value on the right.
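
A minimal sketch of such a unify function in Go, assuming only the integer and floating types from the snippet above (a real unifier would also handle arrays, named types, mismatches and errors):

  package main

  import "fmt"

  type Type interface{ String() string }

  type IntegerType struct {
    Width  int
    Signed bool
  }

  func (t IntegerType) String() string { return fmt.Sprintf("int%d", t.Width) }

  type FloatingType struct{ Width int }

  func (t FloatingType) String() string { return fmt.Sprintf("float%d", t.Width) }

  func maxInt(a, b int) int {
    if a > b {
      return a
    }
    return b
  }

  // unify picks a type that both sides of an expression fit into:
  // floats win over integers, and the wider of the two widths wins.
  func unify(a, b Type) Type {
    af, aIsFloat := a.(FloatingType)
    bf, bIsFloat := b.(FloatingType)
    switch {
    case aIsFloat && bIsFloat:
      return FloatingType{Width: maxInt(af.Width, bf.Width)}
    case aIsFloat:
      return af
    case bIsFloat:
      return bf
    default:
      ai, bi := a.(IntegerType), b.(IntegerType)
      return IntegerType{Width: maxInt(ai.Width, bi.Width), Signed: ai.Signed || bi.Signed}
    }
  }

  func main() {
    // e.g. inferring the type of (some_int32 + some_float64)
    fmt.Println(unify(IntegerType{Width: 32, Signed: true}, FloatingType{Width: 64})) // float64
  }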

I once wrote a simple type inference pass in Go, which became the prototype for the Krug implementation.

Mutability Pass


Krug (like Rust) is immutable by default, that is, variables cannot be changed unless you say otherwise:

  let x = 3;
  x = 4; // BAD!

  mut y = 5;
  y = 6; // OK!

The compiler runs through all the blocks and functions and checks that mutability is respected - that is, that we do not modify what we should not, and that all variables passed into functions are constant or mutable where required.

This is done using symbol information collected in earlier passes. The symbol table built up during the semantic passes contains the names of symbols and mutability flags for variables. It can contain other data too: in C++, for example, the table can store whether a symbol is extern or static.

Symbol tables


A symbol table, or "stab", is a table used to look up the symbols referenced in your program. One table is created per scope, and each one holds information about the symbols present in that particular scope.

This information includes properties such as the symbol's name, its type, a mutability flag, whether it has external linkage, whether it lives in static memory, and so on.

Scope


Scope is an important concept in programming languages. Of course, your language does not have to allow nested scopes; everything could live in one big namespace!

Representing scope is an interesting problem for the compiler architecture: in most C-like languages it behaves like (or simply is) a stack.

We usually push and pop scopes, and they are usually used for name management - in particular, they let us shadow variables:

  { // push scope
    let x = 3;
    { // push scope
      let x = 4; // OK!
    } // pop scope
  } // pop scope

It can also be represented like this:

  struct Scope {
    Scope *outer;
    SymbolTable symbols;
  };
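
To make that concrete, here is a small Go sketch of that linked-scope structure with symbol lookup. The Symbol fields are a made-up subset of what was listed above; walking the outer pointers is what makes shadowing (and "no such symbol" errors) work:

  package main

  import "fmt"

  // Symbol is one row of the symbol table: a name, a type name and a
  // mutability flag, trimmed down for the sketch.
  type Symbol struct {
    Name    string
    Type    string
    Mutable bool
  }

  // Scope mirrors the struct above: a table of symbols plus a pointer to the
  // enclosing ("outer") scope.
  type Scope struct {
    Outer   *Scope
    Symbols map[string]Symbol
  }

  func push(outer *Scope) *Scope {
    return &Scope{Outer: outer, Symbols: map[string]Symbol{}}
  }

  // lookup walks outwards until it finds the symbol, so the innermost
  // definition always wins.
  func (s *Scope) lookup(name string) (Symbol, bool) {
    for cur := s; cur != nil; cur = cur.Outer {
      if sym, ok := cur.Symbols[name]; ok {
        return sym, true
      }
    }
    return Symbol{}, false
  }

  func main() {
    global := push(nil)
    global.Symbols["x"] = Symbol{"x", "int", false} // let x = 3

    inner := push(global)
    inner.Symbols["x"] = Symbol{"x", "int", false} // shadows the outer x

    sym, _ := inner.lookup("x")
    fmt.Println(sym) // the inner x

    _, found := inner.lookup("y")
    fmt.Println(found) // false -> "no such symbol y"
  }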

A small aside, but I recommend reading about the spaghetti stack. It is a data structure used to attach scopes to the AST nodes of their corresponding blocks.

Type Systems


Many of the sections here could be turned into articles of their own, but this heading probably deserves it most. A great deal of information is available today about type systems and their many varieties, and plenty of lances are broken over them. I will not dive deep into the topic and will simply leave a link to an excellent article by Steve Klabnik.

The type system is what the compiler specifies and enforces semantically, using the compiler's internal representations and the analysis performed on them.

Ownership


This concept shows up in programming more and more. The principles of ownership and move semantics are built into the Rust language, and I hope they will appear in other languages too. Rust performs many kinds of static analysis that check whether the input satisfies a set of rules about memory: who owns which memory, when that memory is destroyed, and how many references (or borrows) to those values or that memory exist.

The beauty of Rust is that all of this is done at compile time, inside the compiler, so the programmer does not need garbage collection or reference counting. All of these semantics belong to the type system and can be enforced before the program ever takes the form of a complete binary.

I cannot tell you how it all works under the hood, but it is the result of static analysis and remarkable research by the Mozilla team and by the people behind the Cyclone project.

Control Flow Graphs


To represent program flow we use control flow graphs (CFGs), which contain all the paths the program can take during execution. They are used in semantic analysis to eliminate dead sections of code, that is, blocks, functions and even modules that can never be reached during execution. Graphs can also be used to detect loops that cannot be exited, or to find unreachable code, for example statements after a panic, or after a return inside a loop where the code past the loop never runs. Data flow analysis plays an important role in the semantic phase of the compiler, so I recommend reading about the kinds of analysis you can perform, how they work, and what optimizations they enable.
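
A minimal sketch of the idea in Go: a made-up CFG of basic blocks, with a depth-first walk from the entry block. Any block the walk never reaches can be reported as dead code:

  package main

  import "fmt"

  // BasicBlock is a straight-line run of code plus the blocks control can
  // jump to next (its successors in the CFG).
  type BasicBlock struct {
    Name       string
    Successors []*BasicBlock
  }

  // reachable walks the graph from the entry block; anything never visited
  // can be reported as unreachable code.
  func reachable(entry *BasicBlock) map[string]bool {
    seen := map[string]bool{}
    var walk func(b *BasicBlock)
    walk = func(b *BasicBlock) {
      if seen[b.Name] {
        return
      }
      seen[b.Name] = true
      for _, succ := range b.Successors {
        walk(succ)
      }
    }
    walk(entry)
    return seen
  }

  func main() {
    // entry -> then/else -> exit, plus a block nothing ever jumps to.
    exit := &BasicBlock{Name: "exit"}
    then := &BasicBlock{Name: "then", Successors: []*BasicBlock{exit}}
    els := &BasicBlock{Name: "else", Successors: []*BasicBlock{exit}}
    entry := &BasicBlock{Name: "entry", Successors: []*BasicBlock{then, els}}
    dead := &BasicBlock{Name: "after_return"}

    seen := reachable(entry)
    for _, b := range []*BasicBlock{entry, then, els, exit, dead} {
      if !seen[b.Name] {
        fmt.Println("unreachable block:", b.Name)
      }
    }
  }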

Backend



The final part of our architecture diagram.

Most of the work that remains is generating an executable binary. This can be done in several ways, which we will discuss below.

Thanks to the information already gathered in the tree, the semantic analysis phase should not need significant changes at this point - and it is probably best not to change it at all, to avoid spaghetti.

A few words about transpilers


A transpiler is a kind of compiler that converts code in one language into source code in another. For example, you can write a language that compiles to C source code. In my opinion this is rather pointless unless your language differs significantly from the one it compiles to. Typically, transpiling makes sense for relatively high-level languages, or for languages that lack certain capabilities.

That said, compiling to C code is very common in the history of compilers. In fact the first C++ compiler, Cfront, transpiled to C.

A good example is JavaScript: TypeScript and many other languages transpile into it in order to add features and, more importantly, a sensible type system with various kinds of static analysis, catching bugs and errors before we run into them during execution.

This is one possible compilation "target", and usually the simplest, because you do not have to think about lower-level concerns such as how variables get assigned to registers or how to optimize the output - you simply piggyback on another language. The obvious drawbacks are the large overhead and the fact that you are usually limited by the capabilities of the language you are translating into.

LLVM


Many modern compilers use LLVM as their backend: Rust, Swift, C/C++ (Clang), D, Haskell.

This can be considered the "easy path", because most of the work of supporting a wide range of architectures has been done for you, and high-quality optimizations are available out of the box. Compared with the transpilation described above, LLVM gives you far more control - certainly more than compiling to C. For example, you can decide exactly how wide your types should be, say 1, 4, 8 or 16 bits. In C that is not so easy, sometimes impossible, and on some platforms it cannot even be determined.

Assembly code generation


Generating code for a specific architecture - that is, generating assembly or machine code yourself - is technically the most traditional approach, used by many programming languages.

Go is an example of a modern language that does not take advantage of the LLVM framework (at the time of writing). Go generates code for several platforms, including Windows, Linux and macOS. Funnily enough, an earlier Krug prototype also generated assembly directly.

This approach has many advantages and disadvantages. However, today, with technologies like LLVM available, it rarely makes sense to generate assembly yourself: a toy compiler with its own backend is unlikely to beat LLVM at optimizing for a single platform, let alone several.

One significant advantage of generating assembly yourself, however, is that your compiler will probably be much faster than one built on a framework like LLVM, which first has to build its IR, optimize it, and so on, before it can finally emit assembly (or whatever output you choose).

But it is still worth a try - and it becomes really interesting if you want to learn more about assembly programming, or about how programming languages work at the lower levels. The easiest way is to walk the AST (or the generated IR, if you have one) and "spit out" assembly instructions into a file using fprintf or some other utility. This is how 8cc works.
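
Roughly like this (a Go sketch, not how 8cc is actually structured): walk a tiny made-up expression AST and print x86-64-style instructions, keeping every intermediate result on the machine stack so we never have to think about registers. The instruction sequence is illustrative only:

  package main

  import (
    "fmt"
    "io"
    "os"
  )

  // Expr is a tiny AST: either a constant or a binary "+".
  type Expr struct {
    Op          string // "const" or "+"
    Value       int
    Left, Right *Expr
  }

  // emit walks the tree and prints AT&T-style x86-64 instructions,
  // pushing every intermediate result onto the stack.
  func emit(w io.Writer, e *Expr) {
    if e.Op == "const" {
      fmt.Fprintf(w, "  pushq $%d\n", e.Value)
      return
    }
    emit(w, e.Left)
    emit(w, e.Right)
    fmt.Fprintf(w, "  popq %%rbx\n")
    fmt.Fprintf(w, "  popq %%rax\n")
    fmt.Fprintf(w, "  addq %%rbx, %%rax\n")
    fmt.Fprintf(w, "  pushq %%rax\n")
  }

  func main() {
    // (5 + 5) + 2
    tree := &Expr{
      Op:    "+",
      Left:  &Expr{Op: "+", Left: &Expr{Op: "const", Value: 5}, Right: &Expr{Op: "const", Value: 5}},
      Right: &Expr{Op: "const", Value: 2},
    }
    emit(os.Stdout, tree)
  }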

Bytecode generation


You can also generate bytecode for some kind of virtual machine or bytecode interpreter. A vivid example is Java: the JVM has in fact given rise to a whole family of languages that generate bytecode for it, such as Kotlin.

Bytecode generation has many advantages; for Java, portability was the main one. If your virtual machine can run anywhere, then any code running on it works anywhere too. Besides, it is much easier to target one abstract set of bytecode instructions than to generate code for umpteen different computer architectures.
As far as I know, the JVM's JIT turns frequently used code into native functions, and uses other JIT tricks to squeeze out even more performance.
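
Here is what the idea looks like at its smallest: a made-up stack-machine instruction set and a loop that interprets it, sketched in Go. Real bytecode formats (the JVM's, for instance) are far richer, with constant pools, locals, jumps and so on:

  package main

  import "fmt"

  // A tiny made-up stack machine: each instruction is an opcode plus an optional operand.
  const (
    OpPush  = iota // push the operand onto the stack
    OpAdd          // pop two values, push their sum
    OpMul          // pop two values, push their product
    OpPrint        // pop one value and print it
  )

  type Instr struct {
    Op  int
    Arg int
  }

  func run(program []Instr) {
    var stack []int
    pop := func() int {
      v := stack[len(stack)-1]
      stack = stack[:len(stack)-1]
      return v
    }
    for _, ins := range program {
      switch ins.Op {
      case OpPush:
        stack = append(stack, ins.Arg)
      case OpAdd:
        stack = append(stack, pop()+pop())
      case OpMul:
        stack = append(stack, pop()*pop())
      case OpPrint:
        fmt.Println(pop())
      }
    }
  }

  func main() {
    // bytecode for: print((5 + 5) * 2)
    run([]Instr{
      {OpPush, 5}, {OpPush, 5}, {OpAdd, 0},
      {OpPush, 2}, {OpMul, 0},
      {OpPrint, 0},
    })
  }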

Optimization


Optimizations are an integral part of a compiler - nobody wants slow code! They typically make up the bulk of the backend, and developers put a lot of effort into squeezing out performance. If you ever compile some C code and run it with all optimizations enabled, you will be surprised at the madness the compiler gets up to. Godbolt is a great tool for understanding how modern compilers generate code and which instructions correspond to which source code; you can also pick the optimization level, the target, the compiler version and so on.

If you ever write a compiler yourself, a good start is to create a simple C program, turn off all optimizations, strip the debug symbols, and see what GCC generates. You can then use it as a reference if you ever get stuck.

When tuning optimizations you are trading off between precision and program speed, and finding the right balance is not easy. Some optimizations are very specific and in some cases can lead to incorrect results; for obvious reasons those are not enabled in production compilers.

In the comments on this article on another site, user rwmj pointed out that just 8 optimization passes are enough to get about 80% of the best-case performance - and all of them were described back in 1971! This is according to Graydon Hoare, the mastermind behind Rust.

IR


An intermediate representation (IR) is not strictly necessary, but it is useful. You could generate code straight from the AST, but that tends to be tedious and messy, and the result is hard to optimize.

You can think of the IR as a higher-level representation of the code being generated. It should accurately reflect what it represents and contain all the information needed to generate the final code.

There are specific kinds of IR, or "forms" it can be put into, that make optimization easier. For example, SSA - Static Single Assignment - is a form in which every variable is assigned exactly once.

Go builds an SSA-based IR before generating code, and LLVM's IR is likewise SSA-based, which is what enables its optimizations.

Out of the box, SSA enables several optimizations, for example constant propagation, dead code elimination and (very importantly) register allocation.
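
A toy illustration in Go of why SSA makes this easy: since every name is defined exactly once, a pass can fold "t2 = t0 + t1" into a constant the moment both operands are known constants. The instruction format here is made up for the sketch:

  package main

  import "fmt"

  // In SSA form every name is assigned exactly once, so an instruction like
  // "t2 = t0 + t1" can be folded as soon as t0 and t1 are known constants.
  type Instr struct {
    Dest string
    Op   string // "const" or "+"
    A, B string // operand names (unused for "const")
    Val  int    // constant value (only for "const")
  }

  func constantFold(prog []Instr) []Instr {
    known := map[string]int{}
    out := make([]Instr, 0, len(prog))
    for _, ins := range prog {
      if ins.Op == "+" {
        a, okA := known[ins.A]
        b, okB := known[ins.B]
        if okA && okB {
          // Replace the addition with a constant definition.
          ins = Instr{Dest: ins.Dest, Op: "const", Val: a + b}
        }
      }
      if ins.Op == "const" {
        known[ins.Dest] = ins.Val
      }
      out = append(out, ins)
    }
    return out
  }

  func main() {
    // t0 = 5; t1 = 5; t2 = t0 + t1; t3 = t2 + x  (x is unknown)
    prog := []Instr{
      {Dest: "t0", Op: "const", Val: 5},
      {Dest: "t1", Op: "const", Val: 5},
      {Dest: "t2", Op: "+", A: "t0", B: "t1"},
      {Dest: "t3", Op: "+", A: "t2", B: "x"},
    }
    for _, ins := range constantFold(prog) {
      fmt.Printf("%+v\n", ins) // t2 becomes a constant 10; t3 stays as is
    }
  }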

Register allocation


Register allocation is not a requirement for code generation but an optimization. One abstraction we take for granted is that we can define as many variables as our programs need. In assembly, however, there is a finite number of registers (usually 16 to 32) to keep track of, or we can fall back on the stack (spill to the stack).

Register allocation is the attempt to assign a particular register to a particular variable at a particular point in time, without overwriting other live values. This is much more efficient than going through the stack, although it comes with its own costs, and the computer cannot always compute the ideal assignment.

There are several register allocation algorithms:

  • Graph coloring - computationally expensive (the underlying problem is NP-complete); the code is represented as a graph in order to compute the live ranges of the variables.
  • Linear scan - scans the variables and determines their live ranges, handing out registers in a single pass (a rough sketch follows below).
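
Here is a heavily simplified linear scan sketch in Go, assuming the live intervals have already been computed by an earlier liveness pass. Anything that does not fit into a register is marked as spilled to the stack; a real allocator would also choose which interval to spill rather than always spilling the newcomer:

  package main

  import (
    "fmt"
    "sort"
  )

  // Interval is the live range of one variable: the first and last point in
  // the program where its value is needed.
  type Interval struct {
    Name       string
    Start, End int
  }

  // linearScan hands out numRegs registers; whatever does not fit is spilled.
  func linearScan(intervals []Interval, numRegs int) map[string]string {
    sort.Slice(intervals, func(i, j int) bool { return intervals[i].Start < intervals[j].Start })

    assignment := map[string]string{}
    free := []string{}
    for r := 0; r < numRegs; r++ {
      free = append(free, fmt.Sprintf("r%d", r))
    }

    type active struct {
      end int
      reg string
    }
    var live []active

    for _, iv := range intervals {
      // Expire intervals that ended before this one starts; reclaim their registers.
      kept := live[:0]
      for _, a := range live {
        if a.end < iv.Start {
          free = append(free, a.reg)
        } else {
          kept = append(kept, a)
        }
      }
      live = kept

      if len(free) == 0 {
        assignment[iv.Name] = "stack" // spill
        continue
      }
      reg := free[0]
      free = free[1:]
      assignment[iv.Name] = reg
      live = append(live, active{end: iv.End, reg: reg})
    }
    return assignment
  }

  func main() {
    fmt.Println(linearScan([]Interval{
      {"a", 0, 4}, {"b", 1, 3}, {"c", 2, 6}, {"d", 5, 7},
    }, 2)) // map[a:r0 b:r1 c:stack d:r0]
  }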

Things to remember


A great deal has been written about compilers - far more than will fit into any single article. I want to remind you of, or at least mention, a few important things to keep in mind for your future projects.

Name Mangling


If you generate assembly, which effectively has no scopes or namespaces, symbol clashes will be common - especially if your language supports overloading of functions, classes and the like.

  fn main() int {
    let x = 0;
    {
      let x = 0;
      {
        let x = 0;
      }
    }
    return 0;
  }

In this code, for example (assuming the variables are not optimized away :) ), you would have to mangle the names of these symbols so that they do not clash in the assembly. Mangled names are also usually used to encode type information, and they may carry namespace information as well.
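
A toy mangling scheme sketched in Go. The prefix and the format are entirely made up; real schemes such as the Itanium C++ ABI encode much more, but the idea of baking the module path and the parameter types into the symbol name is the same:

  package main

  import (
    "fmt"
    "strings"
  )

  // mangle builds a unique assembler-level symbol from the module path, the
  // name and a type signature, so two functions with the same name in
  // different modules (or two overloads) cannot collide.
  func mangle(module []string, name string, paramTypes []string) string {
    var b strings.Builder
    b.WriteString("_K") // a made-up prefix for this sketch
    for _, part := range append(module, name) {
      fmt.Fprintf(&b, "%d%s", len(part), part)
    }
    for _, t := range paramTypes {
      fmt.Fprintf(&b, "_%s", t)
    }
    return b.String()
  }

  func main() {
    fmt.Println(mangle([]string{"foo", "bar"}, "baz", []string{"i32", "i32"}))
    // _K3foo3bar3baz_i32_i32
  }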

Debug Information


Tools like LLDB typically rely on standards like DWARF. One of the great features of LLVM is that, via DWARF, you get relatively easy integration with the existing GNU debugging tools. Your language will probably need debugging support, and it is always easier to use what already exists than to write your own.

Foreign Function Interface (FFI)


libc is usually not going anywhere, so you should read up on this library and think about how to integrate it into your language. How do you call into C code, and how will you expose your own code to C?

Linker


Writing a linker is a task of its own. When your compiler generates code, does it emit machine instructions (a .s/.asm file)? Does it write directly into an object file? In the Jai programming language, for example, all the code is supposedly written into a single object file. There are various options, each with its own trade-offs.

Compiler as a Service (CaaS)


In this approach, all of the compiler phases described above are exposed as API routes. A text editor, for example, can ask the Krug server to tokenize a file and get the tokens back in the response. All the static analysis routes are exposed as well, which makes it much easier to build tooling.

Of course, this approach has drawbacks, such as the latency of sending and receiving files. In addition, many aspects of the compiler architecture have to be rethought to work in the context of API routes.
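
To give a feel for the idea, here is a hypothetical "/lex" route sketched in Go. The route name, the JSON shape and the stub lexer are all made up and are not the actual Caasper API: an editor POSTs source code and gets a token stream back as JSON.

  package main

  import (
    "encoding/json"
    "io"
    "log"
    "net/http"
  )

  type Token struct {
    Type   string `json:"type"`
    Lexeme string `json:"lexeme"`
  }

  // lex is a stand-in for a real lexer (see the lexer sketch earlier).
  func lex(src string) []Token {
    return []Token{{Type: "Identifier", Lexeme: src}} // placeholder
  }

  func main() {
    // Hypothetical route: POST source code to /lex, receive the tokens as JSON.
    http.HandleFunc("/lex", func(w http.ResponseWriter, r *http.Request) {
      src, err := io.ReadAll(r.Body)
      if err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
      }
      w.Header().Set("Content-Type", "application/json")
      json.NewEncoder(w).Encode(lex(string(src)))
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
  }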

Few production compilers take the CaaS approach. Microsoft's Roslyn comes to mind, although I do not know much about that compiler, so look into it yourself. I could be wrong, but it seems that many languages get something similar by having their authors write API routes that hook into existing compilers - Rust, for example, has the RLS.

My Krug language - which is still under active development and unstable - uses a CaaS-style compiler, Caasper.

Caasper runs locally on your machine (or, if you like, on a server), and frontends or clients such as krug talk to this service. The advantage is that there can be many frontend implementations, and a single frontend can be bootstrapped in the language itself rather than rewriting the entire compiler.

The frontend for Krug is implemented in JavaScript, although there will be alternative implementations in Go*, and also, I hope, in Krug itself. JavaScript was chosen for its accessibility: it can be installed via the very popular package managers yarn/npm.

* Initially, the frontend was written in Go and was (expectedly) much faster than the JS version.

The source code of the Caasper compiler is here. On my personal GitHub there is a prototype of Krug written in D that compiles via LLVM. You can also watch a demo on my YouTube channel.

The (interim) Krug guide is here.

Useful Links

