How processors are designed and manufactured: the basics of computer architecture

We think of the central processor as the “brain” of a computer, but what does that really mean? What exactly happens inside the billions of transistors that make a computer work? In this new four-part mini-series, we will look at how computer hardware architecture is created and describe the principles behind how it works.

In this series, we will talk about computer architecture, processor circuit design, VLSI (very-large-scale integration), chip manufacturing, and future trends in computing. If you have ever wanted to understand the details of how processors work, this series of articles is a good place to start.

We will start with a very high-level explanation of what a processor does and how its building blocks connect into a functioning whole. In particular, we will cover processor cores, the memory hierarchy, branch prediction, and more. First, we need a simple definition of what a CPU does. The simplest explanation: a processor follows a set of instructions to perform some operation on a set of input data. For example, it might read a value from memory, add it to another value, and finally store the result in memory at a different address. It might also be something more complex, such as dividing two numbers if the result of a previous calculation is greater than zero.

Programs, such as an operating system or a game, are themselves sequences of instructions for the CPU to execute. These instructions are loaded from memory, and in a simple processor they are executed one by one until the program completes. Software developers write programs in high-level languages such as C++ or Python, but the processor cannot understand them. It understands only ones and zeros, so we need some way to represent code in this format.


Programs are compiled into a set of low-level instructions called assembly language, which is part of the Instruction Set Architecture (ISA). This is the set of commands that the CPU must understand and execute. Some of the most common ISAs are x86, MIPS, ARM, RISC-V, and PowerPC. Just as the syntax for writing a function in C++ differs from that of a function doing the same thing in Python, each ISA has its own syntax.

These ISAs can be divided into two main categories: fixed-length and variable-length. The RISC-V ISA uses fixed-length instructions, meaning that a predetermined set of bits in each instruction determines what type of instruction it is. With x86 everything is different: it uses variable-length instructions, which can be encoded in different ways, with different numbers of bits for different parts. Because of this complexity, the instruction decoder in an x86 processor is usually the most complex part of the entire device.

Fixed-length instructions allow simple decoding thanks to their constant structure, but they limit the total number of instructions an ISA can support. While the popular versions of the RISC-V architecture have about 100 instructions, all of them open source, the x86 architecture is proprietary, and nobody knows how many instructions there really are. It is generally believed that there are several thousand x86 instructions, but nobody publishes the exact number. Despite the differences between ISAs, they all carry essentially the same basic functionality.


An example of some RISC-V instructions. The opcode on the right is 7 bits long and determines the type of instruction. In addition, each instruction contains bits defining the registers used and the operations performed. This is how assembly instructions are broken down into the binary code the processor understands.
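
To make the fixed-length format concrete, here is a minimal sketch in Python of decoding a 32-bit R-type RISC-V instruction word. The field positions come from the RISC-V base specification; the example word 0x002081B3 is the standard encoding of add x3, x1, x2.

```python
# Minimal sketch: decoding a fixed-length 32-bit RISC-V R-type instruction.
# Because every field sits at a known bit position, decoding is just
# shifting and masking.

def decode_r_type(word: int) -> dict:
    return {
        "opcode": word & 0x7F,          # bits [6:0] select the instruction type
        "rd":     (word >> 7)  & 0x1F,  # destination register
        "funct3": (word >> 12) & 0x7,   # sub-operation selector
        "rs1":    (word >> 15) & 0x1F,  # first source register
        "rs2":    (word >> 20) & 0x1F,  # second source register
        "funct7": (word >> 25) & 0x7F,  # further distinguishes operations
    }

print(decode_r_type(0x002081B3))
# {'opcode': 51, 'rd': 3, 'funct3': 0, 'rs1': 1, 'rs2': 2, 'funct7': 0}
# opcode 0b0110011 (51) marks an R-type ALU op; rd=x3, rs1=x1, rs2=x2,
# and funct3=funct7=0 selects addition: add x3, x1, x2.
```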

Now we are ready to turn on the computer and start executing programs. Executing an instruction consists of several basic parts, which the processor breaks down into many smaller stages.

The first step is to fetch the instruction from memory into the processor to begin execution. In the second step, the instruction is decoded so the CPU can figure out what type of instruction it is. There are many types, including arithmetic instructions, branch instructions, and memory instructions. Once the CPU knows what type of instruction it is executing, the operands for the instruction are gathered from memory or from the CPU's internal registers. If you want to add number A to number B, you cannot perform the addition until you know the values of A and B. Most modern processors are 64-bit, meaning each data value is 64 bits in size.


64 bits is the width of the processor's registers, data paths, and/or memory addresses. For ordinary users, this determines how much information a computer can handle at one time, and it is best understood in comparison with the architecture's younger relative, the 32-bit processor. A 64-bit architecture can process twice as many bits of information at once (64 bits versus 32).
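
As a quick illustration of what register width buys you, here is the number of distinct values (and therefore distinct memory addresses) that 32 and 64 bits can represent:

```python
# Range of values a 32-bit vs. a 64-bit register can hold. The 32-bit
# limit is why older systems could address at most about 4 GiB of memory.
print(2**32)  # 4294967296 (roughly 4.3 billion values, ~4 GiB of addresses)
print(2**64)  # 18446744073709551616 (roughly 1.8e19 values)
```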

Having received the operands for the instruction, the processor passes them to the execute stage, where the operation is performed on the input data. This might be adding the numbers, performing logical manipulations on them, or simply passing the numbers through unchanged. After the result is computed, a memory access may be needed to store it, or the processor may simply keep the value in one of its internal registers. Once the result is saved, the CPU updates the state of its various elements and moves on to the next instruction.
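
Putting the stages together, here is a minimal sketch of that fetch-decode-execute cycle. The three-instruction ISA (LOAD, ADD, STORE), the addresses, and the values are all made up purely for illustration:

```python
# Toy model of the fetch-decode-execute cycle described above.

memory = {0x10: 5, 0x14: 7}    # hypothetical data memory
registers = [0] * 3            # tiny register file: r0, r1, r2
program = [
    ("LOAD",  0, 0x10),        # r0 <- mem[0x10]
    ("LOAD",  1, 0x14),        # r1 <- mem[0x14]
    ("ADD",   2, 0, 1),        # r2 <- r0 + r1
    ("STORE", 2, 0x18),        # mem[0x18] <- r2
]

pc = 0
while pc < len(program):
    instr = program[pc]        # 1. fetch the instruction
    op = instr[0]              # 2. decode its type
    if op == "LOAD":           # 3. gather operands and execute
        registers[instr[1]] = memory[instr[2]]
    elif op == "ADD":
        registers[instr[1]] = registers[instr[2]] + registers[instr[3]]
    elif op == "STORE":        # 4. store the result
        memory[instr[2]] = registers[instr[1]]
    pc += 1                    # 5. move on to the next instruction

print(memory[0x18])  # 12
```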

This explanation is, of course, greatly simplified; most modern processors break these few stages into 20 or even more smaller stages to increase efficiency. This means that although the processor starts and finishes several instructions every cycle, it may take 20 or more cycles for a single instruction to execute from start to finish. This model is usually called a pipeline: just as it takes time for liquid to fill a pipeline and pass through it, once the pipeline is full, its output flow is constant.


Example of a four-stage pipeline. The colored rectangles represent instructions that are independent of each other.
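
Here is a minimal sketch of how such a pipeline overlaps independent instructions. The four stage names are the classic fetch/decode/execute/write-back split, used here just for illustration; each column of the printout is one clock cycle:

```python
# Toy model of a 4-stage pipeline: instruction i enters stage s at cycle i+s.

STAGES = ["IF", "ID", "EX", "WB"]      # fetch, decode, execute, write-back
instructions = ["i1", "i2", "i3", "i4", "i5"]

cycles = len(instructions) + len(STAGES) - 1
for i, name in enumerate(instructions):
    row = ["  "] * cycles
    for s, stage in enumerate(STAGES):
        row[i + s] = stage             # one stage per cycle, shifted by i
    print(name, " ".join(row))

# i1 IF ID EX WB
# i2    IF ID EX WB
# i3       IF ID EX WB
# ...
# Each instruction takes 4 cycles end to end, but once the pipeline is
# full, one instruction completes every cycle.
```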

The full cycle an instruction passes through is a very carefully choreographed process, but not all instructions complete in the same amount of time. For example, addition is very fast, while division or a load from memory can take thousands of cycles. Instead of stalling the entire processor until one slow instruction finishes, most modern processors execute instructions out of order. That is, they determine which instruction is the most advantageous to execute at the current moment and buffer other instructions that are not yet ready. If the current instruction is not ready, the processor can jump ahead in the code to see whether anything else is ready.
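
The following sketch illustrates the idea with a toy instruction window; the register names, ready flags, and window contents are invented for this example. Instead of stalling on a division whose operand has not yet arrived, the scheduler issues whichever buffered instruction already has its operands ready:

```python
# Toy model of out-of-order issue: pick the first buffered instruction
# whose source operands are available instead of waiting on a slow one.

ready = {"r1": True, "r2": False, "r3": True}   # r2 still waits on a slow load
window = [
    ("div", "r4", "r2", "r1"),   # not ready: needs r2
    ("add", "r5", "r1", "r3"),   # ready: both operands available
    ("add", "r6", "r3", "r3"),   # ready
]

def issue(window):
    for instr in window:
        op, dst, src_a, src_b = instr
        if ready[src_a] and ready[src_b]:
            window.remove(instr)
            return instr         # issue the first ready instruction
    return None                  # nothing ready: the pipeline would stall

print(issue(window))  # ('add', 'r5', 'r1', 'r3') issues ahead of the div
```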

Besides out-of-order execution, modern processors use a technique called superscalar architecture. This means that at any moment the processor is executing several instructions simultaneously at each stage of the pipeline, with hundreds more waiting to begin execution. To execute several instructions at once, processors contain several copies of each pipeline stage. If the processor sees that two instructions are ready to execute and there is no dependency between them, it does not wait for them to complete one after the other but executes them simultaneously. One popular implementation of this idea is called Simultaneous Multithreading (SMT), also known as Hyper-Threading. Intel and AMD processors currently support two-way SMT, while IBM has developed chips supporting up to eight-way SMT.


To accomplish this carefully choreographed execution, the processor contains many additional elements beyond the basic core. A processor has hundreds of individual modules, each with a specific function, but we will only cover the basics. The most important and most beneficial are the caches and the branch predictor. There are other structures we will not consider, such as reorder buffers, register renaming tables, and reservation stations.

The need for caches can be confusing, because they store data just like RAM or an SSD. What sets caches apart is their latency and access speed. Even though RAM is extremely fast, it is orders of magnitude slower than the CPU needs. It may take hundreds of cycles for RAM to respond with data, and the processor would have nothing to do in the meantime. And if the data is not in RAM, it may take tens of thousands of cycles to fetch it from the SSD. Without caches, processors would constantly stall.

Processors usually have three cache levels, which form the so-called memory hierarchy. The L1 cache is the smallest and fastest, L2 sits in the middle, and L3 is the largest and slowest of the caches. Above the caches in the hierarchy are small registers that store a single data value during computation. These registers are by orders of magnitude the fastest storage devices in the system. When a compiler converts a high-level program into assembly language, it determines the best way to use these registers.

When the CPU requests data from memory, it first checks whether that data is already in the L1 cache. If it is, the data can be accessed in just a couple of cycles. If it is not, the CPU checks L2, and then the L3 cache. Caches are implemented so that they are generally transparent to the core: the core simply requests data at a given memory address, and whatever level of the hierarchy holds it responds. Moving down the memory hierarchy, size and latency typically grow by orders of magnitude. Finally, if the CPU does not find the data in any of the caches, it turns to main memory (RAM).
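
Here is a minimal sketch of that lookup behavior. The addresses, contents, and latencies below are invented illustrative numbers, not real figures for any particular chip:

```python
# Toy model of a cache-hierarchy lookup: check each level in order and
# accumulate the latency until the data is found.

L1  = {"latency": 4,   "data": {0x100: 42}}
L2  = {"latency": 12,  "data": {0x100: 42, 0x200: 7}}
L3  = {"latency": 40,  "data": {0x100: 42, 0x200: 7, 0x300: 99}}
RAM = {"latency": 200, "data": {0x100: 42, 0x200: 7, 0x300: 99, 0x400: 1}}

def load(address):
    """Return (value, total cycles); the core never sees which level hit."""
    cycles = 0
    for level in (L1, L2, L3, RAM):
        cycles += level["latency"]
        if address in level["data"]:
            return level["data"][address], cycles
    raise KeyError(address)      # not in RAM either: fetch from the SSD

print(load(0x100))  # (42, 4)   -> L1 hit, just a few cycles
print(load(0x400))  # (1, 256)  -> misses every cache, pays the RAM latency
```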


In a typical processor, each core has two L1 caches: one for data and one for instructions. The L1 caches usually total around 100 kilobytes, and their size varies considerably by chip and processor generation. In addition, each core usually has its own L2 cache, although in some architectures it is shared between two cores. L2 caches are usually several hundred kilobytes in size. Finally, there is a single L3 cache shared by all the cores, with a size on the order of tens of megabytes.

When the processor executes code, the most frequently used instructions and data values are cached. This greatly speeds up execution, because the processor does not have to constantly go to main memory for the data it needs. In the second and third parts of the series, we will talk more about how these memory systems are implemented.

Besides the caches, one of the most important building blocks of a modern processor is an accurate branch predictor. Branch instructions are like an "if" statement for the processor: one set of instructions executes if the condition is true, and another if it is false. For example, we might want to compare two numbers and call one function if they are equal and another if they are not. Branch instructions are used extremely often and can account for roughly 20% of all instructions in a program.

At first glance, branch instructions may not seem problematic, but executing them correctly is actually very hard for the processor. At any moment, the processor may be in the middle of executing ten or twenty instructions simultaneously, so it is crucial to know which instructions to execute. It may take 5 cycles to determine that the current instruction is a branch and another 10 cycles to determine whether the condition is true. During that time, the processor may already have started executing dozens of further instructions without even knowing whether they are the right instructions to execute.

To get around this problem, all modern high-performance processors use a technique called speculation. The processor tracks branch instructions and guesses whether each branch will be taken or not. If the prediction is correct, the processor has already started executing the subsequent instructions, which yields a performance gain. If the prediction is incorrect, the processor halts execution, discards all the incorrect instructions it had begun to execute, and restarts from the correct point.

These branch predictors are among the simplest forms of machine learning, since the predictor learns the behavior of branches as the program runs. If it mispredicts too often, it adjusts toward the correct behavior. Decades of research into branch prediction methods have produced predictors in modern processors whose accuracy exceeds 90%.
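
As a concrete example of this kind of learning, here is a minimal sketch of the classic two-bit saturating-counter predictor, one simple, well-known scheme; real predictors are far more sophisticated:

```python
# A two-bit saturating counter: states 0/1 predict "not taken",
# states 2/3 predict "taken". Each outcome nudges the state one step
# toward what actually happened, saturating at 0 and 3, so a single
# anomaly in a stable branch does not flip the prediction.

class TwoBitPredictor:
    def __init__(self):
        self.state = 0

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)

predictor = TwoBitPredictor()
history = [True] * 8 + [False] + [True] * 8   # a loop branch with one exit
correct = 0
for outcome in history:
    correct += predictor.predict() == outcome
    predictor.update(outcome)
print(f"{correct}/{len(history)} correct")    # 14/17: only 3 mispredictions
```
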
Although speculation provides a tremendous performance boost, because the processor can execute instructions that are ready instead of waiting in line for slow ones to finish, it also creates security vulnerabilities. The famous Spectre attack exploits bugs in branch prediction and speculative execution. The attacker uses specially crafted code to get the processor to speculatively execute code that leaks values from memory. Preventing such data leaks required redesigning certain aspects of speculative execution, which cost a little performance.

Over the past decades, the architecture used in modern processors has come a long way. Innovation and clever design have yielded higher performance and better use of the underlying hardware. However, CPU makers carefully guard the secrets of their technologies, so we cannot know exactly what goes on inside them. Nevertheless, the fundamental principles of processor operation are the same across all architectures and models. Intel may add its secret ingredients to raise cache hit rates, and AMD may add an improved branch predictor, but processors from both companies perform the same task.

In this first look and overview, we covered the basics of how processors work. In the next part, we will describe how the components that make up a processor are designed, and talk about logic gates, clock frequencies, power management, circuit design, and more.
