Hardware acceleration of deep neural networks: GPU, FPGA, ASIC, TPU, VPU, IPU, DPU, NPU, RPU, NNP and other letters


On May 14, when Trump was preparing to unleash all the dogs on Huawei, I was peacefully sitting in Shenzhen at Huawei STW 2019 - a large conference for 1000 participants - whose program included a talk by Philip Wong, vice president of research at TSMC, on the prospects of non-von Neumann computational architectures, and a talk by Heng Liao, Huawei Fellow and chief scientist of Huawei's 2012 Lab, on the development of a new architecture of tensor processors and neuroprocessors. TSMC, in case you did not know, makes neural accelerators for Apple and Huawei on a 7 nm process (which few companies have mastered), and Huawei is ready to compete with Google and NVIDIA on neuroprocessors.

Google is banned in China, and I had not bothered to put a VPN on my tablet, so I patriotically used Yandex to see how things stood with other manufacturers of similar hardware and what was going on in general. I had been following the situation, but only after these talks did I realize how large a revolution was brewing in the depths of companies and in the quiet of research offices.

In the last year alone, more than $3 billion was invested in the topic. Google has long declared neural networks a strategic area and is actively building hardware and software support for them. NVIDIA, feeling its throne wobble, is investing fantastic effort in libraries for accelerating neural networks and in new hardware. Intel spent $0.8 billion in 2016 to buy two companies working on hardware acceleration of neural networks. And this is despite the fact that the main acquisitions have not yet begun, while the number of players has already exceeded fifty and is growing rapidly.


TPU, VPU, IPU, DPU, NPU, RPU, NNP - what does all this mean and who will win? Let's try to figure it out. If you are curious - welcome under the cut!


Disclaimer: The author has had to completely rewrite video processing algorithms for efficient implementation on ASICs, and clients have done prototyping on FPGAs, so he has some idea of the depth of the architectural differences. However, the author has not worked directly with hardware recently, although he anticipates having to dive back in.

Background of the problem


The number of required computations is growing rapidly: people would gladly take more layers, more architecture variants, and play more actively with hyperparameters, but... they run into performance limits. Meanwhile, the performance of the good old CPU is in serious trouble. All good things come to an end: Moore's law, as we know, is drying up, and the growth rate of processor performance is falling:


Real integer performance measured by SPECint, relative to the VAX-11/780; here and below the scale is often logarithmic

If from the mid-80s to the mid-2000s - the blessed heyday of computers - performance grew at an average of 52% per year, in recent years growth has dropped to 3% per year. And this is a problem (a translation of a recent article on the topic by John Hennessy, a patriarch of the field, about the problems and prospects of modern architectures, was published on Habr).

There are many reasons; for example, processor clock frequencies have stopped growing:


It has become harder to shrink transistors. The latest blow that drastically reduces performance (including the performance of CPUs already shipped) is (drum roll)... that's right, security. Meltdown, Spectre and other vulnerabilities cause enormous damage to the growth of CPU computing power (see, for example, the disabling of hyperthreading (!)). The topic has become popular, and new vulnerabilities of this kind appear almost monthly. And this is a nightmare, because it hurts performance.

At the same time, the development of many algorithms is firmly tied to the habitual growth of processor power. For example, many researchers today do not worry about the speed of their algorithms - something will be invented. And that might be acceptable during training, but the networks are becoming large and "heavy" to run. This is especially clear in video, where most approaches are simply not applicable at high speed, yet they often make sense only in real time. This is also a problem.

Similarly, new compression standards are being developed that assume an increase in decoder power. And what if processor power stops growing? The older generation remembers the problems of playing high-resolution video in the then-fresh H.264 on old computers in the 2000s. Yes, the quality was better at a smaller size, but on fast scenes the picture froze or the sound stuttered. I happen to talk with the developers of the new VVC/H.266 (release is planned for next year). They are not to be envied.

So what does the coming age have in store for us, given the slowing growth of processor performance, as applied to neural networks?

CPU




The ordinary CPU is a magnificent number-cruncher, polished over decades. Alas, for other tasks.

When we work with neural networks, especially deep ones, the network itself can take hundreds of megabytes. For example, the memory requirements of object detection networks are:
model                    | input size | param memory | feature memory
rfcn-res50-pascal        | 600 x 850  | 122 MB       | 1 GB
rfcn-res101-pascal       | 600 x 850  | 194 MB       | 2 GB
ssd-pascal-vggvd-300     | 300 x 300  | 100 MB       | 116 MB
ssd-pascal-vggvd-512     | 512 x 512  | 104 MB       | 337 MB
ssd-pascal-mobilenet-ft  | 300 x 300  | 22 MB        | 37 MB
faster-rcnn-vggvd-pascal | 600 x 850  | 523 MB       | 600 MB


In our experience, the coefficients of a deep neural network for processing translucent borders can take 150-200 MB. Colleagues have a network for estimating age and gender whose coefficients are on the order of 50 MB, and about 25 MB after optimization for a lower-precision mobile version (float32 ⇒ float16).
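As a back-of-the-envelope check of such figures, here is a minimal sketch (the layer shapes below are made up purely for illustration and are not taken from the models in the table) of how a parameter count turns into megabytes for float32, float16 and int8 weights:

```python
# Rough estimate of how many megabytes a network's weights occupy.
# The layer shapes below are invented purely for illustration.

def conv_params(in_ch, out_ch, k):
    """Parameters of a k x k convolution (plus bias)."""
    return in_ch * out_ch * k * k + out_ch

def dense_params(in_dim, out_dim):
    """Parameters of a fully connected layer (plus bias)."""
    return in_dim * out_dim + out_dim

layers = [
    conv_params(3, 64, 3),
    conv_params(64, 128, 3),
    conv_params(128, 256, 3),
    dense_params(256 * 7 * 7, 4096),
    dense_params(4096, 1000),
]

total = sum(layers)
for bytes_per_weight, name in [(4, "float32"), (2, "float16"), (1, "int8")]:
    print(f"{name}: {total * bytes_per_weight / 2**20:.1f} MB")
```

Halving the bytes per weight halves the memory footprint, which is exactly the float32 ⇒ float16 effect mentioned above.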

At the same time, memory access latency is distributed roughly like this (horizontal axis is logarithmic):



That is, once the data volume exceeds roughly 16 MB, latency grows 50-fold or more, which is fatal for performance. In fact, most of the CPU's time when working with deep neural networks is simply spent waiting for data. Intel's data on accelerating different networks is telling: real acceleration appears only when the network becomes small enough (for example, after weight quantization) to start fitting, at least partially, into the cache together with the data being processed. Note that the cache of a modern CPU consumes up to half of the processor's power; for heavy neural networks it is ineffective and works as an unreasonably expensive heater.
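This is easy to feel even without a profiler. Below is a rough sketch (plain NumPy; the absolute numbers are machine-dependent) in which a matrix-vector product imitates batch-1 inference through a fully connected layer: when the weights fit in the cache, the achieved GFLOP/s are usually several times higher than when the same weights have to be streamed from DRAM on every call:

```python
import time
import numpy as np

def matvec_gflops(n, repeats):
    """Achieved GFLOP/s of y = W @ x for an n x n float32 weight matrix."""
    w = np.random.rand(n, n).astype(np.float32)
    x = np.random.rand(n).astype(np.float32)
    w @ x  # warm-up
    start = time.perf_counter()
    for _ in range(repeats):
        w @ x
    elapsed = time.perf_counter() - start
    return 2 * n * n * repeats / elapsed / 1e9

# ~0.25 MB of weights: stays in the CPU cache between calls
print("cache-resident:", round(matvec_gflops(256, 20000), 1), "GFLOP/s")
# ~256 MB of weights: streamed from DRAM on every call, like batch-1 inference
print("memory-bound:  ", round(matvec_gflops(8192, 50), 1), "GFLOP/s")
```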

For adherents of running neural networks on the CPU
Even Intel OpenVINO, according to our internal tests, loses to a framework implementation built on matrix multiplication + NNPACK on many network architectures (especially simple ones, where bandwidth matters for real-time data processing in single-threaded mode). Such a scenario is typical for various classifiers of objects in an image (where the network has to be run many times - 50-100, by the number of objects in the image) and the overhead of launching OpenVINO becomes unreasonably high.

Pros:

  • "Everyone has one," and it is usually idle, i.e. the entry price of computation and deployment is relatively low.
  • There are some non-CV networks that fit the CPU well; colleagues name, for example, Wide & Deep and GNMT.

Cons:
  • The CPU is inefficient for deep neural networks (when the number of layers and the size of the input data are large); everything runs painfully slowly.

GPU




The topic is well known, so we will only quickly note the main points. For massively parallel tasks, which neural networks are, the GPU has a significant performance advantage:


Notice how the 72-core Xeon Phi 7290 shows off, and the "blue" bar is also a server Xeon - Intel does not give up so easily, as we will see below. But more importantly, the memory of video cards was originally designed for roughly 5 times higher throughput. In neural networks, the computations themselves are extremely simple: a few elementary operations, and then new data is needed. As a result, the speed of access to data is critical for the efficient operation of a neural network. The high-speed on-board memory of the GPU and a more flexible cache management system than on the CPU solve this problem:



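A handy way to reason about this is arithmetic intensity: how many FLOPs a layer performs per byte of weights it reads, compared with the "balance point" of the hardware (peak FLOP/s divided by memory bandwidth). The sketch below uses made-up but plausible hardware numbers, so treat it as an illustration of the method rather than of any specific chip:

```python
# Roofline-style back-of-the-envelope: is a layer compute-bound or memory-bound?
# The hardware numbers below are illustrative placeholders, not real chip specs.

PEAK_GFLOPS = 1000.0      # assumed peak compute of the accelerator
DRAM_GB_PER_S = 50.0      # assumed memory bandwidth

def fc_layer(in_dim, out_dim, batch):
    flops = 2 * in_dim * out_dim * batch      # multiply-accumulates
    bytes_moved = 4 * (in_dim * out_dim)      # float32 weights read once
    return flops, bytes_moved

for batch in (1, 8, 128):
    flops, bytes_moved = fc_layer(4096, 4096, batch)
    intensity = flops / bytes_moved           # FLOPs per byte of weights
    balance = PEAK_GFLOPS / DRAM_GB_PER_S     # FLOPs/byte needed to keep ALUs busy
    bound = "compute-bound" if intensity > balance else "memory-bound"
    print(f"batch={batch:4d}: {intensity:6.1f} FLOP/byte "
          f"(balance point {balance:.0f}) -> {bound}")
```

At batch 1 (typical for real-time video) the layer is hopelessly memory-bound, which is why raw memory bandwidth matters so much more than peak FLOPs.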
Tim Dettmers has for several years maintained an interesting and regularly updated review, "Which GPU(s) to Get for Deep Learning: My Experience and Advice for Using GPUs in Deep Learning". It is clear that Tesla and Titan cards rule for training, although the difference in architectures can cause interesting spikes, for example in the case of recurrent neural networks (and note for the future that the overall leader is the TPU):


There is, however, a very useful performance-per-dollar chart, where the RTX cards are on top (most likely thanks to their Tensor Cores), provided they have enough memory, of course:


Of course, the cost of computation matters. Second place in the first ranking and last in the second - the Tesla V100 sells for 700 thousand rubles, like 10 "ordinary" computers (plus an expensive Infiniband switch if you want to train on several nodes). True, the V100 also does the work of ten. People are willing to overpay for a noticeable acceleration of training.

So, let's summarize!

Pros:
  • The key one: 10-100 times faster than the CPU.
  • Extremely effective for training (and somewhat less so for inference).

Minus:
  • The cost of top-end video cards (the ones with enough memory to train large networks) exceeds the cost of the rest of the computer...

FPGA




FPGA is more interesting. It is a network of several million programmable blocks that we can also programmatically interconnect. The network and the blocks look something like this (note that the bottleneck is, again, at the chip's memory interface, although here things are simpler, as explained below):


Naturally, it makes sense to use an FPGA at the inference stage of a neural network (in most cases there is simply not enough memory for training). Moreover, the topic of FPGA implementations is now beginning to develop actively. For example, here is the fpgaConvNet framework, which helps to significantly accelerate CNN inference on FPGAs and reduce energy consumption.

The key advantage of the FPGA is that we can store the network directly in the cells, i.e. the bottleneck of pumping the same hundreds of megabytes of weights back and forth 25 times per second (for video) magically disappears. This allows a lower clock frequency and the absence of caches to yield a noticeable increase in performance rather than a decrease, and dramatically reduces global warming energy consumption per unit of computation.

Intel got actively involved in the process, releasing the open-source OpenVINO Toolkit last year, which includes the Deep Learning Deployment Toolkit (part of OpenCV). FPGA performance on different networks looks quite interesting, and the advantage of the FPGA over the GPU (albeit Intel's integrated GPU) is very significant:


What particularly warms the author's heart is that FPS is compared, i.e. frames per second - the most practical metric for video. Given that in 2015 Intel bought Altera, the second-largest player in the FPGA market, the chart gives good food for thought.

Obviously, the entry barrier to such architectures is higher, so some time must pass before convenient tools appear that effectively account for the fundamentally different FPGA architecture. But the potential of the technology should not be underestimated: it removes painfully many bottlenecks.

Finally, we emphasize that FPGA programming is a separate art. No program is executed there as such; all computation is expressed in terms of data flows, flow delays (which affect performance) and gates used (which are always in short supply). Therefore, to start programming effectively, you need to thoroughly reflash your own firmware (the neural network between your ears). Not everyone manages to do this efficiently. However, new frameworks will soon hide the external differences from researchers.

Pros:

  • Potentially faster network inference.
  • Power consumption noticeably lower than CPU and GPU (this is especially important for mobile solutions).

Cons:

  • They mainly help accelerate inference; training on them, unlike the GPU, is noticeably less convenient.
  • More complex programming than previous options.
  • Noticeably fewer professionals.

ASIC




Next comes the ASIC - short for Application-Specific Integrated Circuit, i.e. an integrated circuit built for our specific task - for example, a particular neural network cast in silicon. Here most computing nodes can work in parallel; in fact, only data dependencies and uneven amounts of computation at different layers of the network can prevent us from keeping all the implemented ALUs constantly busy.

Perhaps the biggest advertisement for ASICs among the general public in recent years has been cryptocurrency mining. At the very beginning, mining on the CPU was quite profitable, later one had to buy GPUs, then FPGAs, and then specialized ASICs, once people (read: the market) matured to order volumes at which their production became profitable.


In our field, too, services have (naturally!) already appeared that help put a neural network into silicon with the required power consumption, FPS and price. Magic, isn't it!

BUT! We lose the ability to reconfigure the network. And, of course, people are thinking about this too - see, for example, an article with a telling title on the subject. Well, you get the idea.



What matters is that in mass production the chip is cheap, works fast and consumes minimal energy.

Pros:

  • The lowest chip cost compared to all the previous solutions.
  • The lowest power consumption per operation.
  • Quite high speed of operation (record-setting, if desired).

Cons:

  • Very limited ability to update the network and logic.
  • The highest development cost compared to all the previous solutions.
  • Using an ASIC pays off mainly at large production volumes.

TPU


Recall that there are two tasks when working with networks: training and inference. If FPGA/ASIC are aimed primarily at accelerating inference (possibly of a specific fixed network), then the TPU (Tensor Processing Unit, or tensor processor) is either hardware acceleration of training or relatively universal acceleration of an arbitrary network. The name is beautiful, you must agree, although in practice it still uses rank-2 tensors, with a Matrix Multiply Unit (MXU) connected to High Bandwidth Memory (HBM). Below is an architecture diagram of Google's TPU versions 2 and 3:


TPU Google


Google itself made great publicity for TPUs by revealing its internal development in 2017:


By their own account, they began preliminary work on specialized processors for neural networks back in 2006, in 2013 they launched a well-funded project, and in 2015 the first chips went into service, greatly helping the neural networks behind the Google Translate cloud service and more. And that was, we emphasize, acceleration of inference. An important advantage for data centers is that the TPU's energy efficiency is two orders of magnitude higher than the CPU's (chart for TPU v1):


Also, inference is typically 10-30 times faster than on a GPU:


Even a 10-fold difference is significant, and it is clear that a 20-30-fold difference over the GPU determines the development of this direction.

And, fortunately, Google is not alone.

TPU Huawei


The long-suffering Huawei also began developing TPUs several years ago, under the name Huawei Ascend, and in two versions at once: for data centers (like Google) and for mobile devices (which Google has also started doing recently). If you believe Huawei's materials, they beat the fresh Google TPU v3 by 2.5 times in FP16 and the NVIDIA V100 by 2 times:



As usual, the good question is how this chip will behave on real tasks, since the chart shows peak performance. In addition, Google TPU v3 is good in many respects because it can work effectively in clusters of 1024 processors. Huawei has also announced server clusters for the Ascend 910, but gives no details. In general, Huawei's engineers have shown themselves to be extremely competent over the past 10 years, and there is every chance that the 2.8-fold peak performance advantage over Google TPU v3, coupled with the latest 7 nm process, will be put to good use.

Memory and the data bus are critical for performance, and the slide shows that considerable attention was paid to these components (in particular, the speed of communication with memory is noticeably higher than that of the GPU):



The chip also takes a slightly different approach: instead of scaling two-dimensional 128x128 MXUs, computation is done in a smaller three-dimensional cube of 16x16xN, where N = {16, 8, 4, 2, 1}. The key question, therefore, is how well this maps onto the actual acceleration of specific networks (computation in a cube is convenient for images, for example). A closer look at the slide also shows that, unlike Google's chip, this one works with compressed FullHD video right away. To the author, this sounds very encouraging!
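To make the "2D 128x128 MXU versus 3D 16x16xN cube" distinction more tangible, here is a toy NumPy sketch of blocked matrix multiplication; the two block shapes loosely mimic the two schemes, while real hardware of course pipelines such tiles in silicon rather than in Python loops:

```python
import numpy as np

def blocked_matmul(a, b, bm, bn, bk):
    """C = A @ B computed block by block, the way a matrix unit
    consumes tiles of the operands."""
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, bm):
        for j in range(0, n, bn):
            for p in range(0, k, bk):
                c[i:i+bm, j:j+bn] += a[i:i+bm, p:p+bk] @ b[p:p+bk, j:j+bn]
    return c

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)

# "2D" scheme: large square tiles, one level of blocking (a la a 128x128 MXU)
c1 = blocked_matmul(a, b, 128, 128, 128)
# "3D cube" scheme: small 16x16x16 bricks that can be stacked along any axis
c2 = blocked_matmul(a, b, 16, 16, 16)

assert np.allclose(c1, a @ b, rtol=1e-3) and np.allclose(c2, a @ b, rtol=1e-3)
print("both blockings give the same result; only the hardware mapping differs")
```

The mathematics is identical; what differs is how naturally a given network's layer shapes fill the tiles, which is exactly why the real-world speed-up depends on the specific network.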

As mentioned above, the same product line includes processors for mobile devices, for which energy efficiency is critical and on which the network will mainly be executed (that is, there are separate processors for training in the cloud and separate ones for inference):



And on this metric everything looks quite good, at least compared to NVIDIA (note that they did not compare against Google's chips; then again, Google does not hand its TPUs out, only access to them in the cloud). Their mobile chips will compete with processors from Apple, Google and other companies, but it is still too early to draw conclusions there.

It is clear that the upcoming Nano, Tiny and Lite chips should be even better. It becomes clear why Trump got scared and why many manufacturers are carefully studying the successes of Huawei (which in 2018 overtook all US hardware manufacturers in revenue, including Intel).

Analog Deep Networks


As we know, technology often develops in a spiral, with old and forgotten approaches becoming relevant again at a new stage.

Something similar may well happen with neural networks. You may have heard that multiplication and addition were once performed with vacuum tubes and transistors (for example, color space conversion - a typical matrix multiplication - was done this way in every color television until the mid-90s). So a good question arises: if our neural network is relatively resistant to imprecise computation inside, what if we move these computations into the analog domain? We get a noticeable acceleration of computation and a potentially dramatic reduction in the energy consumed per operation:



With this approach, the DNN (Deep Neural Network) is computed quickly and energy-efficiently. But there is a problem: the DACs/ADCs - the converters from digital to analog and back - which reduce both the energy efficiency and the accuracy of the process.
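The premise that a trained network tolerates imprecise arithmetic is easy to probe in simulation. Below is a minimal sketch (a random toy "network", not a trained model, with arbitrarily chosen noise parameters) where analog imperfection is modelled as multiplicative Gaussian noise on every matrix product:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, weights, noise=0.0):
    """Tiny MLP forward pass; `noise` models analog imprecision as
    multiplicative Gaussian noise on the matrix products."""
    for w in weights:
        y = x @ w
        if noise > 0:
            y = y * (1 + noise * rng.standard_normal(y.shape))
        x = np.maximum(y, 0)  # ReLU
    return x

# A toy 3-layer network with random weights, just to illustrate the effect.
weights = [rng.standard_normal((64, 64)) / 8 for _ in range(3)]
x = rng.standard_normal((1, 64))

exact = forward(x, weights)
noisy = forward(x, weights, noise=0.02)   # ~2% error on every multiply-accumulate

rel_err = np.linalg.norm(noisy - exact) / np.linalg.norm(exact)
print(f"relative output error with 2% analog noise: {rel_err:.1%}")
```

Real robustness, of course, depends on the architecture, the training procedure and where exactly the noise enters, but the general tolerance to small errors is what makes analog computation attractive at all.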

However, back in 2017, IBM Research proposed analog CMOS RPUs (Resistive Processing Units), which allow the processed data to be stored in analog form as well and significantly improve the overall effectiveness of the approach:


Besides analog memory, reducing the precision of the neural network can also help a lot - this is the key to miniaturizing the RPU and thus increasing the number of computing cells per chip. Here, too, IBM is among the leaders: just this year they quite successfully quantized networks to 2-bit precision and intend to bring it down to 1 bit for inference (and 2 bits during training), which would potentially allow a 100-fold (!) increase in performance compared to a modern GPU:


It’s still too early to talk about analog neurochips, because so far all this is being tested at the level of early prototypes:


However, potentially the direction of analog computing looks extremely interesting.

The only thing that is worrying is that this is IBM, which has already filed dozens of patents on the topic. In our experience, due to the peculiarities of its corporate culture, IBM cooperates relatively poorly with other companies and, owning a technology, is more likely to slow its adoption by others than to share it effectively. For example, IBM at the time refused to license arithmetic coding for JPEG to the ISO committee, even though the draft standard included it. As a result, JPEG went into production with Huffman coding and compressed 10-15% worse than it could have. The same happened with video compression standards, and the industry switched en masse to arithmetic coding in codecs only 12 years later, when 5 IBM patents expired... Let's hope that this time IBM will be more inclined to cooperate, and accordingly we wish maximum success in this field to everyone not associated with IBM - fortunately, there are plenty of such people and companies.

If it works, it will be a revolution in the application of neural networks and a revolution in many areas of computer science.

Various other letters


In general, the topic of accelerating neural networks has become fashionable: all the large companies and dozens of startups are working on it, and at least 5 of them had attracted more than $100 million in investment by the beginning of 2018. In total, $1.5 billion was invested in chip startups in 2017 - and this after investors had ignored chip makers for a good 15 years (there was nothing to be gained there against the background of the giants). In short, there is now a real chance for a small hardware revolution. It is extremely difficult to predict which architecture will win, but the need for a revolution has matured and the room for performance gains is great. A classic revolutionary situation has formed: Moore can no longer, and Dean is not yet ready.

Well, since the most important law of the market is "be different," many new letters have appeared, for example:

  • Neural Processing Unit (NPU) - neuroprocessor, sometimes nicely called a neuromorphic chip - generally speaking, a common name for a neural network accelerator; this is what Samsung, Huawei and others down the list call their chips...

    Here and below in this section, the examples of how companies name their technologies are mostly slides from corporate presentations

    A direct comparison is clearly problematic, but here is curious data comparing chips with neuroprocessors from Apple and Huawei, both manufactured by the TSMC mentioned at the beginning. It shows that the competition is tough: each new generation brings a 2-8x performance increase and more complex process technology:


  • Neural Network Processor (NNP) - neural network processor.


    This is what Intel, for example, calls its family of chips (originally developed by Nervana Systems, which Intel bought in 2016 for $400+ million). However, the name NNP is also fairly common in articles and books.
  • Intelligence Processing Unit (IPU) - intelligence processor - the name of the chips promoted by Graphcore (which, by the way, has already raised $310 million of investment).


    It produces special cards for computers, tailored to training neural networks, with RNN training performance claimed to be 180-240 times higher than that of the NVIDIA P100.
  • Dataflow Processing Unit (DPU) - dataflow processor - a name promoted by WAVE Computing, which has already raised $203 million of investment. It releases roughly the same kind of accelerator cards as Graphcore:


    Since they received $100 million less, they claim training is only 25+ times faster than on a GPU (although they promise it will soon be 1000 times). We shall see...
  • Vision Processing Unit (VPU) - computer vision processor:


    The term is used in the products of several companies, for example, the Myriad X VPU from Movidius (also bought by Intel, in 2016).
  • One of IBM's competitors (who, we recall, use the term RPU) is Mythic (https://www.mythic-ai.com/about-us/), which promotes Analog DNN - also storing the network in the chip and executing it relatively fast. So far they have only promises, though really serious ones:



And this lists only the most important directions, each of which has received hundreds of millions of dollars (which matters when developing hardware).
In general, as we can see, a hundred flowers are blooming. Gradually, companies will digest their billions in investments (producing a chip usually takes 1.5-3 years), the dust will settle, the leader will become clear, the winners will as usual write the history, and the name of the most commercially successful technology will become the common term. This has already happened more than once ("IBM PC," "smartphone," "Xerox," etc.).

A couple of words about the correct comparison


As noted above, it is not easy to compare the performance of neural network hardware correctly. This is exactly why Google publishes a chart in which the TPU v1 beats the NVIDIA V100. NVIDIA, seeing such an outrage, publishes a chart in which the Google TPU v1 loses to the V100. (So there!) Google then publishes a chart in which the V100 loses crushingly to Google TPU v2 & v3. And finally Huawei publishes one in which everyone loses to Huawei Ascend, but the V100 beats the TPU v3. A circus, in short. Characteristically, each chart is true in its own way!

The root causes of the situation are clear (a small sketch after this list shows how much the numbers depend on what exactly you measure):

  • You can measure training speed or inference speed (whichever is more convenient).
  • You can measure different neural networks, since the speed of inference/training of different networks on a given architecture can differ significantly due to the network architecture and the amount of data required.
  • Or you can measure the peak performance of the accelerator (perhaps the most abstract number of all).
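Even the first point alone can change the headline number by an order of magnitude. Here is a tiny sketch (a single matrix multiplication stands in for the "network"; timings are machine-dependent) showing how the same hardware yields very different figures depending on whether you report single-image latency or large-batch throughput:

```python
import time
import numpy as np

# A stand-in "network": one big fully connected layer. The point is not the
# model but how differently its speed can be reported.
rng = np.random.default_rng(0)
w = rng.standard_normal((4096, 4096)).astype(np.float32)

def seconds_per_call(batch, repeats=20):
    x = rng.standard_normal((batch, 4096)).astype(np.float32)
    x @ w                                   # warm-up (often "forgotten" in marketing)
    start = time.perf_counter()
    for _ in range(repeats):
        x @ w
    return (time.perf_counter() - start) / repeats

lat = seconds_per_call(1)
thr = seconds_per_call(256)
print(f"single-image latency : {lat * 1e3:.2f} ms  -> {1 / lat:.0f} 'FPS'")
print(f"batch-256 throughput : {256 / thr:.0f} images/s")
# Both numbers describe the same hardware and the same "model", yet they can
# differ by an order of magnitude - which is exactly why vendor charts disagree.
```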

As an attempt to bring order to this zoo, the MLPerf benchmark has appeared; version 0.5 is now available, i.e. the comparison methodology is still under development, with the first release planned for the 3rd quarter of this year:


Since one of the main contributors to TensorFlow is among the authors, there is every chance of finding out the fastest way to train and, possibly, to run networks (the mobile version of TF will probably also be included in this benchmark over time).

Recently the international organization IEEE, which publishes a third of the world's technical literature on radio electronics, computing and electrical engineering, banned Huawei in no uncertain terms, though it soon lifted the ban. Huawei is not yet present in the current MLPerf rating, even though the Huawei TPU is a serious competitor to the Google TPU and NVIDIA cards (i.e. besides political reasons there are also economic reasons to ignore Huawei, let's be honest). We will follow the developments with undisguised interest!

Up into the sky! Closer to the clouds!


And since we are talking about training, it is worth saying a few words about its specifics:

  • As research has generally moved to deep neural networks (with dozens and hundreds of layers that genuinely beat everything else), it became necessary to grind through hundreds of megabytes of coefficients, which immediately made all the caches of previous-generation processors ineffective. At the same time, classic ImageNet charts show a clear correlation between the size of a network and its accuracy (the higher the better, the further right the larger the network; the horizontal axis is logarithmic):


  • Computation inside a neural network follows a fixed pattern, i.e. where all the "branches" and "jumps" (in last-century terms) will go is known precisely in advance in the overwhelming majority of cases, which leaves the speculative execution of instructions - previously a notable source of performance - out of work:

    This makes the sophisticated superscalar branch-prediction and speculation mechanisms of previous decades of processor improvement useless (on DNNs this part of the chip, too, unfortunately mostly contributes to global warming, just like the cache).
  • At the same time, neural network training scales horizontally rather poorly. That is, we cannot take 1000 powerful computers and get a 1000-fold speed-up of training. We cannot even get 100 (at the very least, the theoretical problem of degrading training quality at very large batch sizes has not yet been solved). It is generally quite difficult to distribute the work across several computers, because as soon as the speed of access to the single memory holding the network drops, training speed drops dramatically. Therefore, if a researcher gets free access to 1000 powerful computers, he will certainly soon take all of them, but most likely (unless there is Infiniband + RDMA) they will be running many networks with different hyperparameters, and the total training time will be only a few times shorter than with 1 computer (see the sketch right after this list). One can play with batch size, fine-tuning and other modern techniques, but the main conclusion stands: with more machines, the overall efficiency and the probability of getting a result grow, but not linearly. And since the time of a Data Science researcher is expensive today, it is often worth spending many machines (even wastefully) if that buys some acceleration (see the example with 1, 2 and 4 expensive V100s in the clouds just below).
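The sublinear scaling can be seen even from a crude analytical model of synchronous data-parallel training: per step, each worker computes gradients on its share of the batch and then all workers exchange gradients of the full model, and that exchange does not shrink as machines are added. All constants below are illustrative placeholders, not measurements:

```python
# Crude model of synchronous data-parallel training time per step.
# All constants are illustrative placeholders, not measurements.

MODEL_BYTES = 200e6         # ~200 MB of gradients to exchange each step
STEP_COMPUTE_S = 1.0        # compute time of one step on a single worker
NET_BYTES_PER_S = 1.25e9    # ~10 Gbit/s interconnect (no Infiniband/RDMA)

def step_time(workers):
    compute = STEP_COMPUTE_S / workers              # the batch is split evenly
    # A ring all-reduce moves ~2 * MODEL_BYTES per worker regardless of count.
    comm = 0.0 if workers == 1 else 2 * MODEL_BYTES / NET_BYTES_PER_S
    return compute + comm

base = step_time(1)
for n in (1, 2, 4, 8, 100):
    print(f"{n:4d} workers: speedup x{base / step_time(n):5.2f}")
```

With these (made-up) numbers, the speed-up saturates at a few times no matter how many machines are added, because the communication term stays constant - which is exactly why fast interconnects and on-chip memory matter so much for training hardware.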

Exactly these points explain why so many have rushed into developing specialized hardware for deep neural networks, and why they got their billions. The light at the end of the tunnel is genuinely visible there, and not only at Graphcore (which, we recall, claims a 240-fold acceleration of RNN training).

For example, the gentlemen at IBM Research are full of optimism and are developing specialized chips expecting that in 5 years computational efficiency will grow by an order of magnitude, and in 10 years by two orders of magnitude, reaching a 1000-fold improvement over the 2016 level (the chart below is in efficiency per watt, but the performance of the cores will also grow):


All this means the arrival of hardware on which training will be relatively fast but which will cost a lot - which naturally leads to the idea of sharing the time of such an expensive piece of iron between researchers. And that idea today just as naturally leads to cloud computing. The migration of training to the clouds has long been in full swing.

Note that even now, training the same models can differ in time by an order of magnitude between cloud services. Below, Amazon leads and Google's free Colab comes last. Notice how the result of the leaders changes with the number of V100s: increasing the number of cards 4-fold (!) raises performance by less than a third (!!!) going from blue to purple, and for Google by even less:


It seems that in the coming years the gap will grow to two orders of magnitude. Gentlemen, get your money ready! Together we will return the multibillion-dollar investments to the most successful investors...

Briefly


Let's try to summarize the key points in the table:
Type             | What it accelerates  | Comment
CPU              | Mainly inference     | Usually inferior in speed and energy efficiency, but quite suitable for small neural networks
GPU              | Inference + training | The most versatile solution, but rather expensive both in cost of computation and in energy efficiency
FPGA             | Inference            | A relatively universal solution for inference; in some cases it allows a drastic speed-up
ASIC             | Inference            | The cheapest, fastest and most energy-efficient way to run a network, but large production runs are needed
TPU              | Inference + training | The first versions accelerated inference; the current ones very effectively accelerate both inference and training
IPU, DPU ... NNP | Mostly training      | Many marketing letters that will be safely forgotten in the coming years; the main benefit of this zoo is exploring different directions of DNN acceleration
Analog DNN / RPU | Inference + training | Analog accelerators can potentially revolutionize the speed and energy efficiency of running and training neural networks

A few words about software acceleration


In fairness, we should mention that software acceleration of deep neural network inference and training is itself a big topic today. Inference can be significantly accelerated primarily through so-called quantization of the network. This is possible, first, because the range of weights actually used is not that large, and the weights can often be coarsened from a 4-byte floating-point value to a 1-byte integer (and, recalling IBM's results, even further). Second, a trained network as a whole is fairly robust to computational noise, and accuracy drops only slightly when moving to int8. And although the number of operations may even increase (because of rescaling during computation), the fact that the network shrinks 4-fold in size and can be processed with fast vector operations noticeably increases the overall inference speed. This is especially important for mobile applications, but it also works in the clouds (an example of an inference speed-up in Amazon's cloud):


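For reference, here is a minimal sketch of the simplest variant of the quantization described above - symmetric, per-tensor weight quantization to int8. Real runtimes use per-channel scales, calibration data and an actual int8 matmul kernel; here we only dequantize back to float to check the round-trip error:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: float32 -> (int8, scale)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32) * 0.05   # toy weights
x = rng.standard_normal((1, 512)).astype(np.float32)

q, scale = quantize_int8(w)
y_fp32 = x @ w
y_int8 = x @ dequantize(q, scale)   # a real runtime would keep the matmul in int8

rel_err = np.linalg.norm(y_int8 - y_fp32) / np.linalg.norm(y_fp32)
print(f"weights: {w.nbytes / 2**20:.2f} MB -> {q.nbytes / 2**20:.2f} MB, "
      f"output error {rel_err:.2%}")
```

The 4x reduction in size with a sub-percent output error is what makes int8 inference so attractive on both mobile devices and cloud CPUs.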
There are other algorithmic ways to speed up inference, and even more ways to accelerate training. However, these are separate big topics, for another time.

Instead of a conclusion


In his lectures, investor and author Tony Seba gives a great example: in 2000 the No. 1 supercomputer, with a performance of 1 teraflops, occupied 150 square meters, cost $46 million and consumed 850 kW:


15 years later, an NVIDIA GPU with a performance of 2.3 teraflops (2 times more) fit in a hand, cost $59 (an improvement of roughly a million times) and consumed 15 W (an improvement of 56 thousand times):


In March of this year, Google introduced TPU Pods - essentially liquid-cooled supercomputers based on TPU v3, whose key feature is that they can work together in systems of 1024 TPUs. They look quite impressive:


Exact figures are not given, but the system is said to be comparable to the Top-5 supercomputers in the world. A TPU Pod dramatically increases the speed of neural network training. To speed up their interaction, the TPUs are connected by high-speed links into a toroidal structure:


It seems that in 15 years a neuroprocessor with twice this performance may also fit in your hand, like the Skynet processor (you must admit, they do look somewhat alike):

A frame from the uncut version of the movie "Terminator 2"

Given the current pace of improvement of hardware accelerators for deep neural networks and the example above, this is entirely realistic. There is every chance that in a few years you will be able to hold in your hand a chip with the performance of today's TPU Pod.

By the way, it is amusing that in the film the chip's makers (apparently realizing where self-learning could lead the network) disabled additional training by default. Characteristically, the T-800 itself could not enable training mode and ran in inference mode (see the longer director's version: https://terminator.fandom.com/wiki/Terminator_2:_Judgment_Day_(film)#Deleted_scenes). Yet its neural-net processor was advanced and, with training enabled, could use previously accumulated data to update the model. Not bad for 1991.

This text was begun in hot, 13-million-strong Shenzhen. I was sitting in one of the city's 27,000 electric taxis, looking with great interest at the car's 4 LCD screens: a small one among the instruments in front of the driver, two in the center of the dashboard, and the last one translucent in the rear-view mirror, combined with a dashcam, an interior surveillance camera and Android on board (judging by the top bar with the battery level and network connection). It displayed the driver's details (whom to complain to, if anything), a fresh weather forecast and, it seemed, a link to the taxi dispatch. The driver did not know English, so asking him about his impressions of the electric car did not work out. He idly pressed the pedal, gently nudging the car through the traffic jam, while I watched the futuristic view out the window - Chinese office workers in jackets riding home from work on electric scooters and monowheels - and wondered what this would look like in 15 years...

Actually, already today a rear-view mirror, using data from the dashcam camera and hardware-accelerated neural networks, is quite capable of driving the car in a traffic jam and plotting a route. What bliss!) In 15 years the system will obviously not only be able to drive the car, but will also gladly tell me the characteristics of the latest Chinese electric cars - in Russian, of course (with English, Chinese... Albanian, finally, as options). The driver is redundant: he is poorly trainable.

Gentlemen! EXTREMELY INTERESTING 15 years await us!

Stay tuned!

I’ll be back! )))



Thanks
I would like to heartily thank:

  • the Computer Graphics Laboratory of the CMC faculty of Lomonosov Moscow State University for its contribution to the development of computer graphics in Russia and beyond,
  • our colleagues Mikhail Yerofeyev and Nikita Bagrov, whose examples are used above,
  • Konstantin Kozhemyakov personally, who did a lot to make this article better and clearer,
  • and, finally, many thanks to Alexander Bokov, Mikhail Yerofeyev, Vitaly Lyudvichenko, Roman Kazantsev, Nikita Bagrov, Ivan Molodetsky, Egor Sklyarov, Alexey Solovyov, Evgeny Lyapustin, Sergey Lavrushkin and Nikolay Oplachko for a large number of sensible comments and corrections that made this text much better!
