QEMU.js: now in earnest and with WASM

Once upon a time, for fun, I decided to prove the reversibility of the process and learn how to generate JavaScript (or rather, Asm.js) from machine code. QEMU was chosen for the experiment, and some time later an article was written on Habr. In the comments I was advised to redo the project with WebAssembly, and abandoning the almost-finished project somehow didn't appeal to me... The work went on, but very slowly, until recently a comment appeared there along the lines of "So, how did it all end?". To my rather detailed answer I heard "This deserves an article." Well, since it deserves one, here it is. Maybe it will be useful to someone. From it the reader will learn some facts about how QEMU's code generation backends are structured, as well as how to write a Just-in-Time compiler for a web application.


Tasks


Since I had already learned how to port QEMU to JavaScript after a fashion, this time it was decided to do it sensibly and not repeat the old mistakes.


Mistake number one: branching from a point release


My first mistake was to branch my version off upstream version 2.4.1. At the time it seemed like a good idea: if a point release exists, then it is probably more stable than plain 2.4, let alone the master branch. And since I planned to add a fair number of bugs of my own, I had no need for anyone else's. That is probably how it turned out. But here's the bad luck: QEMU does not stand still, and at some point they even announced that the generated code had been optimized by some 10 percent. "Yep, now I'm stuck," I thought, and broke off. Here we must digress: due to the single-threaded nature of QEMU.js, and the fact that the original QEMU does not assume the absence of multithreading (that is, the ability to run several unrelated code paths simultaneously, not just "use all the cores", is critical to it), the main loop functions had to be "turned inside out" so they could be called from outside. This created some natural problems when merging. The fact that some of the changes from the master branch I tried to merge with had also been cherry-picked into the point release (and therefore into my branch) certainly didn't add any convenience either.


In general, I decided that it made sense to scrap the prototype for parts anyway and build a new version from scratch, based on something fresher, this time from master.


Mistake number two: slapdash methodology


In essence, this is not really a mistake, just a feature of building a project under conditions of complete uncertainty about "where and how to move?", and in general "will we get there at all?". Under those conditions, slapdash programming was a justified option, but naturally I did not want to repeat it unnecessarily. This time I wanted to do it properly: atomic commits, deliberate code changes (rather than "stringing random characters together until it compiles (with warnings)", as Linus Torvalds once said about someone, if you believe Wikiquote), etc.


Mistake number three: wading in without knowing the ford


I haven't fully gotten rid of this one even now, but this time I decided not to take the path of absolutely minimal resistance and to do it "like an adult", namely, to write my own TCG backend from scratch, so that I wouldn't later have to say "Yes, of course it's slow, but I can't control everything — that's just how TCI is written...". Besides, initially it seemed like an obvious solution, since I generate binary code. As the saying goes, "close, but no cigar": the code is binary, sure, but control cannot simply be transferred to it — it must be explicitly pushed into the browser for compilation, which produces some object from the JS world that still needs to be stored somewhere. However, on normal RISC architectures, as far as I understand, a typical situation is the need to explicitly flush the instruction cache for freshly generated code — so if this is not exactly what we need, it is at least close. In addition, from my last attempt I learned that control does not seem to be transferred into the middle of a translation block, so we don't really need bytecode interpretable from an arbitrary offset, and we can simply generate one function per TB.


They came and gave me a kick


Although I started rewriting the code back in July, the magic kick crept up unnoticed: usually letters from GitHub arrive as notifications about replies to issues and pull requests, but here, suddenly, there was a mention in the thread Binaryen as a qemu backend, in the spirit of "Here, he did something similar, maybe he'll say something." It was about using Binaryen, a library related to Emscripten, to create a WASM JIT. Well, I replied that you have an Apache 2.0 license there, while QEMU as a whole is distributed under GPLv2, and they are not very compatible. It suddenly turned out that the license could somehow be fixed (I don't know how: maybe changed, maybe dual licensing, maybe something else...). This, of course, made me happy, because by that time I had already looked closely at the binary format of WebAssembly several times, and it left me sad and confused. And here was a library that would devour basic blocks together with a transition graph and emit the bytecode, and even run it itself in an interpreter if necessary.


Then there was another letter in the QEMU mailing list, but that one related more to the question "Who needs this at all?". And it suddenly turned out to be needed. At a minimum, the following possible uses can be scraped together, if it works more or less quickly:


  • launching something for teaching purposes, without any installation at all
  • iOS virtualization, where, rumor has it, the only application entitled to generate code on the fly is the JS engine (is that true?)
  • demonstration of mini-OSes — single-floppy ones, embedded ones, all sorts of firmware, etc...

Browser runtime features


As I said before, QEMU is tied to multithreading, but the browser has none. Well, that is, sort of... At first it wasn't there at all, then WebWorkers appeared — as far as I understand, this is multithreading based on message passing, without shared mutable variables. Naturally, this creates significant problems when porting existing code built on a shared-memory model. Then, under public pressure, it was implemented under the name SharedArrayBuffer. It was gradually introduced, its launch in different browsers was celebrated, then the New Year was celebrated, and then came Meltdown... After which it was concluded that coarsening the timer resolution doesn't help much: with shared memory and a thread incrementing a counter, you still get quite an accurate timer. So multithreading with shared memory was disabled. It seems it was later turned back on, but, as the first experiment showed, there is life without it, and if so, we will try to manage without relying on multithreading.


The second feature is the impossibility of low-level manipulations with the stack: you cannot simply save the current context and switch to a new one with a new stack. The call stack is managed by the JS virtual machine. It would seem, what's the problem, since we've already decided to manage the former threads entirely by hand? The fact is that block I/O in QEMU is implemented through coroutines, and this is exactly where low-level stack manipulations would come in handy. Fortunately, Emscripten already has a mechanism for asynchronous operations — even two: Asyncify and Emterpreter. The first works through significant bloat of the generated JavaScript code and is no longer supported. The second is the current "correct way" and works by generating bytecode for its own interpreter. It works slowly, of course, but it doesn't bloat the code. True, coroutine support for this mechanism had to be contributed independently (there were already coroutines written for Asyncify, and there was an implementation of roughly the same API for Emterpreter; they just had to be connected).


At the moment I haven't yet had time to split the code into a WASM-compiled part and a part interpreted with Emterpreter, so block devices don't work yet (see the next episode, as they say...). That is, in the end you should get this funny layered thing:


  • interpreted block I/O. Well, did you really expect emulated NVMe with native performance? :)
  • statically compiled main QEMU code (the translator, other emulated devices, etc.)
  • guest code dynamically compiled to WASM

Features of QEMU source codes


As you have probably already guessed, in QEMU the code emulating the guest architectures and the code generating host machine instructions are separated. In fact, it is even a little trickier:


  • there are guest architectures
  • there are accelerators, namely KVM for hardware virtualization on Linux (for compatible guest and host systems), and TCG for JIT code generation anywhere. Starting with QEMU 2.9, there is support for the HAXM hardware virtualization standard on Windows (details)
  • if TCG is used rather than hardware virtualization, it has separate code generation support for each host architecture, as well as a universal interpreter
  • ...and around all of this — emulated peripherals, user interface, migration, record-replay, etc.

By the way, did you know: QEMU can emulate not only a whole computer, but also the processor for a single user process on the host kernel — this is used, for example, by the AFL fuzzer for instrumenting binaries. Perhaps someone would like to port this QEMU mode of operation to JS? ;)


Like most long-standing free software, QEMU is built by calling configure and make. Suppose you decide to add something: a TCG backend, a thread implementation, something else. Don't rush to rejoice/be horrified (underline as appropriate) at the prospect of dealing with Autoconf — in fact, QEMU's configure appears to be hand-written and is not generated from anything.


WebAssembly


So what is this thing, WebAssembly (also known as WASM)? It is a replacement for Asm.js that no longer pretends to be valid JavaScript code. On the contrary, it is purely binary and optimized, and even just writing an integer into it is not so simple: for compactness, integers are stored in the LEB128 format.
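As an illustration of the format (my own sketch, not code from this project), here is a minimal unsigned LEB128 encoder in C: each byte carries 7 bits of payload, and the high bit is set on every byte except the last.

```c
#include <stdint.h>
#include <stddef.h>

/* Encode `value` as unsigned LEB128 into `out`.
 * Returns the number of bytes written (1 to 5 for a uint32_t). */
static size_t uleb128_encode(uint32_t value, uint8_t *out)
{
    size_t n = 0;
    do {
        uint8_t byte = value & 0x7F;   /* take the low 7 bits */
        value >>= 7;
        if (value != 0)
            byte |= 0x80;              /* continuation bit: more bytes follow */
        out[n++] = byte;
    } while (value != 0);
    return n;
}
```

For example, 624485 encodes to the classic three-byte sequence 0xE5 0x8E 0x26, while small values like 0 fit in a single byte.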


You may have heard about the relooping algorithm for Asm.js — the reconstruction of "high-level" control flow constructs (i.e. if-then-else, loops, etc.), which JS engines are tuned for, from the low-level LLVM IR, which is closer to the machine code executed by the processor. Naturally, QEMU's intermediate representation is closer to the latter. It would seem: here it is, bytecode, the end of all suffering... And then along come the blocks, if-then-else, and loops!..


And this is another reason Binaryen is useful: it can naturally accept high-level blocks close to what will be stored in WASM. But it can also emit code from a graph of basic blocks and the transitions between them. Well, and I already said that it hides the WebAssembly storage format behind a convenient C/C++ API.


TCG (Tiny Code Generator)


TCG was originally a backend for a C compiler. Then it apparently couldn't withstand the competition with GCC, but in the end it found its place in QEMU as the code generation mechanism for the host platform. There is also a TCG backend, TCI, that generates some abstract bytecode which is immediately executed by an interpreter, but I decided to avoid using it this time. However, the fact that QEMU already provides the ability to jump into a generated TB via the tcg_qemu_tb_exec function turned out to be very useful to me.


To add a new TCG backend to QEMU, you need to create a subdirectory tcg/<architecture name> (in this case, tcg/binaryen), put two files in it — tcg-target.h and tcg-target.inc.c — and register the whole thing in configure. You can put other files there, but, as you can guess from the names of these two, they will both be included somewhere: one as an ordinary header file (it is included in tcg/tcg.h, and from there in other files in the tcg and accel directories, and not only there), the other only as a code snippet in tcg/tcg.c, from which it gets access to its static functions.


Having decided that a detailed investigation of how it all works would take too much of my time, I simply copied the "skeletons" of these two files from another backend implementation, honestly noting this in the license header.


The tcg-target.h file contains mainly settings in the form of #defines:


  • how many registers there are and how wide the target architecture is (here we have as many as we want — the bigger question is what the browser will generate more efficient code for on the "truly target" architecture...)
  • host instruction alignment: on x86, as in TCI, instructions are not aligned at all; I am going to put into the buffer not instructions at all, but pointers to Binaryen library structures, so I'll say: 4 bytes
  • which optional instructions the backend can generate — we include everything we find in Binaryen, and let the accelerator break the rest down into simpler ones itself
  • the approximate size of the TLB cache that the backend requests. The fact is that in QEMU everything is serious: although there are helpers that perform load/store taking the guest MMU into account (and where would we be without it?), they keep their translation cache as a structure whose lookup can conveniently be inlined directly into the translation blocks. The question is which offset within this structure is handled most efficiently by a small and fast sequence of commands
  • here you can also tweak the purpose of one or two reserved registers, enable TB calls through a function, and optionally describe a couple of small inline functions like flush_icache_range (but this is not our case)
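For a rough idea, a hypothetical sketch of what such settings might look like — the macro names are the real ones QEMU expects from a TCG backend, but the values here are only illustrative of the choices described above:

```c
/* Hypothetical fragment of tcg/binaryen/tcg-target.h (illustrative values). */

/* Width of the "registers" we present to TCG; we can pick what we like. */
#define TCG_TARGET_REG_BITS       32

/* How many virtual registers to declare. */
#define TCG_TARGET_NB_REGS        16

/* The code buffer holds 4-byte pointers to Binaryen structures,
 * not machine instructions, hence 4-byte "instruction" units. */
#define TCG_TARGET_INSN_UNIT_SIZE 4
```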

The tcg-target.inc.c file is, of course, usually much larger and contains several mandatory functions:


  • initialization, including, among other things, constraints on which instruction can work with which operands. I brazenly copied it from another backend
  • a function that accepts a single internal bytecode operation
  • here you can also put auxiliary functions, and from here you can call static functions of tcg/tcg.c

For myself, I chose the following strategy: in the first words of each translation block I wrote four pointers: a start mark (some value in the neighborhood of 0xFFFFFFFF, determining the current state of the TB), the context, the generated module, and a magic number for debugging. At first the mark was set to 0xFFFFFFFF - n, where n is a small positive number, and it was incremented by 1 with each execution through the interpreter. When it reached 0xFFFFFFFE, compilation took place, the module was stored in the function table imported into a small "launcher", to which execution from tcg_qemu_tb_exec was transferred, and the module was removed from QEMU's memory.
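The warm-up counting can be sketched roughly like this (the names, struct layout, and the threshold n here are my illustration, not the actual code):

```c
#include <stdbool.h>
#include <stdint.h>

enum { TB_WARMUP_RUNS = 8 };  /* the "small positive number" n */

/* Hypothetical header placed in the first words of a translation block. */
typedef struct TBHeader {
    uint32_t mark;    /* state of the block, in the neighborhood of 0xFFFFFFFF */
    void *context;
    void *module;     /* BinaryenModuleRef in the real backend */
} TBHeader;

static void tb_header_init(TBHeader *tb)
{
    tb->mark = 0xFFFFFFFFu - TB_WARMUP_RUNS;
}

/* Called on each pass through the interpreter; returns true once the
 * block has become hot enough to be compiled to a WASM module. */
static bool tb_run_and_check(TBHeader *tb)
{
    if (tb->mark == 0xFFFFFFFEu)
        return true;      /* time to compile */
    tb->mark++;
    return false;         /* keep interpreting */
}
```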


To paraphrase the classics, "Crutch, how much in this sound is woven for the programmer's heart...". However, memory was leaking somewhere. And it was memory managed by QEMU! I had code that, when writing the next instruction (well, that is, pointer), deleted the one that had been at that place earlier, but it didn't help. Actually, in the simplest case, QEMU allocates memory at startup and writes the generated code there. When the buffer runs out, the code is thrown away and the next batch is written in its place.


After studying the code, I realized that the crutch with the magic number made it possible not to crash with heap corruption by freeing something wrong on an uninitialized buffer during the first pass. But who was rewriting the buffer, bypassing my function, after that? As the Emscripten developers advise when you run into a problem, I ported the resulting code back into a native application and set Mozilla Record-Replay on it... In general, in the end I realized a simple thing: for each block, a struct TranslationBlock with its description is allocated. Guess where... That's right, directly in front of the block, right in the buffer. Having realized this, I decided to ease off on the crutches (at least some of them): I simply threw out the magic number and moved the remaining words into struct TranslationBlock, building a singly linked list that can be traversed quickly when the translation cache is reset, to free the memory.


Some crutches remain: for example, tagged pointers in the code buffer. Some of them are simply BinaryenExpressionRef, i.e., they point to expressions that need to be placed linearly into the generated basic block; some are the condition for a transition between BBs; some are the transition targets. And then there are the blocks already prepared for Relooper, which need to be connected according to the conditions. To distinguish them, the assumption is used that they are all aligned to at least four bytes, so the lower two bits can safely be used for a tag — you just have to remember to strip it off when needed. By the way, such tags are already used in QEMU to indicate the reason for leaving the TCG loop.
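The tagging trick itself is a classic; a minimal sketch in C (the tag names are made up for illustration, only the two-low-bits idea is from the text above):

```c
#include <assert.h>
#include <stdint.h>

/* Pointers in the buffer are at least 4-byte aligned, so the two low
 * bits are free to encode what kind of entry this is. */
enum {
    TAG_EXPRESSION = 0,  /* plain BinaryenExpressionRef               */
    TAG_CONDITION  = 1,  /* condition for a transition between BBs    */
    TAG_TARGET     = 2,  /* where to go                               */
    TAG_MASK       = 3
};

static uintptr_t tag_ptr(void *p, unsigned tag)
{
    assert(((uintptr_t)p & TAG_MASK) == 0);  /* must be 4-byte aligned */
    return (uintptr_t)p | tag;
}

static void *untag_ptr(uintptr_t v)
{
    return (void *)(v & ~(uintptr_t)TAG_MASK);  /* strip the tag */
}

static unsigned ptr_tag(uintptr_t v)
{
    return (unsigned)(v & TAG_MASK);
}
```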


Using Binaryen


Modules in WebAssembly contain functions, each of which contains a body, which is an expression. Expressions are unary and binary operations, blocks consisting of lists of other expressions, control flow, etc. As I already said, control flow here is organized precisely as high-level branches, loops, function calls, etc. Arguments to functions are passed not on the stack, but explicitly, as in JS. There are also global variables, but I haven't used them, so I won't talk about them.


Functions also have local variables, numbered from zero, of the types int32 / int64 / float / double. The first n locals are the arguments passed to the function. Note that although everything here is not quite low-level in terms of control flow, integers still carry no signed/unsigned attribute: how a number behaves depends on the operation code.
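To illustrate (my example, not from this project): WASM has distinct signed and unsigned comparison opcodes, i32.lt_s and i32.lt_u, and the same 32-bit pattern orders differently under each. C can mimic both by reinterpreting the bits:

```c
#include <stdint.h>

/* Signed less-than on raw i32 bits, like WASM's i32.lt_s. */
static int i32_lt_s(uint32_t a, uint32_t b)
{
    return (int32_t)a < (int32_t)b;
}

/* Unsigned less-than on the same bits, like WASM's i32.lt_u. */
static int i32_lt_u(uint32_t a, uint32_t b)
{
    return a < b;
}
```

The pattern 0xFFFFFFFF is -1 to the signed comparison but 4294967295 to the unsigned one.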


Generally speaking, Binaryen provides a simple C API: you create a module, and in it you create expressions — unary, binary, blocks of other expressions, control flow, etc. Then you create a function, specifying an expression as its body. If, like me, you have a low-level transition graph, the relooper component will help you. As far as I understand, you can use high-level control flow within a block as long as it doesn't go beyond the block's boundaries — that is, you can make an internal fast path / slow path branch inside the inlined TLB cache processing code, but you cannot interfere with the "external" control flow. When you free a relooper, its blocks are freed; when you free a module, the expressions, functions, etc. allocated in its arena disappear.


However, if you want to interpret code on the fly without unnecessary creation and deletion of interpreter instances, it may make sense to put this logic into a C++ file and drive the library's entire C++ API from there directly, bypassing the ready-made wrappers.


So, in order to generate the code, you need to:


 //set global parameters (can be changed later)
 BinaryenSetAPITracing(0);

 BinaryenSetOptimizeLevel(3);
 BinaryenSetShrinkLevel(2);
 //create a module
 BinaryenModuleRef MODULE = BinaryenModuleCreate();
 //describe the types of functions (both created and called)
 helper_type = BinaryenAddFunctionType(MODULE, "helper-func", BinaryenTypeInt32(), int32_helper_args, ARRAY_SIZE(int32_helper_args)); //(int32_helper_args is bought ^W created separately)
 //construct a super-mega expression
 //... well, here you are on your own :)
 //then create a function
 BinaryenAddFunction(MODULE, "tb_fun", tb_func_type, func_locals, FUNC_LOCALS_COUNT, expr);
 BinaryenAddFunctionExport(MODULE, "tb_fun", "tb_fun");
 ...
 BinaryenSetMemory(MODULE, (1 << 15) - 1, -1, NULL, NULL, NULL, NULL, NULL, 0, 0);
 BinaryenAddMemoryImport(MODULE, NULL, "env", "memory", 0);
 BinaryenAddTableImport(MODULE, NULL, "env", "tb_funcs");
 //request validation and optimization if desired
 assert(BinaryenModuleValidate(MODULE));
 BinaryenModuleOptimize(MODULE);

...if I've forgotten something — sorry, this is just to give a sense of the scale; the details are in the documentation.


And now the hocus-pocus begins, something like this:


 static char buf[1 << 20];
 BinaryenModuleOptimize(MODULE);
 BinaryenSetMemory(MODULE, 0, -1, NULL, NULL, NULL, NULL, NULL, 0, 0);
 int sz = BinaryenModuleWrite(MODULE, buf, sizeof(buf));
 BinaryenModuleDispose(MODULE);
 EM_ASM({
   var module = new WebAssembly.Module(new Uint8Array(wasmMemory.buffer, $0, $1));
   var fptr = $2;
   var instance = new WebAssembly.Instance(module, {
     'env': {
       'memory': wasmMemory,
       //...
     }
   });
   //and now you have an instance!
 }, buf, sz);

To somehow connect the QEMU and JS worlds, and at the same time enter the compiled functions quickly, an array was created (a table of functions for import into the launcher), and the generated functions were placed in it. To compute the index quickly, the index of the translation block's zeroth word was originally used, but then the index computed that way simply began to be stored in a field in struct TranslationBlock.


By the way, the demo (for now with a murky license) only works properly in Firefox. The Chrome developers were somehow not prepared for someone wanting to create more than a thousand instances of WebAssembly modules, so they simply allocated a gigabyte of virtual address space to each...


That's all for now. There may be another article if anyone is interested. Namely, at the very least making the block devices work remains. It might also make sense to make the compilation of WebAssembly modules asynchronous, as is customary in the JS world, since there is still an interpreter that can do the job while the native module isn't ready.


Finally, a riddle: you have compiled a binary for a 32-bit architecture, but the code generated by Binaryen, through memory operations, reaches somewhere onto the stack or somewhere else in the upper 2 GB of the 32-bit address space. The problem is that, from Binaryen's point of view, this is an access at too large a resulting address. How to get around this?


Admin-style

I didn't test it in the end, but my first thought was: "What if I install 32-bit Linux?" Then the upper part of the address space would be occupied by the kernel. The only question is how much would be occupied: 1 or 2 GB.


Programmer-style (an option for practitioners)

Let's inflate a bubble in the upper part of the address space. I myself don't understand why it works — there should already be a stack there. But "we are practitioners: everything works for us, but nobody knows why...".


 //2gbubble.c
 //Usage: LD_PRELOAD=2gbubble.so <program>

 #include <sys/mman.h>
 #include <assert.h>

 void __attribute__((constructor)) constr(void)
 {
     assert(MAP_FAILED != mmap((void *)(1u << 31), (1u << 31) - (1u << 20),
                               PROT_NONE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0));
 }

...it is not compatible with Valgrind, though; but fortunately, Valgrind itself very effectively evicts everyone from there :)


Maybe someone will give a better explanation of how this code works ...
