Update README.md

Steven Massey 5 years ago committed by GitHub
parent 2943802354
commit 250cc025f7

@ -1,2 +1,151 @@
# m3
A high performance WebAssembly interpreter in C.
# M3/Wasm
This is a work-in-progress WebAssembly interpreter written in C using a novel, high-performance interpreter topology. The interpreter implementation is discussed in some detail below.
## Purpose
I don't know. I just woke up one day and started hacking this out. Some ideas:
* It could be useful for embedded systems.
* It might be a good warm-up, pre-JIT interpreter in a more complex Wasm compiler stack.
* It could serve as a Wasm validation library.
* The interpreter topology might be inspiring to others.
## Current Status
Its foundation is solid but the edges are still quite rough. Many of the WebAssembly opcodes lack an implementation. The compilation logic is a tad unfinished. Most execution trap cases are unhandled.
## Benchmarks
It's at least fleshed out enough to run some simple benchmarks.
Tested on a 4 GHz Intel Core i7 iMac (Retina 5K, 27-inch, Late 2014). M3 was compiled with Clang at -Os. The C benchmarks were compiled with GCC at -O3.
### Mandelbrot
C code lifted from: https://github.com/ColinEberhardt/wasm-mandelbrot
|Interpreter|Execution Time|Relative to GCC|Notes|
|-----------|--------------|---------------|-----|
|Life       |547 s         |133 x          |https://github.com/perlin-network/life|
|Lua        |122 s         |30 x           |This isn't Lua running some weird Wasm transcoding; it's a manual Lua conversion of the C benchmark, included as an additional reference point.|
|M3         |17.9 s        |4.4 x          | |
|GCC        |4.1 s         |               | |
### CRC32
|Interpreter|Execution Time|Relative to GCC|
|-----------|--------------|---------------|
|Life       |153 s         |254 x          |
|M3         |5.1 s         |8.5 x          |
|GCC        |600 ms        |               |
In general, the M3 strategy seems capable of executing code around 5-15X slower than compiled code on a modern x86 processor. (Older CPUs don't fare as well. I suspect branch predictor differences.) I have yet to test on anything ARM.
## Building
There's only an Xcode project file currently.
## M3: Massey Meta Machine
Over the years, I've mucked around with creating my own personal programming language. It's called Gestalt. The as-yet-unreleased repository will be here: https://github.com/soundandform/gestalt
Early on I decided I needed an efficient interpreter to achieve the instant-feedback, live-coding environment I desire. Deep (traditional) compilation is too slow and totally unnecessary during development. And, most importantly, compilation latency destroys creative flow.
I briefly considered retooling something extant. The Lua virtual machine, one of the faster interpreters, is too Lua centric. And everything else is just way too slow.
I've also always felt that the "spin in a loop around a giant switch statement" thing most interpreters do was clumsy and boring. My intuition said there was something more elegant and efficient to be found.
The structure that emerged I named a "meta machine" since it mirrors the execution of compiled code much more closely than the loop-based virtual machine.
### How it works
This is rough information that's probably not immediately clear without also referencing the source code.
#### Reduce bytecode decoding overhead
* Bytecode/opcodes are translated into more efficient "operations" during a compilation pass, generating pages of meta-machine code
* M3 trades a little space for time. Most opcodes map to 3 different operations depending on the source operands.
* Commonly occurring sequences of operations can be optimized into a "fused" operation. This *sometimes* results in improved performance.
  * The modern CPU pipeline is a mysterious beast.
* In M3/Wasm, the stack machine model is translated into a more direct and efficient "register file" approach.
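To illustrate the "3 operations per opcode" point, here's a minimal sketch of operand-source specialization for a non-commutative opcode (subtract). It assumes a single integer register `r0` and that each operand lives either in `r0` or in a stack slot. The names `operand_loc`, `sub_ss`, `sub_rs`, `sub_sr`, and `select_sub` are hypothetical and not taken from the M3 source.

```c
#include <stdint.h>

/* Where an operand lives when the opcode is compiled (hypothetical classification). */
typedef enum { IN_SLOT, IN_REG } operand_loc;

/* One specialized variant of "subtract": takes r0, the stack base, and two
   slot indices, and returns the new value of r0. */
typedef int64_t (*sub_op)(int64_t r0, const int64_t *sp, int32_t a, int32_t b);

/* Both operands in stack slots a and b. */
static int64_t sub_ss(int64_t r0, const int64_t *sp, int32_t a, int32_t b) {
    (void)r0;
    return sp[a] - sp[b];
}

/* Left operand in r0, right operand in stack slot b. */
static int64_t sub_rs(int64_t r0, const int64_t *sp, int32_t a, int32_t b) {
    (void)a;
    return r0 - sp[b];
}

/* Left operand in stack slot a, right operand in r0. */
static int64_t sub_sr(int64_t r0, const int64_t *sp, int32_t a, int32_t b) {
    (void)b;
    return sp[a] - r0;
}

/* Chosen once, during the compilation pass. With a single register, at most
   one operand can be IN_REG, which gives the three cases above. */
static sub_op select_sub(operand_loc left, operand_loc right) {
    if (left == IN_REG)  return sub_rs;
    if (right == IN_REG) return sub_sr;
    return sub_ss;
}
```

Because the variant is picked at compile time, the operation itself never has to decode where its operands live; that's the space-for-time trade.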
#### Tightly Chained Operations
* M3 operations are C functions with a single, fixed function signature.
  `void * Operation_Whatever (pc_t pc, u64 * sp, u8 * mem, reg_t r0, f64 fp0);`
* The arguments of the operation are M3's virtual machine registers
  * program counter, stack pointer, etc.
* The return value is a trap/exception and program flow control signal
* The M3 program code is traversed by each operation calling the next. The operations themselves drive execution forward. There is no outer control structure.
* Because operations end with a call to the next function, the C compiler will tail-call optimize most operations.
* Finally, note that x86/ARM calling conventions pass initial arguments through registers, and indirect jumps are branch predicted.
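The chaining described above can be sketched in standalone C roughly as follows. This is an illustration under assumptions, not M3's actual code: the `code_word` union, the `op_*` functions, and `run_or_demo` are hypothetical names, and `int64_t` and `double` stand in for `reg_t` and `f64`. Whether the final calls really become jumps depends on the compiler and optimization level.

```c
#include <stdint.h>
#include <stddef.h>

/* One word of meta-machine code: either an operation pointer or an immediate. */
typedef union code_word code_word;
typedef const code_word *pc_t;

/* Single, fixed signature shared by every operation. */
typedef const void *(*op_fn)(pc_t pc, uint64_t *sp, uint8_t *mem,
                             int64_t r0, double fp0);

union code_word {
    op_fn   op;      /* next operation to run */
    int32_t offset;  /* immediate: stack slot index */
};

/* r0 |= sp[offset], then chain to the next operation. */
static const void *op_or_rs(pc_t pc, uint64_t *sp, uint8_t *mem,
                            int64_t r0, double fp0) {
    r0 |= (int64_t)sp[pc->offset];
    /* Call in tail position: an optimizing compiler can emit a plain jump. */
    return pc[1].op(pc + 2, sp, mem, r0, fp0);
}

/* sp[offset] = r0, then chain to the next operation. */
static const void *op_store_r0(pc_t pc, uint64_t *sp, uint8_t *mem,
                               int64_t r0, double fp0) {
    sp[pc->offset] = (uint64_t)r0;
    return pc[1].op(pc + 2, sp, mem, r0, fp0);
}

/* Ends the chain. NULL means "no trap" in this sketch. */
static const void *op_end(pc_t pc, uint64_t *sp, uint8_t *mem,
                          int64_t r0, double fp0) {
    (void)pc; (void)sp; (void)mem; (void)r0; (void)fp0;
    return NULL;
}

/* Runs "stack[1] = a | stack[0]" with a preloaded into r0 and b into stack[0].
   There is no dispatch loop: the first operation is called and the rest follow. */
static uint64_t run_or_demo(uint64_t a, uint64_t b) {
    uint64_t stack[2] = { b, 0 };
    const code_word code[] = {
        { .op = op_or_rs },    { .offset = 0 },
        { .op = op_store_r0 }, { .offset = 1 },
        { .op = op_end },
    };
    const void *trap = code[0].op(code + 1, stack, NULL, (int64_t)a, 0.0);
    return trap == NULL ? stack[1] : 0;
}
```

Calling `run_or_demo(0x0F, 0xF0)` walks the three operations in order and leaves `0xFF` in the second stack slot.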
#### The End Result
Since operations all have a standardized signature and arguments are tail-call passed through to the next, the M3 "virtual" machine registers end up mapping directly to real CPU registers. So the machine isn't really virtual; instead, it's a meta machine with very low execution impedance.
|M3 Register                  |x86 Register|
|-----------------------------|------------|
|program counter (pc)         |rdi         |
|stack pointer (sp)           |rsi         |
|linear memory (mem)          |rdx         |
|integer register (r0)        |rcx         |
|floating-point register (fp0)|xmm0        |
For example, here's a bitwise-or operation in the M3 compiled on x86.
0x1000062c0 <+0>: movslq (%rdi), %rax ; load operand stack offset
0x1000062c3 <+3>: orq (%rsi,%rax,8), %rcx ; or r0 with stack operand
0x1000062c7 <+7>: movq 0x8(%rdi), %rax ; fetch next operation
0x1000062cb <+11>: addq $0x10, %rdi ; increment program counter
0x1000062cf <+15>: jmpq *%rax ; jump to next operation
#### Registers and Operational Complexity
* The conventional Windows calling convention isn't compatible with M3, as-is, since it only passes 4 arguments through registers. Applying the vectorcall calling convention should resolve this problem. (I haven't tried compiling this on Windows yet.)
* It's possible to use more CPU registers. For example, adding an additional floating-point register to the meta-machine did marginally increase performance in prior tests. However, the operation space increases exponentially. With one register, there are up to 3 operations per opcode (e.g. a non-commutative math operation). Adding another register increases the operation count to 10. However, as you can see above, operations tend to be tiny.
#### Stack Usage
* Looking at the above assembly code, you'll notice that once an M3 operation is optimized, it doesn't need the regular stack (no mucking with the rbp/rsp registers). This is the case for 90% of the opcodes. Branching and call operations do require stack variables. Therefore, execution can't march forward indefinitely; the stack would eventually overflow.
* Loops unwind the stack. When a loop is continued, the Continue operation returns, unwinding the stack. Its return value is a pointer to the loop opcode it wants to unwind to. The Loop operation checks for its pointer and responds appropriately, either calling back into the loop code or returning, passing the loop pointer through.
* Traps work similarly. A trap pointer is returned from the trap operation which has the effect of unwinding the entire stack.
* Returning from a (Wasm) function also unwinds the stack, back to the point of the Call operation.
* But, because M3 execution leans heavily on the native stack, this does create a runtime usage issue.
A conventional interpreter can save its state, break out of its processing loop and return program control to the client code. This is not the case in M3 since the C stack might be wound up in a loop for long periods of time.
With Gestalt, I resolved this problem with fibers (built with Boost Context). M3 execution occurs in a fiber so that control can effortlessly switch to the "main" fiber. No explicit saving of state is necessary since that's the whole purpose of a fiber.
More simplistically, the interpreter runtime can also periodically call back to the client. This is necessary, regardless, to detect hangs and break out of infinite loops.
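The loop-unwinding scheme from the bullets above can be sketched like this. It's a simplified illustration with a reduced operation signature (just `pc` and `sp`), and every name in it (`word`, `op_loop`, `op_accumulate`, `op_continue_if`, `op_end`, `sum_down`) is hypothetical rather than taken from the M3 source. The loop is identified by the address of its first body word; a NULL return is assumed to mean "finished without a trap".

```c
#include <stdint.h>
#include <stddef.h>

typedef union word word;
typedef const word *pc_t;
typedef const void *(*op_fn)(pc_t pc, int64_t *sp);

union word {
    op_fn       op;      /* an operation */
    const word *target;  /* immediate: which loop a Continue unwinds to */
};

/* Loop: runs the body on the native stack. If the body returns this loop's
   own pointer, the stack has unwound to here and the body is re-entered.
   Anything else (NULL, a trap, an outer loop's pointer) is passed through. */
static const void *op_loop(pc_t pc, int64_t *sp) {
    const void *r;
    do {
        r = pc->op(pc + 1, sp);
    } while (r == (const void *)pc);
    return r;
}

/* Loop body: sp[1] += sp[0]; sp[0] -= 1; then chain onward. */
static const void *op_accumulate(pc_t pc, int64_t *sp) {
    sp[1] += sp[0];
    sp[0] -= 1;
    return pc->op(pc + 1, sp);
}

/* Continue (conditional): returning instead of calling unwinds the stack. */
static const void *op_continue_if(pc_t pc, int64_t *sp) {
    if (sp[0] > 0)
        return pc->target;          /* unwind to the named loop */
    return pc[1].op(pc + 2, sp);    /* fall out of the loop */
}

static const void *op_end(pc_t pc, int64_t *sp) {
    (void)pc; (void)sp;
    return NULL;
}

/* Sums n + (n-1) + ... + 1 for n >= 1. Native stack depth stays constant
   no matter how many iterations run. */
static int64_t sum_down(int64_t n) {
    int64_t stack[2] = { n, 0 };
    const word code[5] = {
        { .op = op_loop },
        { .op = op_accumulate },      /* code + 1: body start, loop identity */
        { .op = op_continue_if },
        { .target = code + 1 },
        { .op = op_end },
    };
    (void)code[0].op(code + 1, stack);
    return stack[1];
}
```

The same return path could carry a trap pointer or a "yield to the client" signal, since `op_loop` passes through anything that isn't its own pointer.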
### Thoughts & Questions about WebAssembly
There are some really fundamental things about WebAssembly that I don't understand yet. There are a bunch of mysteries about this thing that don't seem sufficiently explained in plain English anywhere. Can someone orient me?
Some examples:
* Linear memory: Modules seem hardcoded to request 16MB of memory. What? Why? As the host, where can I safely allocate things in linear memory? Who's in charge of this 16MB?
* Traps: The spec says "Signed and unsigned operators trap whenever the result cannot be represented in the result type." Really? That's cool, but how can the behavior of source languages be modeled (efficiently)? C integer operations wrap and sometimes that's what you want.
