The JIT compiler for UAE Introduction ============ UAE has always been hampered by being somewhat lacking in CPU performance. Undoubtedly, it was amazingly fast for what it did, but it never really was an alternative to a decked out, no expense spared Amiga. The Just In Time compiler should help change that. If you just want to know how to *use* it, rather than how it works, stop right now and read the file README.umisef. Yep, that name is obscure; I just wanted to make sure you at least look at this one once ;-) Normal Emulation ================ In order to understand how the JIT compiler works, and especially how it speeds up emulation, one has to first understand how the normal UAE CPU emulation works. As an example, here is a small 68k routine that copies a block of memory, from A1 to 0x42000000. mov.l 0x42000000,A0 loop: mov.l (A1+),(A0+) cmp.l 0x42010000,A0 bne loop The syntax might be a bit off, but you get the idea. The normal emulation works one instruction at a time. It maintains a pointer to where the next instruction can be found in the host's memory (which means you can only execute stuff from "real" memory --- can't execute the CIA registers, for example ;-). As a first step, the emulation loads the 16 bit value that pointer points to, i.e. the opcode. The opcode is then used as an index into a lookup table that contains 65536 pointers to functions; The functions pointed to are the respective handlers for the instructions matching the opcode. Next comes the actual call of that routine. In order to make life easier (and UAE a managable size), most handler routines handle more than one opcode --- for example, our mov.l (A1+),(A0+) would be handled by the same routine as mov.l (A7+),(A3+). In order to handle all the different opcodes for which they can be called, most handler routines extract the numbers of the registers used from the opcode (which gets that passed as an argument). In order to do so, they have to shift the bits in the opcode around a little, and then mask out the needed bits. That then gives indices into the array that holds the current values for the registers; If a register is read from or written to, the indices are used to calculate the actual address to read from or write to. OK, almost done --- one rather small part of emulating an instruction is actually *doing* it. The mov.l instructions up there basically just realize that they should store their source in their destination, the cmp.l subtracts the immediate from the register's value, and the bne looks at the flags and decides whether or not to change the instruction pointer. Last but not least comes the cleanup --- if the instruction is supposed to produce flags, these need to be calculated; Also, the instruction pointer gets moved to the next instruction. Then any changed registers, any changed flags and changed state information is written back to the respective memory locations, the handler routine executes a return, and the whole cycles starts again. Bottlenecks =========== This way of doing it has several big performance drawbacks. At the very start, the instruction fetch and handler lookup look sort of like this: opcode=*instruction_pointer; handler=handlers[opcode]; call handler; That's two memory accesses before we even reach the "call", and it's a straight dependency chain --- each instruction depends on the result of the previous. Hopefully, all those memory locations are in the L1, but even then, the latency is 3 cycles each time (on a PII), for 6 cycles minimum so far. Another real biggie is the call instruction --- it is always the *same* instruction, but the target changes around all the time. This means that the CPU has little hope of ever predicting the target correctly, leading to a pipeline stall. Uh-oh! Pipeline stalls are *really* expensive on the PII, to the tune of 10 or so cycles. The total is up to 16. Most 68k instructions read at least one register, so most handler start out with a shift, an AND, and then an indexed memory access. Of course, they depend on each other, so the minimum time for this is another 5 cycles before the register value can be used. Total 21, and we haven't done a thing yet. Then most instructions create flags. For many of the most often used instructions, the x86 instruction that does the calculation will produce the right flags (and for mov's, an extra TEST x86 instruction will do the trick). UAE is clever to pick these up --- but isn't clever in the way it does it. It uses pushf --- which translates in 16 micro-ops and takes at least 7 cycles. There is a better way (using LAHF and TESTO), but unless one is very careful, that way can easily lead to a so called "partial register stall" --- which just means that we get screwed in a whole new and interesting way, but screwed we get, for another 7 cycles or so. We are now up to 28, just for the overhead of a typical instruction. [ Update: The latest versions of UAE-JIT contains patches that overcome some of the flag-handling problems of the normal emulation. By and large, the above is still true, though] Considering the need to increment the instruction pointer, write back the registers and flags to memory, and the "ret", and also considering that the handlers are allowed to clobber any flags they like, and that thus there is an extra memory read at the very start of the whole thing (rereading the instruction pointer from memory), the total overhead is something lik 35 cycles. That easily dwarfs the time it actually takes to *do* whatever the instrutions are supposed to do (and 35 is assuming the ideal case, where all the memory accesses are satisfied from L1). There is one more performance killer --- 68k instructions that access memory (and let's face it, that's most of them ;-). Unfortunately, the 68k uses memory mapped I/O, so an innocent mov.l (A1+),(A0+) could access just about *anything*, depending on the values of A1 and A0. This means that each memory access has to go through several steps --- first, the address gets shifted 16 bits to the right and the result taken as an index into a lookup table. That yields a pointer to a structure; In that structure, the addresses of handler routines for reading and writing bytes, words or longwords can be found. So one more memory access gets the address of an appropriate handler, which then gets called (yep, a calculated subroutine call for every single memory access!). The handlers for "real" memory then mask the address, add it to the base address for their memory segment, and perform the requested operation. This takes a *lot* of time. It also (potentially) clobbers many registers, so that the instruction handler routine often has to temporarily store stuff on the stack to keep it past the memory access. Take all that together, and you get to a pretty constant average of 70 or so x86 cycles per emulated m68k instruction. If it wasn't for great fast x86 chips, this would be really bad. As it stands, it's just annoying. Solving the Bottlenecks ======================= Right! Time to solve some of these problems. Let's start at the top: The 2 memory lookups and the calculated call at the start of emulating each instruction will always give the exact same results for any given instruction. So the obvious thing to do is to do it all once, and then keep the state around. At this point, the concept of a "block" needs to be introduced. A "block" is a bunch of consecutive instructions that will *always* execute all together, one after the other, as long as the first one is executed. In the example, there are two interesting blocks. Here is the example again: mov.l 0x42000000,A0 loop: mov.l (A1+),(A0+) cmp.l 0x42010000,A0 bne loop The first interesting block consists of all 4 instructions. The second interesting block starts at "loop" and consists of three instructions. The important thing here is that the "bne loop" instruction definitely ends a block --- what instruction comes after it depends on the state of the flags, and thus any "find out once and keep the info" scheme cannot work it out. One of the ideas behind the x86 compiler is that it operates on blocks. So whenever a block ends, we *then* check whether there is some pretranslated stuff for the block starting with the next instruction. If yes, that pretranslated stuff (a whole block) is executed. If not, the normal emulation is done in the normal way until it hits an end-of-block instruction. This means that for each block, the cache of pretranslated stuff is only checked once, reducing the extra overhead introduced by those checks. However, while doing the normal emulation, we also keep track of what locations the opcodes in the block were fetched from. A simple compiler can do something rather ingenious --- whenever the normal emulation finishes a block, it "compiles" it into a series of native x86 instructions along the lines of mov $opcode1,%eax call $handler1 mov $opcode2,%eax call $handler2 mov $opcode3,%eax call $handler3 and so on (the opcodes get loaded into %eax so that the handler routines can shift and mask them like they expect to). This has several advantages: * The whole get-instruction-pointer-then-fetch-opcode-and-look-up-handler stuff is done once, during compile time, and then is encoded directly in the x86 instruction stream. Saves about 9 cycles per instruction. * The calls are no longer calculated, but rather constant, and each 68k instruction has a call of its own --- which will give the host CPU ample opportunity to correctly predict the targets, thus saving another 10 or so cycles per instruction. "Compiling" blocks in this way (and recognizing blocks in the first place) creates another opportunity to save a few cycles. In the example code, the mov.l (A1+),(A0+) is *always* followed by the comparison instruction. The comparison doesn't use any flags, and overwrites any flags the mov.l might have set or cleared. In other words, the flags generated by the mov.l can never make any difference. Because this can be determined during compilation time, and because the particular mov.l gets compiled into a particular call, we can call a different handler for it --- one which does the same thing, but just doesn't bother generating any flags. Of course, those handlers need to be generated somehow, but that's trivial. And every time generating the flags can be avoided by this, it save another 7 or more cycles. One more problem --- what happens when the 68k overwrites some memory we have pretranslated? Well, Motorola did the right thing, and didn't fudge some silly logic into their processors that transparently detects this situation. Instead, if you want the CPU to acknowledge new code, you need to explicitly flush the icache. And every time the emulated 68020 does that, we simply throw away all translations and restart with a blank slate. Implementing these rather simple compilation options will result in 30-100% speedup, depending on the task. Getting more aggressive ======================= Once we have taken that first step, however, and have all the infrastructure for detecting and handling blocks in place, it becomes soooo much easier to get a little bit more aggressive, step by step. First, instead of outputting the mov $opcode_x,%eax call $handler_x sequence, we can directly output code to just do what the instruction is supposed to do. And we can do this step by step, because if we can't handle a particular instruction, there is always the simple mov/call pair to fall back on. Of course, the best choices for opcodes to do this for first are the opcodes that get executed the most. And because at compile time, we already know the exact opcode, we can work out what registers get used *then*, and can simply address them directly (rather than doing the shift-mask-and-address_indexed routine of the normal handler routine). And of course it also saves the call/return overhead, not to mention that some of the hand-designed x86 code tends to be a fair bit more efficient than what gcc creates from the portable C code. A word about how to do the translating --- UAE has a wonderful framework in place to generate the handler routines for the normal emulation. Simply copying that framework allows us to create translation routines in much the same way. Of course, instead of actually *doing* stuff, those routines must themselves *generate code* that will eventually do stuff, and this extra level of indirection (writing a program which outputs routines which generate routines which emulate 68k instructions) can be a bit mind-twisting at times, but it definitely beats the alternatives. And here is another word of hard earned wisdom (aka "Tried it, and after a few days, had myself thoroughly backed into a corner, so had to scrap it" ;-): Keep it simple. Don't try to cram too much into one layer, rather introduce more layers, just as long as you can keep each one simple and manageable. In the UAE JIT compiler, the translation routines generated don't actually contain code to emit x86 code. Instead, they contain lots of calls along the lines of mov_l_rr(src,dst); cmp_l_ri(dst,1889); and so on. Here is an almost-actual example (I removed some stuff that we haven't dealt with yet, and formatted more nicely): { uae_s32 srcreg = imm8_table[((opcode >> 1) & 7)]; uae_s32 dstreg = (opcode >> 8) & 7; uae_u8 scratchie=S1; int src = scratchie++; mov_l_ri(src,srcreg); { int dst=dstreg+8; int tmp=scratchie++; sign_extend_16_rr(tmp,src); add_l(dst,tmp); mov_l_rr(dstreg+8,dst); } } That's the code that would get called for "mov.w 3,a0". The way it works is like this --- the routine targets an x86-like processor, but one with 32 registers. The first 16 of these are mapped to D0-D7 and A0-A7; The next three hold flags and the instruction pointer, and then there are 13 scratch registers, S1 to S13. First, the routine works out what register and immediate to use. It then calls mov_l_ri(), which will generate code to move the immediate into a register (in this case, a scratch register). That value then gets sign extended (pointless in this case, but that's what you get for generating these routines with a program...) into yet another scratch register, then adds the immediate to the register, and moves the result back into the register. But all the translation routine does is to call other routines which will (eventually) generate the code. Writing this sort of code is manageable. In the next layer down, the 32 register x86 gets mapped onto the real, 7 register PII, i.e. register allocation takes place. Also, some optimization takes place here --- in the example above, the last mov_l_rr() gets ignored, because its source and destination are the same. Also, if an immediate is moved into a register (and that register thus has a known value, like "src" in the example), more optimization is possible, and is indeed done. As a result of that optimization, the superfluous sign extension never actually makes it into x86 code for this example. Last but not least, this layer keeps track of what 68k registers need to be written back to memory to complete the emulated instruction. On the lowest layer, actual bits and bytes are written to memory to encode raw x86 instructions. If you happen to have an x86-like processor that just happens to use different instruction encodings, this layer is all you'd need to change. To actually port to a different target processor, such as a PPC or StrongArm, you'd have to take care of some more details related to the way x86 instructions do and don't set flags, but you should still benefit greatly from the clear separation of layers (hint, hint !-). Each of those layers is fairly complex (and the ^%&$$#$%&*& x86 certainly doesn't make it any easier; Orthogonality, what's that?), but I can wrap my mind around each one individually. Trying to do it all in one go would be far beyond me. Using these techniques, it is fairly simple to cover 90+ percent of all the instructions that get executed in a typical Amiga application, and the resulting speedup is rather nice. Beyond Individual Instructions ============================== When 90+ percent of instructions are covered, it's fairly likely that a real translated instruction is followed by another. In that case, there really isn't much point in writing back all the 68k registers at the end of the first, only to reload them at the start of the second. So instead of writing back between any two instructions, the whole state of the middle layer (which handles register allocation) is simply kept alive. Of course, we still need to write it all back each time we fall back on the mov/call method, because those handlers expect the 68k registers/flags/other state to live in certain places in memory. That's why it can pay off to generate translation routines for rarely used instructions --- the average lifetime of a middle-layer state might increase dramatically. Oh, and of course at the end of a block everything needs to be written back. Memory Access ============= If you do all that, you run up against the last remaining performance killer: Memory access. Nothing will screw up your performance like the occasional call to a C function that will not only take several cycles for itself, but will also potentially clobber all your registers --- meaning you have to store it all back before the call, and start from scratch afterwards. The *vast* majority of memory accesses are to "real" memory. The JIT compiler has two methods for speeding these up. The first method is simple, safe and a good bit faster than the default memory access (see above). For this method, instead of having a lookup table with 65536 pointers to structures that tell us how to get to memory, we have another table with 65536 values. For memory areas which are *not* real memory, this table holds the same info as the other one, except that the LSB is set. However, for memory areas which *are* real memory, this table holds the difference between the address the x86 sees and the address the emulated Amiga sees. So a memory access looks up the content of the appropriate slot in the table. If the LSB is set, the usual song and dance has to be done. But if the LSB is clear, we simply add the Amiga address to what we got from the table, and voila, there is the x86 address of real memory, from which we can load (or to which we can write) our value, without calling any C routines, and without clobbering registers. The other method is much more radical. The Amiga, while it has an address space of 4GB, only seems to use less than 1.5G of that (From 0x00000000 to 0x60000000). Linux, at least in the 2.4 flavour, allows user programs pretty much full control over 3GB; But it, too, tends to leave 0x50000000 to 0xbfff0000 unused. In order to make things fast, we simply map all the real memory from the Amiga into Linux's address space, with a constant offset. In particular, the Amiga address X ends up at Linux address 0x50000000+X; So when a translation routine wants to generate a memory access at address Y *and* knows that this access goes to real memory, then it can simply access the memory at 0x50000000+Y. The magic behind the scenes that is needed to keep those mappings accurate is a bit complicated, mainly because the Amiga tends to overmap some areas over and over, but none of that is really worrysome. The much bigger problem is "How the heck do we know whether a particular instruction accesses real mem or I/O?". And here comes the major hand-waving: We guess! We are making the assumption that any given instruction will either always access real memory, or always access I/O. In other words, we expect this to be consistent. Based on those assumptions, during the normal emulation (i.e. the data-gathering phase for the compiler), we simply set a flag to 0 before emulating each instruction; All the I/O memory handlers contain code that set this flag to non-zero, so if the instruction completes and the flag is still 0, then that instruction didn't access any I/O. During the data gathering, the value of this flag is saved after each instruction, and thus when it comes to compiling, the translation routines know whether or not the instruction they are translating can use the aggressive method for memory access. Alas, these assumptions don't always hold. That's why there are many options for forcing memory access to the slower but safer method. If the assumptions don't hold, you will notice --- because there is no safety net, you *will* get a lovely core dump. (Theoretically, it would be possible to write a SIGSEGV handler that works out what the instruction that failed tried to do, and calls the appropriate I/O handler; I have done that in the past for the Alpha, but the idea of trying to decode x86 instructions fills me with dread and fear. Any takers? ;-). [Update: Hey, that was easier than I imagined --- such a SIGSEGV handler is now in place, and apparently working just fine. Of course, it isn't completely general, but instead only handles the instructions I use for direct memory access....] Conclusion ========== UAE can now emulate the CPU at considerably more than 060 speeds. That should do for a while --- although I am sure that further speedups are still possible. Patches welcome ;-)