ISTG, gcc spills registers to the stack more often than a glass without a bottom.
Like seriously, is it really necessary to push and use r4 when r2 and r3 are not used? Or to push r4 and pop it back without it ever being used?
Bleh.
Thumb has no imm8 encoding for the AND and ORR opcodes, so this change should bring some speedup: no register gets trashed to hold the constant (which saves an extra register being PUSHed to and POPped from the stack), and no LDR from the constant pool is necessary.
This also speeds up the x86 target by a great margin for some reason.
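A minimal sketch of the AND/OR to ADD/SUB swap, assuming the state of the affected bit is already known at that point in the code. `FLAG_BIT` and both function names are hypothetical and only here for illustration. The asserts spell out the assumption the trick depends on, and whether gcc actually emits an ADDS/SUBS with an immediate depends on the target and flags.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bit, small enough to fit an 8-bit immediate. */
#define FLAG_BIT 0x80u

/* Set FLAG_BIT with ADD instead of OR.
 * Assumption: the bit is known to be clear, so no carry can occur. */
static inline uint32_t set_flag_add(uint32_t x)
{
    assert((x & FLAG_BIT) == 0u);
    return x + FLAG_BIT; /* same result as x | FLAG_BIT under the assumption */
}

/* Clear FLAG_BIT with SUB instead of AND.
 * Assumption: the bit is known to be set, so no borrow can occur. */
static inline uint32_t clear_flag_sub(uint32_t x)
{
    assert((x & FLAG_BIT) != 0u);
    return x - FLAG_BIT; /* same result as x & ~FLAG_BIT under the assumption */
}
```

If the assumption does not hold, the ADD carries into the next bit (or the SUB borrows from it), so this is only safe where the bit state is guaranteed.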
It was present in the prerelease alpha version, but when porting to the RP2040 it was converted to double indirection, as that worked with the shitty LRU algo that I used at the time.
However, double indirection is not only slower because it requires an extra load instruction, it's also slower because it thrashes the cache. On Windows, I get a 10% performance decrease with double indirection, which is pretty wild.
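A rough sketch of the difference, not the actual data structures from this code: all table names, sizes and functions below are made up for illustration. The single-indirection lookup needs two loads (pointer, then byte), while the double-indirection one needs three (index, pointer, then byte) and touches one more table, which is where the extra cache pressure comes from.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_SLOTS 4u   /* hypothetical number of pages/slots */
#define SLOT_SIZE 16u  /* hypothetical slot size in bytes */

/* Hypothetical backing storage. */
static uint8_t storage[NUM_SLOTS][SLOT_SIZE];

/* Single indirection: the table points straight at the data. */
static uint8_t *direct_table[NUM_SLOTS];

/* Double indirection: the table holds an index into a second table of pointers. */
static uint8_t slot_index[NUM_SLOTS];
static uint8_t *slot_ptr[NUM_SLOTS];

/* Identity mapping, so both lookups resolve to the same byte. */
static void init_tables(void)
{
    for (size_t i = 0; i < NUM_SLOTS; i++) {
        direct_table[i] = storage[i];
        slot_index[i] = (uint8_t)i;
        slot_ptr[i] = storage[i];
    }
}

static uint8_t read_direct(size_t page, size_t off)
{
    return direct_table[page][off];         /* loads: pointer, byte */
}

static uint8_t read_double(size_t page, size_t off)
{
    return slot_ptr[slot_index[page]][off]; /* loads: index, pointer, byte */
}
```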
That said, gcc makes me pull my hair out with its aggressive, nonsensical spilling to the stack, and I have to resort to this nonsensical attribute spam just to get it to stop spilling my registers :(
- Added extra comments for spicy code (to be refactored)
- Made code riskier (it relies on assumptions, but this is perfectly fine, edge cases are accounted for)
- Made code more Thumb-friendly (replaced bitwise (AND/OR) ops with a constant by arithmetic (ADD/SUB) ops with a constant, saving on register pressure and constant loads)
- Tuned code for gcc without inline ASM for better register allocation, and thus faster code due to fewer PUSHes/POPs
- Introduced an off-by-one error intentionally (normally not a problem, and makes a lot of sense for an application-specific emulator core like this)
- Refactored code for readability (duplicated the same code under a different function name)
- Gated unused functions to save code space (where --gc-sections is not used)
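The gating of unused functions could look roughly like this. The macro `CORE_ENABLE_DEBUG_HELPERS` and both functions are hypothetical names, not taken from the actual code. The idea is that a gated function is never compiled, so it costs no code space even when the build does not use `-ffunction-sections` together with `-Wl,--gc-sections`.

```c
#include <stdint.h>

/* Hypothetical feature switch: pass -DCORE_ENABLE_DEBUG_HELPERS=1 to include the helper. */
#ifndef CORE_ENABLE_DEBUG_HELPERS
#define CORE_ENABLE_DEBUG_HELPERS 0
#endif

/* Always compiled: stands in for a function the core actually uses. */
static inline uint8_t core_low_byte(uint32_t v)
{
    return (uint8_t)(v & 0xFFu);
}

#if CORE_ENABLE_DEBUG_HELPERS
/* Compiled only when the switch is on, so it takes no space otherwise. */
uint32_t core_debug_checksum(const uint8_t *buf, uint32_t len)
{
    uint32_t sum = 0u;
    for (uint32_t i = 0u; i < len; i++) {
        sum += buf[i];
    }
    return sum;
}
#endif
```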