|
Post by reapersms on Jan 29, 2023 19:20:03 GMT -5
Symptoms: The UI is very unresponsive, and if you are able to get into the game, you run around at warp speed, jumps are instant, camping is sped up, etc.

People who would run into this: Nominally anyone with a CPU faster than about 4.2 GHz, with some variation between AMD and Intel.

Root cause: eqgame.exe, as part of initializing its timers, asks EQGraphicsDX9.dll for the CPU speed. EQGraphics cut some corners and used entirely 32-bit math when calculating the speed over a 1 second period, so it returns drastically wrong values past 4.294 GHz or so. At 4.7 GHz, it means eqgame thinks time is progressing about 11x faster than it should.

The fix: By patching 60ish bytes of EQGraphicsDX9.dll, the math can be corrected to handle things correctly until CPUs hit 17 GHz. An IDA DIF file is attached with the particulars; applying it is a bit of an exercise for the reader right now, with a hex editor or some tool that can apply those. A lengthy explanation and line-by-line description of what exactly the patch does follows.

Attachments: EQGraphicsDX9.dll.dif (1.54 KB)
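For the curious, the 11x figure falls straight out of the 32-bit wrap. A quick back-of-the-envelope sketch (just an illustration, not code from either binary):

    #include <stdint.h>
    #include <stdio.h>

    /* One second of timestamp ticks at 4.7 GHz no longer fits in 32 bits, so the
       "ticks per millisecond" value handed back to eqgame wraps and comes out far
       too small -- which makes the game think time is passing much faster. */
    int main(void)
    {
        uint64_t real_ticks_per_sec = 4700000000ULL;           /* 4.7 GHz part */
        uint32_t wrapped = (uint32_t)real_ticks_per_sec;       /* mod 2^32 = 405032704 */

        uint32_t reported_per_ms = wrapped / 1000;             /* what eqgame is told: 405032 */
        uint64_t real_per_ms = real_ticks_per_sec / 1000;      /* what it should be: 4700000 */

        printf("reported %u ticks/ms vs real %llu ticks/ms -> time runs ~%.1fx fast\n",
               reported_per_ms, (unsigned long long)real_per_ms,
               (double)real_per_ms / (double)reported_per_ms); /* about 11.6x */
        return 0;
    }

Anything at 4.294 GHz or below still fits in 32 bits over that one second window, which is why slower machines never see it.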
|
|
|
Post by reapersms on Jan 29, 2023 19:57:13 GMT -5
During startup, eqgame.exe loads EQGraphicsDX9.dll into memory and asks it for a "timestamp ticks per millisecond" value via the exports EQG_GetCpuSpeed2 and EQG_GetCpuSpeed3. These two functions work nearly identically to each other: they poll one of the Windows time functions until they see it change value, grab a baseline timestamp counter value, poll the time function again until a second has passed, grab a new timestamp value, and return (new - old) / 1000. They also have some disabled debug prints in there, but those are leftover cruft.

The difference between the two is which Windows time function they use -- GetCpuSpeed2 uses the Windows multimedia timer (timeGetTime), which has millisecond precision but not necessarily millisecond accuracy, while GetCpuSpeed3 uses the _time32 function, which has second precision and a somewhat variable accuracy.

A rough C reconstruction of EQG_GetCpuSpeed2 looks something like this:

    #include <stdint.h>
    #include <stdio.h>
    #include <intrin.h>    // __rdtsc
    #include <windows.h>   // timeGetTime (winmm)

    uint32_t EQG_GetCpuSpeed2()
    {
        uint32_t time_begin, time_base, timestamp_base, timestamp_end;
        uint32_t rate;
        char Buffer[64];

        // spin until the millisecond counter moves, so we start near a transition
        time_begin = timeGetTime();
        do {
            time_base = timeGetTime();
        } while ((time_base - time_begin) <= 1);

        // only the low 32 bits of the timestamp counter are kept
        timestamp_base = (uint32_t)__rdtsc();

        // spin for 1 second
        while (timeGetTime() - time_base <= 1000)
            ;
        timestamp_end = (uint32_t)__rdtsc();

        // 32-bit subtract and divide -- this is where fast CPUs wrap
        rate = timestamp_end - timestamp_base;
        rate /= 1000;

        // disabled debug print, leftover cruft
        sprintf(Buffer, "TimeGetTime-cpuSpeed: %d\n", rate);

        return rate;
    }
The structure is a little messy, as it's back-constructed from the raw assembly. The first loop waits until the millisecond counter has advanced by at least 2, so it knows it's relatively close to a transition. The second loop waits for 1 second. Where the problem comes in is that it only saves the lower 32 bits of the timestamp counter, and the division code only deals with 32-bit values.

The fix effectively changes timestamp_base and timestamp_end into 64-bit values, and splits the math up into rate = (timestamp_end - timestamp_base) / 4; rate /= 250. I was able to do this without shifting any of the other code around, moving any of the function calls, or requiring any more stack space. The assembly view of the EQG_GetCpuSpeed2 part of the patch looks like this, original on the left, patched instructions to the right of the '|', minus a couple of NOPs where some things got shorter:

    .text:10011CD6 054 83 F8 01        cmp  eax, 1
    .text:10011CD9 054 7E F5           jle  short loc_10011CD0      ; spin until we see it tick from 1->2
    .text:10011CDB 054 33 DB           xor  ebx, ebx
    .text:10011CDD 054 89 5C 24 10     mov  [esp+54h+var_44], ebx   | rdtsc
    .text:10011CE1 054 0F 31           rdtsc ; grab the TSC         | mov  [esp+54h+var_40], edx
    .text:10011CE3 054 89 44 24 10     mov  [esp+54h+var_44], eax
    .text:10011CE7
    .text:10011CE7     loc_10011CE7:                                ; CODE XREF: EQG_GetCpuSpeed2+30↓j
    .text:10011CE7 054 FF D7           call edi                     ; timeGetTime
    .text:10011CE9 054 2B C6           sub  eax, esi
    .text:10011CEB 054 3D E8 03 00 00  cmp  eax, 1000               ; spin for 1 second
    .text:10011CF0 054 7E F5           jle  short loc_10011CE7
    .text:10011CF2 054 89 5C 24 0C     mov  [esp+54h+var_48], ebx   | rdtsc
    .text:10011CF6 054 0F 31           rdtsc                        | sub  eax, [esp+54h+var_44]
    .text:10011CF8 054 89 44 24 0C     mov  [esp+54h+var_48], eax   | sbb  edx, [esp+54h+var_40]
    .text:10011CFC 054 8B 4C 24 0C     mov  ecx, [esp+54h+var_48]   | shrd eax, edx, 2
    .text:10011D00 054 2B 4C 24 10     sub  ecx, [esp+54h+var_44]   | mov  ecx, eax
    .text:10011D04 054 B8 D3 4D 62 10  mov  eax, 10624DD3h          ; / 1000
    .text:10011D09 054 F7 E1           mul  ecx
    .text:10011D0B 054 8B F2           mov  esi, edx
    .text:10011D0D 054 C1 EE 06        shr  esi, 6                  | shr  esi, 4
    .text:10011D10 054 56              push esi
    .text:10011D11 058 8D 54 24 18     lea  edx, [esp+58h+Buffer]
    .text:10011D15 058 68 14 2F 13 10  push offset aTimegettimeCpu  ; "TimeGetTime-cpuSpeed: %d\n"
RDTSC leaves the 64-bit timestamp result spread across EDX and EAX. The C compiler explicitly wrote some zeros into the stack that would get completely obliterated anyway, which left room in the code to save the upper half of the timestamp, and I used the first dword of the debug text buffer to store it. The first change just swaps the zero store and the timestamp read, and changes the zero store into a store of the upper half.

The second patch takes advantage of the compiler wasting some time and space stuffing the later timestamp value through the stack. The RDTSC is moved up an instruction, the real 64-bit difference is calculated into EDX:EAX by subtracting the baseline value, that difference is shifted right 2 bits for a divide by 4, and the lower half is put into ECX to flow right into the divide-by-1000 section.

That is that strange multiply-and-shift chunk. The compiler takes advantage of knowing the divisor at compile time, and turns d / 1000 into a multiply by 10624DD3h (roughly 2^38 / 1000), keeping the upper 32 bits of the 64-bit product and shifting them right another 6 bits. The nasty details of how it comes up with those numbers can be found in chapter 8 of the Athlon Optimization Guide. As it so happens, the only difference between a divide-by-1000 and a divide-by-250 with that constant is the shift amount (6 bits versus 4). By doing the divide-by-4 with the 64-bit shift first, the intermediate value only overflows 32 bits for the later divide once it clears 2^34, a bit over 17 billion ticks per second.

Those are the patch entries for 110DD through 1110F; the 11151 through 1118F range is the same thing, but for EQG_GetCpuSpeed3. The salient differences between them are a slightly different stack offset within the function, and that the _time32 version cleared the existing space with immediate 0 writes, which were much larger instructions, so there are several more NOPs dropped in.
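For anyone who would rather read C than assembly, here is roughly what the patched math works out to, with the reciprocal-multiply trick spelled out. This is just a sketch of the technique -- the function names are made up for this post, and the actual fix is the byte patch above:

    #include <stdint.h>
    #include <stdio.h>

    /* The compiler's divide-by-constant: take the high 32 bits of d * 10624DD3h,
       then shift -- 6 more bits for a divide by 1000, 4 for a divide by 250. */
    static uint32_t div1000_magic(uint32_t d)
    {
        return (uint32_t)(((uint64_t)d * 0x10624DD3u) >> 32) >> 6;
    }

    static uint32_t div250_magic(uint32_t d)
    {
        return (uint32_t)(((uint64_t)d * 0x10624DD3u) >> 32) >> 4;
    }

    /* What the patched sequence computes: keep all 64 bits of both timestamps,
       divide the difference by 4 up front (the shrd eax, edx, 2), then feed the
       32-bit result through the existing divide with the shift changed to /250.
       (diff / 4) / 250 == diff / 1000, but nothing wraps until diff clears 2^34. */
    static uint32_t patched_rate(uint64_t timestamp_base, uint64_t timestamp_end)
    {
        uint32_t quarter_diff = (uint32_t)((timestamp_end - timestamp_base) >> 2);
        return div250_magic(quarter_diff);
    }

    int main(void)
    {
        /* 4.7 GHz worth of ticks over one second: the old code wrapped, this doesn't */
        printf("patched rate: %u ticks/ms\n", patched_rate(0, 4700000000ULL));

        /* spot-check that the 10624DD3h constant really divides for both shift amounts */
        for (uint32_t d = 0; d < 5000000; d += 997)
            if (div1000_magic(d) != d / 1000 || div250_magic(d) != d / 250)
                printf("mismatch at %u\n", d);
        return 0;
    }

That prints 4700000 ticks/ms for a 4.7 GHz part, which is what eqgame should have been seeing all along.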
|
|
|
Post by reapersms on Jan 29, 2023 20:25:36 GMT -5
Historical Timestamp Counter Trivia, and a hypothesis as to why this might not come up on Intel as often, and why it's not quite the same as the other historical AMD issues with it:
Back in the dark ages, when cellphones were beefy enough to kill a man, if you wanted to know how long a sequence of code took, you either had to look at some system timers and do some math to get an approximate number, or spend a long time manually tallying up each instruction the compiler generated, taking loops and the like into account. It was rather tedious and error-prone.
Once the Pentium showed up and started executing multiple instructions at once, with some arcane restrictions, Intel threw everyone a bone and provided the RDTSC instruction. It would return a 64-bit count of cycles since reset. There was much rejoicing, as it was generally far more accurate and lower overhead than most of the other approaches. There were issues with multi-processor setups, where the two counters wouldn't necessarily be in sync with each other, but those were generally ignorable at the consumer level until much later. Another issue was that since it only tracked cycles since reset, you had to work around what happened if the OS set you aside and ran some other long, slow process between your timestamp samples.
Windows provided a somewhat abstracted interface to it, via QueryPerformanceCounter, but while it works, it has a slight bit of overhead above and beyond raw RDTSC instructions -- so naturally game developers had a tendency to bypass it.
Later, after there were enough Pentium and Pentium-compatible chips out there for developers to actually start relying on the feature, that particular quirk came to a head when dual-core processors started showing up in volume. The original AMD dual-core issue reared its head here: if the OS bounced a thread between cores, software could see drastic shifts in the perceived rate of time. Both Intel and AMD had some mechanisms to let the OS smooth that over a bit (generally by letting it reset the count value whenever it moved a task around, so software could treat the TSC as a process-relative time), but there were teething issues with getting Windows updated to take advantage of that.
Things were fine for a while, until the thermal issues started popping up and the CPU fellows had the grand idea to start downclocking or overclocking the cores to stretch that heat budget further. Suddenly, that reliable TSC tick rate went right out the window -- hence the second round of AMD issues, and the "turn off Cool & Quiet" and "have other things running so the core is woken up at startup" style fixes that came around. Around that time, motherboards started providing their own high-quality timer, usually something closer to the bus frequency than the CPU frequency, and Windows would shift QPC over to that on appropriate systems. It wouldn't be as precise as the timestamp counter, but still plenty for things like making sure your game doesn't suddenly jump to 1000 fps and turn inside out.
Somewhat more recently, AMD & Intel have provided indications as to whether the timestamp counter rate changes during execution or not. I believe what happened is Intel chose something along the lines of a small multiplier of the bus frequency, and AMD decided to just have it always track either the normal clock or the boost clock, regardless of whether the core was running at full blast. My Intel machine got hit by lightning though, so I don't have anything on hand to test that theory, but it would somewhat explain the apparent AMD-specificity of the issue.
|
|