However, we can wring several more interesting lessons from this tiny example. I have marked these sections below as “Optional-1” through “Optional-4”.
Why is our /O2 version (including a printf to defeat DCE) so much faster than the /Od version? Here is the code generated for the loop of the /O2 version, extracted from the Sum.asm file:
xor edx, edx
mov eax, 1
mov ecx, edx
mov r8d, edx
mov r9d, edx
npad 13
$LL3@main:
inc r9 ;; r9 += 1 (so r9 accumulates i+1)
add r8, 2 ;; r8 += 2 (so r8 accumulates i+2)
add rcx, 3 ;; rcx += 3 (so rcx accumulates i+3)
add r9, rax ;; r9 = 2 8 18 32 50 …
add r8, rax ;; r8 = 3 10 21 36 55 …
add rcx, rax ;; rcx = 4 12 24 40 60 …
add rdx, rax ;; rdx = 1 6 15 28 45 …
add rax, 4 ;; rax = 1 5 9 13 17 …
cmp rax, 1000000000 ;; i <= 1000000000 ?
jle SHORT $LL3@main ;; yes, so loop back
Note that the loop body contains approximately the same number of instructions as the non-optimized build, and yet it runs much faster. That’s mostly because the instructions in this optimized loop use registers, rather than memory locations. As we all know, accessing a register is much faster than accessing memory. Here are approximate latencies that demonstrate how memory access can slow your program to a snail’s pace:

Register: ~1 cycle
L1 cache: ~4 cycles
L2 cache: ~10 cycles
Main memory: ~100 cycles or more

So, the non-optimized version is reading and writing to stack locations, which will quickly migrate into the L2 (10-cycle access time) and L1 (4-cycle access time) caches. Both are slower than if all the calculation is done in registers, with access times around a single cycle.
But there’s more going on here to make the code run faster. Notice how the /Od version increments the loop counter by 1 each time around the loop. But the /O2 version increments the loop counter (held in register RAX) by 4 each time around the loop. Eh?
The optimizer has unrolled the loop. So it adds four items together on each iteration, like this:
s = (1 + 2 + 3 + 4) + (5 + 6 + 7 + 8) + (9 + 10 + 11 + 12) + (13 + . . .
By unrolling this loop, the test for loop-end is made every four iterations, rather than on every iteration – so the CPU spends more time doing useful work than forever checking whether it can stop yet!
Also, rather than accumulate the results into a single location, it has decided to use 4 separate registers, to accumulate 4 partial sums, like this:
RDX = 1 + 5 + 9 + 13 + … = 1, 6, 15, 28 …
R9 = 2 + 6 + 10 + 14 + … = 2, 8, 18, 32 …
R8 = 3 + 7 + 11 + 15 + … = 3, 10, 21, 36 …
RCX = 4 + 8 + 12 + 16 + … = 4, 12, 24, 40 …
At the end of the loop, it adds the partial sums, in these four registers, together, to get the final answer.
(I will leave it as an exercise for the reader to work out how the optimizer copes with a loop whose trip count is not a multiple of 4.)
Earlier, I said that the /O2 version of the program, without a printf inhibitor, “runs so fast there’s no perceptible delay”. Here is a program that makes that statement more precise:
#include <stdio.h>
#include <windows.h>

int main() {
    LARGE_INTEGER start, stop;
    QueryPerformanceCounter(&start);
    long long s = 0;
    for (long long i = 1; i <= 1000000000; ++i) s += i;
    QueryPerformanceCounter(&stop);
    double diff = stop.QuadPart - start.QuadPart;
    printf("%f", diff);
}
It uses QueryPerformanceCounter to measure the elapsed time. (This is a “sawn-off” version of a high-resolution timer described in a previous blog). There are lots of caveats to bear in mind when measuring performance (you might want to browse a list I wrote previously), but they don’t matter for this particular case, as we shall see in a moment:
On my PC, the /Od version of this program prints a value for diff of about 7 million somethings. (The units of the answer don’t matter – just know that the number gets bigger as the program takes longer to run). The /O2 version prints a value for diff of 0. And the cause, as explained above, is our friend DCE.
When we add a printf to prevent DCE, the /O2 version runs in about 1 million somethings – a speedup of about 7X.
If we look carefully again at the assembler listings in this post, we find a few surprises in the part of the program that initializes our registers:
xor edx, edx ;; rdx = 0 (64-bit!)
mov eax, 1 ;; rax = i = 1 (64-bit!)
mov ecx, edx ;; rcx = 0 (64-bit!)
mov r8d, edx ;; r8 = 0 (64-bit!)
mov r9d, edx ;; r9 = 0 (64-bit!)
npad 13 ;; multi-byte nop alignment padding
$LL3@main:
Recall first that the original C++ program uses long long variables for both the loop counter and the sum. In the VC++ compiler, these map onto 64-bit integers. So we should expect the generated code to make use of x64’s 64-bit registers.
We already explained, in a previous post, that xor reg, reg is a compact way to zero out the contents of reg. But our first instruction applies xor to EDX – the 32-bit lower half of RDX. On x64, writing to a 32-bit register automatically zeroes the upper 32 bits of the enclosing 64-bit register, so each of these 32-bit instructions really does initialize a full 64-bit register – which is what the “(64-bit!)” comments are pointing out.