Tue, 06 Mar 2007
How expensive is a cache miss? A TLB miss?
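/* "=A" means the edx:eax pair only on 32-bit x86. */
#define rdtscll(val) \
    __asm__ __volatile__("rdtsc" : "=A" (val))
...
for (i=0; i< num; i++) {
    rdtscll(start);
    /* Load 4 bytes from area + i; the address is built in %ecx. */
    asm volatile(
        "movl %1, %%eax\n"
        "leal (%2, %%eax), %%ecx\n"
        "movl (%%ecx), %0\n"
        : "=r"(data)
        /* %2 is used as a base register in leal, so it needs "r", not "g". */
        : "g"(area), "r"(i)
        : "eax", "ecx");
    rdtscll(stop);
    /* Scale the tick delta and count it, clamping to the last bucket. */
    idx = (stop-start)/CPUFREQ;
    if (idx >= MAX_RESULTS)
        results[MAX_RESULTS-1]++;
    else
        results[idx]++;
}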
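The fragment above leaves out its declarations. As a minimal sketch of just the histogram bookkeeping, assuming a histogram of 512 buckets and a caller-supplied divisor in place of CPUFREQ (the names `bucket_index` and `record` and the value 512 are hypothetical, not taken from the original test), it could look like this:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_RESULTS 512   /* assumed histogram size, not from the original */

static unsigned long results[MAX_RESULTS];

/* Map a tick delta to a histogram bucket; anything too large
 * is counted in the last bucket. */
unsigned bucket_index(uint64_t delta, uint64_t divisor)
{
    assert(divisor != 0);
    uint64_t idx = delta / divisor;
    return idx >= MAX_RESULTS ? MAX_RESULTS - 1 : (unsigned)idx;
}

/* Count one measurement taken between two timestamp reads. */
void record(uint64_t start, uint64_t stop, uint64_t divisor)
{
    results[bucket_index(stop - start, divisor)]++;
}
```

With a divisor of 1 the buckets are raw cycles; a divisor equal to the clock rate in GHz would turn them into nanoseconds.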
Here is an overall picture with the results:

![]()

The horizontal axis shows the number of cycles needed to fetch one entry from memory. The vertical axis shows the number of elements fetched at that speed. But let's look at the end of the picture, where the most expensive fetches live (TLB misses):

![]()

The graph above has nanoseconds on the horizontal axis. We have roughly one TLB miss per 4k reads, exactly on the 4k TLB entry boundary:

![]()

Analyzing the first picture, we find that most of the time fetching 4 bytes from memory takes about 100 cycles, and about 1/8 of the time 4 bytes are fetched in 86 cycles. Strange results for DDR (or DDR2, I do not actually remember) memory, aren't they? That is because they are absolutely incorrect. Well, not absolutely, but they show nothing interesting about L1/L2 latencies (only the TLB miss test is useful), simply because those 100 cycles are exactly the overhead of the second rdtsc instruction. I added one more rdtsc and got a peak at 200 cycles.
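One way to check the timer overhead directly is to read the counter twice in a row and keep the smallest difference. The following is a minimal sketch, not the code used for the graphs: `read_ticks` and `timer_overhead` are hypothetical names, it assumes GCC or Clang (`__rdtsc()` from `<x86intrin.h>`), and on non-x86 machines it falls back to `clock_gettime`, which counts nanoseconds rather than cycles.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>
/* Read the time stamp counter (cycles). */
static inline uint64_t read_ticks(void)
{
    return __rdtsc();
}
#else
/* Fallback for other architectures: monotonic clock in nanoseconds. */
static inline uint64_t read_ticks(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

/* Smallest difference between two back-to-back timer reads.
 * Taking the minimum filters out interrupts and other noise. */
uint64_t timer_overhead(int rounds)
{
    assert(rounds > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < rounds; i++) {
        uint64_t start = read_ticks();
        uint64_t stop = read_ticks();
        if (stop - start < best)
            best = stop - start;
    }
    return best;
}
```

If the value returned by `timer_overhead` is close to the position of the main peak in the histogram, the measured load is hidden behind the timer itself.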
Real memory access is fully hidden behind the rdtsc overhead.

This problem only affects my dual-core Intel machine (a Pentium D, according to /proc/cpuinfo):

processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 6
model name : Intel(R) Pentium(R) D CPU 3.40GHz
stepping : 5
cpu MHz : 3740.058
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 3
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc pni monitor ds_cpl est cid cx16 xtpr lahf_lm
bogomips : 7485.68

The second CPU is the same. Never use that processor together with the rdtsc instruction:
it is implemented extremely slowly.

Better to check an AMD Athlon64 3200 CPU. Its rdtsc takes 8 cycles, which is a very good result. So, the same memory access test:

![]()

As you can see, there are two peaks, around 20 and 65 cycles. Given that rdtsc takes 8 cycles, and subtracting the 4 reading instructions (about 4 cycles), we get about 8 cycles for an L2 cache hit and about 53 cycles for an L2 cache miss. The test fetches 4 bytes aligned to a 4-byte boundary, so comparing the areas under the peaks (their ratio is about 1/15) we get about a 64-byte cache line (4*15), which is exactly the cache line size on that processor.

The TLB miss picture for that CPU is essentially the same:

![]()

except that the TLB miss happens on a 1k boundary and takes 2 times longer than on Intel (maybe it is the x86_64 Linux port, which can cover userspace with 1kb TLB entries; I did not check, but given that /proc/cpuinfo shows "TLB size : 1024 4K pages", it might be true).

That's all I wanted to say about it. Actually, I just did not want to do anything serious, so to kill some time I ran these tests. I hope you will find it somewhat interesting.
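The arithmetic behind the Athlon64 estimates above can be written out as a small sketch. The function names are hypothetical, and the figure of 4 cycles for the reading instructions is the value implied by the numbers in the post rather than something measured separately.

```c
#include <assert.h>

/* Latency left after removing measurement overhead from a histogram
 * peak. All three inputs are in cycles. */
int net_latency(int peak, int rdtsc_cost, int read_cost)
{
    assert(peak >= rdtsc_cost + read_cost);
    return peak - rdtsc_cost - read_cost;
}

/* Cache line estimate as done in the post: bytes per read times the
 * number of cache hits observed per cache miss. */
int line_size_estimate(int bytes_per_read, int hits_per_miss)
{
    return bytes_per_read * hits_per_miss;
}
```

With the numbers from the post, `net_latency(20, 8, 4)` gives 8 cycles for the hit peak, `net_latency(65, 8, 4)` gives 53 cycles for the miss peak, and `line_size_estimate(4, 15)` gives 60 bytes, for which the nearest power of two is 64. Counting the missing read itself (15 hits plus 1 miss per line) would give 4*16 = 64 directly.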