Zbr's days.

Tue, 06 Mar 2007

How expensive is a cache miss? A TLB miss?


I ran some trivial tests on my Intel desktop (a Pentium D according to the /proc/cpuinfo output below) to answer these questions for myself.
The tests basically allocate a region in memory and read from different addresses in a loop.
I ran them in userspace to determine the price of a TLB miss (each TLB entry in userspace on x86 covers 4k).
Here is a code snippet:

#define rdtscll(val) \
	__asm__ __volatile__("rdtsc" : "=A" (val))

	...
	
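	for (i=0; i< num; i++) {
		rdtscll(start);
		/* load 4 bytes from area + i into data */
		asm volatile(
			"movl %1, %%eax\n"
			"leal (%2, %%eax), %%ecx\n"
			"movl (%%ecx), %0\n"
			: "=r"(data)
			: "g"(area), "r"(i)	/* %2 is the base register of leal, so it needs "r" */
			: "eax", "ecx");
		rdtscll(stop);
		/* histogram bucket for this sample; the last bucket collects overflows */
		idx = (stop-start)/CPUFREQ;
		if (idx >= MAX_RESULTS)
			results[MAX_RESULTS-1]++;
		else
			results[idx]++;
	}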
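
The bucketing step at the end of the loop can be isolated into a small helper. This is only a sketch: the name record_sample and its parameters are hypothetical and not part of the original test, and scale plays the role of CPUFREQ above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the original test): count one timed
 * sample in a histogram. Samples that do not fit are collected in the
 * last bucket, as in the loop above.
 */
void record_sample(unsigned long *results, size_t max_results,
                   uint64_t start, uint64_t stop, uint64_t scale)
{
    uint64_t idx;

    assert(results != NULL && max_results > 0 && scale > 0);

    idx = (stop - start) / scale;
    if (idx >= max_results)
        idx = max_results - 1;  /* overflow bucket */
    results[idx]++;
}
```

With scale set to 1 the buckets are raw timestamp ticks; a larger scale simply makes every bucket wider.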
Here is an overall picture with results:

[Figure: memory access speed]

The horizontal axis shows the number of cycles needed to fetch one entry from memory.
The vertical axis shows the number of elements fetched at that speed.

Now let's look at the tail of the picture, where the most expensive fetches (TLB misses) live:

[Figure: TLB miss]

The graph above has nanoseconds on the horizontal axis.
We get roughly one TLB miss per 4k reads, exactly on the 4k boundary of a TLB entry:

[Figure: TLB miss details]
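
The "one miss per 4k" pattern is just page arithmetic, which can be sketched as below. The function pages_touched is a hypothetical name, and the sketch assumes the region starts on a page boundary, the stride is no larger than a page, and every first touch of a page costs one TLB miss (reads straddling a page end are ignored).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of distinct pages touched by num_reads
 * sequential reads that start at a page-aligned address and advance by
 * stride bytes each time. Under the assumptions above this is also the
 * expected number of TLB misses.
 */
size_t pages_touched(size_t num_reads, size_t stride, size_t page_size)
{
    size_t last_offset;

    assert(page_size > 0 && stride > 0 && stride <= page_size);

    if (num_reads == 0)
        return 0;
    last_offset = (num_reads - 1) * stride;  /* offset of the final read */
    return last_offset / page_size + 1;
}
```

For example, walking 1 MB byte by byte with 4k pages touches 256 pages, so roughly one read out of every 4096 should land in the expensive tail of the histogram.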

Analyzing the first picture, we find that most of the time fetching 4 bytes from memory takes about 100 cycles, and about 1/8 of the time the 4 bytes are fetched in 86 cycles.

Strange results for DDR memory (or DDR2, I do not actually remember), aren't they?
That is because they are absolutely incorrect.
Well, not absolutely, but they show nothing interesting about L1/L2 latencies (only the TLB miss test is useful), simply because those 100 cycles are exactly the overhead of the second rdtsc instruction. I added one more rdtsc and got a peak at 200 cycles. The real memory access is completely hidden behind the rdtsc overhead.
This problem only affects my Intel machine:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 6
model name      : Intel(R) Pentium(R) D CPU 3.40GHz
stepping        : 5
cpu MHz         : 3740.058
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 3
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc pni 
monitor ds_cpl est cid cx16 xtpr lahf_lm
bogomips        : 7485.68
The second CPU is the same. Never use the rdtsc instruction on that processor: it is implemented extremely slowly.
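
The rdtsc overhead described above can be estimated before trusting any histogram by reading the counter twice in a row with nothing in between. The sketch below is an assumption of how one might do it, not the code used for these tests: min_read_overhead, counter_fn and fixed_step_counter are hypothetical names, and taking the minimum is a choice made here (the tests above looked at histogram peaks instead). read_tsc wraps the __rdtsc() intrinsic from x86intrin.h and is only built by GCC or Clang on x86.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*counter_fn)(void);

/*
 * Smallest difference seen between two back-to-back counter reads.
 * Returns UINT64_MAX when iterations is 0.
 */
uint64_t min_read_overhead(counter_fn read_counter, unsigned iterations)
{
    uint64_t best = UINT64_MAX;
    unsigned i;

    assert(read_counter != NULL);

    for (i = 0; i < iterations; i++) {
        uint64_t start = read_counter();
        uint64_t stop = read_counter();
        uint64_t delta = stop - start;

        if (delta < best)
            best = delta;
    }
    return best;
}

/* Stand-in counter for trying the helper out: advances 100 ticks per read. */
uint64_t fixed_step_counter(void)
{
    static uint64_t ticks;

    ticks += 100;
    return ticks;
}

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>

/* Real timestamp counter on x86 (GCC/Clang intrinsic). */
uint64_t read_tsc(void)
{
    return __rdtsc();
}
#endif
```

On an x86 box one would call min_read_overhead(read_tsc, 100000) and subtract the result from every measured sample; if the result is close to the histogram peak, the test is measuring the timer and not the memory.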

Better to check an AMD Athlon64 3200 CPU.
Its rdtsc takes 8 cycles, which is a very good result.
So, the same memory access test:

[Figure: memory access speed on AMD]

As you can see, there are two peaks, around 20 and 65 cycles. Given that rdtsc takes 8 cycles, and subtracting another 4 for the read instructions, we get about 8 cycles for an L2 cache hit and about 53 cycles for an L2 cache miss. The test fetches 4 bytes aligned to a 4-byte boundary, so comparing the areas under the peaks (their ratio is about 1/15) gives a cache line of about 64 bytes (4*15), which is exactly the cache line size on that processor.
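
The arithmetic in that paragraph, written out as a small sketch. All three function names are hypothetical, and rounding the 60-byte estimate up to a power of two is an assumption added here to get from 4*15 to 64.

```c
#include <assert.h>
#include <limits.h>

/* Cycles left after removing the timer cost and the instruction cost. */
unsigned net_latency(unsigned peak, unsigned timer_cost, unsigned insn_cost)
{
    assert(peak >= timer_cost + insn_cost);
    return peak - timer_cost - insn_cost;
}

/* Bytes read with cache hits for every single cache miss. */
unsigned bytes_per_miss(unsigned read_size, unsigned hits_per_miss)
{
    return read_size * hits_per_miss;
}

/* Smallest power of two that is not below n. */
unsigned round_up_pow2(unsigned n)
{
    unsigned p = 1;

    assert(n <= UINT_MAX / 2 + 1);  /* otherwise p would overflow */
    while (p < n)
        p <<= 1;
    return p;
}
```

With the numbers from the graph: net_latency(20, 8, 4) is 8 cycles for a hit, net_latency(65, 8, 4) is 53 cycles for a miss, and round_up_pow2(bytes_per_miss(4, 15)) turns the 60-byte estimate into 64.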

The TLB miss picture for that CPU is essentially the same:

[Figure: TLB miss details for AMD]

except that the TLB miss happens on a 1k boundary and takes twice as long as on Intel (maybe the x86_64 Linux port can cover userspace with 1kb TLB entries; I did not check, but given that /proc/cpuinfo shows "TLB size : 1024 4K pages", it might be true).

That's all I wanted to say about it. Actually I just did not want to do anything serious, so I ran these tests to kill some time.
I hope you find it somewhat interesting.
