To measure the impact of cache misses in a program, I want to compare the latency caused by cache misses to the cycles used for actual computation.
I use perf stat to measure the cycles, L1-loads, L1-misses, LLC-loads and LLC-misses in my program. Here is an example output:
467 769,70 msec task-clock # 1,000 CPUs utilized
1 234 063 672 432 cycles # 2,638 GHz (62,50%)
572 761 379 098 instructions # 0,46 insn per cycle (75,00%)
129 143 035 219 branches # 276,083 M/sec (75,00%)
6 457 141 079 branch-misses # 5,00% of all branches (75,00%)
195 360 583 052 L1-dcache-loads # 417,643 M/sec (75,00%)
33 224 066 301 L1-dcache-load-misses # 17,01% of all L1-dcache hits (75,00%)
20 620 655 322 LLC-loads # 44,083 M/sec (50,00%)
6 030 530 728 LLC-load-misses # 29,25% of all LL-cache hits (50,00%)
Then my question is:
How can I convert the number of cache misses into a number of "lost" clock cycles?
Or alternatively, what is the proportion of time spent fetching data?
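What I have in mind is a naive conversion along these lines (a rough sketch; the penalty values are placeholders I made up, not figures for my CPU):

    # Rough model: each L1 miss costs at least an on-chip refill, and each
    # LLC miss additionally pays a trip to DRAM. The penalties below are
    # invented placeholders, not real values for the i7-10810U.
    def lost_cycles(l1_misses, llc_misses, on_chip_penalty=30, dram_penalty=300):
        on_chip_refills = l1_misses - llc_misses   # misses served by L2/LLC
        return on_chip_refills * on_chip_penalty + llc_misses * dram_penalty

To plug real numbers into something like this, I would need the actual per-level latencies of my processor.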
I think these latency figures should be known to the manufacturer. My processor is an Intel Core i7-10810U, and I couldn't find this information in its specifications nor in this list of benchmarked CPUs.
This related question describes how to measure the number of cycles lost on a cache miss, but is there a way to obtain these latencies as hardware information? Ideally, the output would look something like:
L1-hit: 3 cycles
L2-hit: 10 cycles
LLC-hit: 30 cycles
RAM: 300 cycles
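If latencies like those were available, I imagine the estimate for the second question (the proportion of time spent fetching data) would go roughly like this. The latency values below are the hypothetical ones from the list above, the counters come from my perf stat output, and the model ignores that an out-of-order core overlaps misses with useful work, so at best it gives an upper bound:

    # Counter values copied from the perf stat output above.
    cycles     = 1_234_063_672_432
    l1_loads   = 195_360_583_052
    l1_misses  = 33_224_066_301
    llc_loads  = 20_620_655_322
    llc_misses = 6_030_530_728

    # Hypothetical per-level hit latencies in cycles (the hardware info I am after).
    L1_HIT, L2_HIT, LLC_HIT, RAM = 3, 10, 30, 300

    l1_hits  = l1_loads  - l1_misses
    l2_hits  = l1_misses - llc_loads     # misses served before reaching the LLC
    llc_hits = llc_loads - llc_misses

    # Serialized cost of all loads; real hardware overlaps these latencies,
    # so this ratio can even come out greater than 1.
    memory_cycles = (l1_hits * L1_HIT + l2_hits * L2_HIT
                     + llc_hits * LLC_HIT + llc_misses * RAM)

    print(f"estimated memory cycles : {memory_cycles:,}")
    print(f"measured total cycles   : {cycles:,}")
    print(f"ratio                   : {memory_cycles / cycles:.2f}")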