So, you want to time a piece of code using a high-resolution timer on the x86 architecture. Okay, great! You can now time the execution of code that runs in the sub-microsecond range! In this post, I’ll go into detail on how to do things properly and what to watch out for. I’ve pulled together different resources so that we can cover the intricacies and minutiae involved in using this hardware timer properly.
I’ve verified everything here on my personal system, which has an AMD Ryzen CPU with a clock rate of 2.3 GHz and gcc 14.1 as my compiler, and on my lab’s mini cluster, which has an Intel Ice Lake server CPU with gcc 11.4.
Before we start, let me make the case for why we would even want to do this in the first place.
Why not use OS-provided APIs?
Relying on “high-level” APIs such as `gettimeofday` and `clock_getres`, or language-provided APIs like `std::chrono::steady_clock` and `std::chrono::high_resolution_clock`, isn’t an option for me in most cases. Here’s a list of issues I have run into when trying to use them:
- There’s no guarantee on the accuracy/resolution of either `gettimeofday` or `std::chrono::steady_clock` [1][2].
- There’s no guarantee that the time returned by `gettimeofday` or `std::chrono::high_resolution_clock` is monotonically increasing [3][1].
- Using `clock_getres` with `CLOCK_MONOTONIC_RAW` is the closest thing that I could find to a high-level API that is both monotonic and HW-based [4]. Though, there’s no info on the granularity of the timer behind it, and there’s no guarantee by the POSIX API that on x86_64 it’ll map to `rdtsc` or similar instructions. (See the sketch after this list.)
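For reference, here’s a minimal sketch of what querying that clock looks like. Note that on Linux `clock_getres` typically reports 1 ns for `CLOCK_MONOTONIC_RAW`, which tells you the unit of the returned value, not the actual granularity of the underlying timer:

```c
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec res, now;

    // Advertised resolution: the unit of tv_nsec, not the real granularity.
    clock_getres(CLOCK_MONOTONIC_RAW, &res);
    // A raw, monotonic reading that is unaffected by NTP adjustments.
    clock_gettime(CLOCK_MONOTONIC_RAW, &now);

    printf("advertised resolution: %ld ns\n", res.tv_nsec);
    printf("now: %lld.%09ld s\n", (long long)now.tv_sec, now.tv_nsec);
    return 0;
}
```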
For the reasons mentioned above, I’d rather rely on hand-rolled timing using the TSC. Other architectures have similar functionality to the TSC and instructions to read it, so porting this to a new arch won’t be a big issue.
Before we continue, let me be clear about something: I’m not saying that these high-level timing APIs are somehow flawed and should be avoided. Many people happily use them without any issues. People either don’t care about the implementation of these APIs, or roughly know what they do under the hood on the range of HW that their library/program supports. With all this said, I’d rather be explicit about the functionality that I require from my code. Other parts of my code rely on the behavior of the timing API being as non-intrusive to the state of the CPU and memory as possible. Also, since I have no control over the implementation of those timing APIs, my code won’t be immune to API breakages or regressions caused by them.
Using TSC
First, let’s discuss what the TSC is. On x86_64, there is a register, called the TSC (Time Stamp Counter), that is incremented every clock cycle. You wouldn’t want the clock source of the counter to be tied to the clock of the CPU, because the CPU’s clock rate isn’t steady and can change due to dynamic frequency scaling. So, you have to make sure your CPU has an “invariant” TSC. All Intel CPUs from Nehalem onward have an invariant TSC. On the AMD side, things are a bit complicated [5], but generally the support is there.
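A quick way to check is the invariant-TSC bit in CPUID. Here’s a minimal sketch using GCC’s `<cpuid.h>` helper (extended leaf 0x80000007, EDX bit 8); on Linux you can also just look for the `constant_tsc` and `nonstop_tsc` flags in /proc/cpuinfo:

```c
#include <cpuid.h>
#include <stdbool.h>

// Returns true if CPUID advertises an invariant TSC
// (extended leaf 0x80000007, EDX bit 8).
static bool has_invariant_tsc(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        return false; // leaf not supported on this CPU
    return (edx >> 8) & 1;
}
```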
In order to read the contents of the TSC register, we use the `rdtsc` instruction. It returns the value in two chunks, in the `EDX:EAX` registers [6]. We’ll just pack them into a single `uint64_t` using a shift and an or:
```c
#include <stdint.h>

inline uint64_t __attribute__((always_inline))
rdtsc() {
    uint32_t lo, hi;
    // rdtsc puts the low 32 bits of the TSC in EAX and the high 32 bits in EDX.
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
```
If you’re using GCC or LLVM clang, and your compiler and header files aren’t half-broken, you can use the compiler intrinsic provided by them:
```c
#include <stdint.h>
#include <x86intrin.h> // For __rdtsc

inline uint64_t __attribute__((always_inline))
rdtsc() { return __rdtsc(); }
```
Now, the return value of `rdtsc()` is in “cycles”, which is a nice unit to work with. We can use it to time the latency of a block of code by calling it twice:
```c
{
    uint64_t before = rdtsc();
    // Code you want to measure the latency of
    uint64_t after = rdtsc();
    uint64_t latency = after - before; // This also includes the latency of a single rdtsc()
}
```
We also need a way to convert these values to seconds. Since we don’t put our CPU into a deep-sleep state during the execution of our code and the clock source of the TSC is steady, the unit change can be done by a simple affine transform. I’ll discuss how this can be done in a future post.
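To give you the idea ahead of that post, here’s a minimal sketch of the scaling step, assuming you already know the TSC frequency (the `tsc_hz` parameter is a placeholder you’d have to calibrate or look up yourself; it is generally not the advertised boost clock of the CPU):

```c
#include <stdint.h>

// Cycles-to-seconds is a simple scaling once the TSC frequency is known.
static inline double cycles_to_seconds(uint64_t cycles, double tsc_hz) {
    return (double)cycles / tsc_hz;
}
```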
Using `rdtsc` alone may not be enough
The problem is that both the CPU and the compiler can and will reorder your `rdtsc` instructions. Let’s start with the CPU.
Out-of-order Execution
Almost all modern CPUs use a trick called “out-of-order execution (OoO)”. Out-of-order execution essentially means that the instruction stream will not be executed in the order in which it’s defined in the executable binary. This is done to increase ILP (instruction-level parallelism). But the CPU will make sure that you (as the user) will not see any difference between OoO and serial execution (kinda like the as-if rule). So, the observed external behavior of OoO execution and serial execution should be the same. However, how external are you to the CPU? If you’re running Python, for example, you won’t notice a thing. But if you need cycle-accurate timing (which we kinda do), then OoO execution is going to ruin your day(s). I suspect that this problem arises from the fact that `rdtsc` was introduced in the Intel Pentium, but OoO execution was added later on, in the Pentium Pro series. So, we need a way to serialize the execution, or at least prevent other instructions from executing around `rdtsc`.
Since `rdtsc` is not a serializing instruction, instructions before it may not have finished executing when it reads the counter, and instructions after it may already have started. This means instructions around `rdtsc` will change the behavior of the CPU and may appear as noise in your results. Apparently, loads and stores to main memory are among the worst kinds of such instructions, as they happen often and interact with a resource that is outside the CPU.
To quote [6]:

> It (= `rdtsc`) does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the read operation is performed.
And Intel’s docs [7]:

> The rdtsc instruction is not serializing or ordered with other instructions. It does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the rdtsc instruction operation is performed.
To fix this, I’ve seen people use `rdtsc` with a load fence (= `lfence`) before it to enforce ordering. There’s also `rdtscp`, which is “more” serializing than plain `rdtsc`, but I’ll discuss it later in the post.
In order to tell the CPU to wait and finish all the memory loads that happened before it, we can use the `lfence` instruction. Apparently, according to [8], it also makes sure all instructions before it have finished execution. I highly recommend you give [8] a read.
So, depending on how exact you want to be, either put an `lfence` before `rdtsc`, or use two `lfence`s, one before and one after:
```c
#include <stdint.h>
#include <x86intrin.h>

inline uint64_t __attribute__((always_inline))
rdtsc_fenced() {
    _mm_lfence(); // wait for earlier insns to retire before reading the clock
    const uint64_t tsc = __rdtsc();
    // _mm_lfence(); // optional: block later insns until rdtsc retires
    return tsc;
}
```
Or, in inline assembly:
```c
#include <stdint.h>

static inline uint64_t
__attribute__((always_inline)) rdtsc() {
    asm volatile("lfence"); // wait for earlier insns to finish
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    const uint64_t res = ((uint64_t)hi << 32) | lo;
    // asm volatile("lfence"); // optional: order later insns after rdtsc
    return res;
}
```
From what I understand, if the 2nd `lfence` is removed, the instructions after `rdtsc` can start executing before `rdtsc` itself has finished. This is because of the reordering of instructions that happens at the CPU level. (If you want to know more, search for terms such as re-order buffers (ROB), register renaming, and out-of-order execution.) Depending on what you’re doing, this may or may not be ideal.
Instruction Reordering Done by the Compiler
The compiler itself will also reorder some of the instructions within `rdtsc_fenced()`. As an example, take a look at the generated assembly of the `unroll_me` function on godbolt [9]. For the sake of brevity, let’s just look at the last iteration of the loop. We can see the `F(fdata)` call being sandwiched between `lfence`s:
```asm
rdtsc
lfence
mov edi, DWORD PTR fdata[rip] ; load value of fdata into edi
sal rdx, 32                   ; rdtsc() logic: shift high 32 bits of rdx
or rax, rdx                   ; rdtsc() logic: set the high 32 bits of rax to rdx's
mov r12, rax                  ; back up rax set by previous rdtsc to r12
call F(int)
lfence
rdtsc
; Rest of the code: subtract new TSC from old value in r12 and
; store it into memory.
```
As you can see, there are three instructions (`sal`, `or`, and `mov`) that should have happened before the first `lfence` according to our C code. But the compiler has reordered the instructions and put them after the `lfence`. As I said, this may or may not be the proper behavior for your use case. Read this [10] if you care about it. In my use case, a couple of instructions here and there don’t make a difference, so I’ve decided not to care about this. Also, adding a store fence (= `sfence`) after the store to the `tscs` array didn’t help. The only solution that I can think of that avoids compiler instruction reordering is to bypass the compiler completely: handwritten assembly.
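For completeness, here’s a sketch of what that could look like: the whole read, including the pack, is done inside one `asm` block between the fences, so the compiler can’t hoist anything into the measured region. The `"memory"` clobber is my own addition and is a heavier hammer than strictly needed, since it also stops the compiler from caching values across the read:

```c
#include <stdint.h>

static inline uint64_t
__attribute__((always_inline)) rdtsc_asm() {
    uint64_t res;
    asm volatile("lfence\n\t"
                 "rdtsc\n\t"
                 "shl $32, %%rdx\n\t"   // move the high half into place
                 "or  %%rdx, %%rax\n\t" // pack EDX:EAX into RAX
                 "lfence"
                 : "=a"(res)
                 : /* no inputs */
                 : "rdx", "cc", "memory");
    return res;
}
```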
Tying Up Loose Ends
There’s the option of using `rdtscp`. I opted to use `rdtsc` with `lfence`s instead, since using the two together provides better ordering compared to `rdtscp` [11]. So, I’ll skip `rdtscp`.
Unfortunately, I haven’t found any straightforward way of telling the compiler not to reorder those instructions, other than writing the whole `rdtsc_fenced()` function in inline assembly. I had a discussion about this with my supervisor, and we reached these two conclusions:
- The noise that the reordering of instructions produces is negligible. For my use case, absolute cycle accuracy isn’t a goal. We care about accuracy from the hundreds digit onwards.
- The compiler itself knows that the target CPU does out-of-order execution, so apparently having the instructions swapped doesn’t really matter: the CPU will see those instructions all at the same time anyway.
In [12], Intel mentions that a `cpuid` instruction will completely serialize the whole CPU. They use it with `rdtsc` for their timings. But I’m not going to use it, because `cpuid` costs a lot of cycles and halts the entire CPU. Plus, the document is from 2010. On Intel 12th Gen (Alder Lake) and newer, there’s a new instruction called `serialize` that serializes instruction execution.
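In case you do want the fully serialized variant from [12], it looks roughly like this sketch; note that `cpuid` also clobbers EBX and ECX, and its cost can vary from run to run, which is part of why I avoid it:

```c
#include <stdint.h>

static inline uint64_t
__attribute__((always_inline)) rdtsc_cpuid() {
    uint32_t lo, hi;
    asm volatile("cpuid\n\t" // full serialization: all earlier insns complete first
                 "rdtsc"
                 : "=a"(lo), "=d"(hi)
                 : "a"(0)    // cpuid leaf 0
                 : "rbx", "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}
```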
In section 18.17 (19.17 as of June 2025) of Intel’s Architectures Software Developer’s Manual, Volume 3, they suggest using `rdtsc` with `lfence`, just like we did in `rdtsc_fenced()`.
There’s also `rdpmc`, but I got a segfault every time I tried to use it, so I gave up on it. John D. McCalpin from TACC uses it for measuring latencies at the cache level [13].
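For reference, reading a counter with `rdpmc` looks like the sketch below. My understanding is that it faults unless the kernel has enabled user-space counter access (the CR4.PCE bit, exposed on Linux via /sys/bus/event_source/devices/cpu/rdpmc), which would explain the segfaults I was seeing; treat that explanation as an educated guess:

```c
#include <stdint.h>

// Read performance counter `counter`. From user space this faults
// (shows up as a segfault) unless CR4.PCE allows user-mode rdpmc.
static inline uint64_t
rdpmc(uint32_t counter) {
    uint32_t lo, hi;
    asm volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}
```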
1. `gettimeofday` manpage: https://man7.org/linux/man-pages/man2/settimeofday.2.html archive ↩︎ ↩︎
2. cppreference docs on `std::chrono::steady_clock`: https://en.cppreference.com/w/cpp/chrono/steady_clock.html archive ↩︎
3. cppreference note on `std::chrono::high_resolution_clock`: https://en.cppreference.com/w/cpp/chrono/high_resolution_clock.html#Notes archive ↩︎
4. `clock_getres` manpage: https://man7.org/linux/man-pages/man2/clock_gettime.2.html archive ↩︎
5. TSC on AMD Ryzen CPUs is weird: https://www.agner.org/optimize/blog/read.php?i=838 archive ↩︎
6. Volume 2, section 4.3 of [7]: `rdtsc` archive ↩︎ ↩︎
7. Intel Architectures Software Developer’s Manual - March 2024 ↩︎ ↩︎ ↩︎ ↩︎
8. Volume 2, section 4.3 of [7]: `lfence` archive ↩︎ ↩︎
10. TACC: Comments on timing short code sections on Intel processors: https://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/ archive ↩︎
11. Stack Overflow: Is there any difference in between (rdtsc + lfence + rdtsc) and (rdtsc + rdtscp) in measuring execution time? archive ↩︎
12. How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures - September 2010 (324264-001) pdf ↩︎
13. Very low-overhead timer/counter interfaces for C on Intel 64 processors: https://github.com/jdmccalpin/low-overhead-timers ↩︎