So, you want to time a piece of code using a high-resolution timer on the x86 architecture. Okay, great! You can now time the execution of code that runs in the sub-microsecond range! In this post, I’ll go into detail on how to do things properly and what to watch out for. I’ve pulled together different resources so that we can cover the intricacies and minutiae involved in using this hardware timer properly.
I’ve verified everything here on my personal system, which has an AMD Ryzen CPU with a clock rate of 2.3 GHz and gcc 14.1 as my compiler, and on my lab’s mini cluster, which has an Intel Ice Lake server CPU with gcc 11.4.
Before we start, let me make the case for why we would even want to do this in the first place.
Why not use OS-provided APIs?
Relying on “high-level” APIs such as `gettimeofday` and `clock_getres`, or language-provided APIs like `std::chrono::steady_clock` and `std::chrono::high_resolution_clock`, isn’t an option for me in most cases. Here’s a list of issues I have run into when trying to use them:
- There’s no guarantee on the accuracy/resolution of either `gettimeofday` or `std::chrono::steady_clock` [1][2].
- There’s no guarantee that the time returned by `gettimeofday` or `std::chrono::high_resolution_clock` is monotonically increasing [3][1].
- Using `clock_getres` with `CLOCK_MONOTONIC_RAW` is the closest thing that I could find to a high-level API that is both monotonic and HW-based [4]. Though, there’s no info on the granularity of the timer behind it, and there’s no guarantee by the POSIX API that on x86_64 it’ll map to `rdtsc` or similar instructions. (See the sketch after this list.)
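For reference, here’s a minimal sketch of what querying that clock looks like. Note that on Linux `clock_getres` typically reports 1 ns for `CLOCK_MONOTONIC_RAW`, which tells you the unit of the returned value, not the actual granularity of the underlying timer:

```c
#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec res, now;

    // Advertised resolution: the unit of tv_nsec, not the real granularity.
    clock_getres(CLOCK_MONOTONIC_RAW, &res);
    // A raw, monotonic reading that is unaffected by NTP adjustments.
    clock_gettime(CLOCK_MONOTONIC_RAW, &now);

    printf("advertised resolution: %ld ns\n", res.tv_nsec);
    printf("now: %lld.%09ld s\n", (long long)now.tv_sec, now.tv_nsec);
    return 0;
}
```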
For the reasons mentioned above, I’d rather rely on hand-rolled timing using the TSC. Other architectures have similar functionality to the TSC and instructions to read it, so porting this to a new arch won’t be a big issue.
Before we continue, let me be clear about something: I’m not saying that these high-level timing APIs are somehow flawed and should be avoided. Many people happily use them without any issues. People either don’t care about the implementation of these APIs, or roughly know what they do under the hood on the range of HW that their library/program supports. With all this said, I’d rather be explicit about the functionality that I require from my code. Other parts of my code rely on the behavior of the timing API being as non-intrusive to the state of the CPU and memory as possible. Also, since I have no control over the implementation of those timing APIs, my code won’t be immune to API breakages or regressions caused by them.
Using TSC
First, let’s discuss what the TSC is. On x86_64, there is a register, called the TSC (Time Stamp Counter), that is incremented every clock cycle. You wouldn’t want the clock source of the counter to be tied to the clock of the CPU, because the CPU’s clock rate isn’t steady and can change due to dynamic frequency scaling. So, you have to make sure your CPU has an “invariant” TSC. All Intel CPUs from Nehalem onward have an invariant TSC. On the AMD side, things are a bit complicated [5], but generally the support is there.
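A quick way to check is the invariant-TSC bit in CPUID. Here’s a minimal sketch using GCC’s `<cpuid.h>` helper (extended leaf 0x80000007, EDX bit 8); on Linux you can also just look for the `constant_tsc` and `nonstop_tsc` flags in /proc/cpuinfo:

```c
#include <cpuid.h>
#include <stdbool.h>

// Returns true if CPUID advertises an invariant TSC
// (extended leaf 0x80000007, EDX bit 8).
static bool has_invariant_tsc(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        return false; // leaf not supported on this CPU
    return (edx >> 8) & 1;
}
```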
In order to read the contents of the TSC register, we use the `rdtsc` instruction. It returns the value in two chunks, in the `EDX:EAX` registers [6]. We’ll just pack them into a single `uint64_t` using a shift and an or:
```c
#include <stdint.h>

inline uint64_t __attribute__((always_inline))
rdtsc() {
    uint32_t lo, hi;
    // rdtsc puts the low 32 bits of the TSC in EAX and the high 32 bits in EDX.
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}
```
If you’re using GCC or LLVM clang, and your compiler and header files aren’t half-broken, you can use the compiler intrinsic provided by them:
```c
#include <stdint.h>
#include <x86intrin.h> // For __rdtsc

inline uint64_t __attribute__((always_inline))
rdtsc() { return __rdtsc(); }
```
Now, the return value of `rdtsc()` is in “cycles”, which is a nice unit to work with. We can use it to time the latency of a block of code by calling it twice:
```c
{
    uint64_t before = rdtsc();
    // Code you want to measure the latency of
    uint64_t after = rdtsc();
    uint64_t latency = after - before; // This also includes the latency of a single rdtsc()
}
```
We also need a way to convert these values to seconds. Since we don’t put our CPU into a deep-sleep state during the execution of our code and the clock source of the TSC is steady, the unit change can be done by a simple affine transform. I’ll discuss how this can be done in a future post.
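To give you the idea ahead of that post, here’s a minimal sketch of the scaling step, assuming you already know the TSC frequency (the `tsc_hz` parameter is a placeholder you’d have to calibrate or look up yourself; it is generally not the advertised boost clock of the CPU):

```c
#include <stdint.h>

// Cycles-to-seconds is a simple scaling once the TSC frequency is known.
static inline double cycles_to_seconds(uint64_t cycles, double tsc_hz) {
    return (double)cycles / tsc_hz;
}
```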
Using `rdtsc` alone may not be enough
The problem is that both the CPU and the compiler can and will reorder your `rdtsc` instructions. Let’s start with the CPU.
Out-of-order Execution
Almost all modern CPUs use a trick called “out-of-order execution (OoO)”. Out-of-order execution essentially means that the instruction stream will not be executed in the order in which it’s defined in the executable binary. This is done to increase ILP (instruction-level parallelism). But the CPU will make sure that you (as the user) will not see any difference between OoO and serial execution (kinda like the as-if rule). So, the observed external behavior of OoO execution and serial execution should be the same. However, how external are you to the CPU? If you’re running Python, for example, you won’t notice a thing. But if you need cycle-accurate timing (which we kinda do), then OoO execution is going to ruin your day(s). I suspect that this problem arises from the fact that `rdtsc` was introduced in the Intel Pentium, but OoO execution was added later on, in the Pentium Pro series. So, we need a way to serialize the execution, or at least prevent other instructions from executing around `rdtsc`.
Since `rdtsc` is not a serializing instruction, instructions before it may not have finished executing when it reads the counter, and instructions after it may already have started. This means instructions around `rdtsc` will change the behavior of the CPU and may appear as noise in your results. Apparently, loads and stores to main memory are among the worst kinds of such instructions, as they happen often and interact with a resource that is outside the CPU.
To quote [6]:

> It (= `rdtsc`) does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the read operation is performed.
And Intel’s docs [7]:

> The rdtsc instruction is not serializing or ordered with other instructions. It does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the rdtsc instruction operation is performed.
To fix this, I’ve seen people use `rdtsc` with a load fence (= `lfence`) before it to enforce ordering. There’s also `rdtscp`, which is “more” serializing than plain `rdtsc`, but I’ll discuss it later in the post.
In order to tell the CPU to wait and finish all the memory loads that happened before it, we can use the `lfence` instruction. Apparently, according to [8], it also makes sure all instructions before it have finished execution. I highly recommend you give [8] a read.
So, depending on how exact you want to be, either put an `lfence` before `rdtsc`, or use two `lfence`s, one before and one after:
```c
#include <stdint.h>
#include <x86intrin.h>

inline uint64_t __attribute__((always_inline))
rdtsc_fenced() {
    _mm_lfence(); // wait for earlier insns to retire before reading the clock
    const uint64_t tsc = __rdtsc();
    // _mm_lfence(); // optional: block later insns until rdtsc retires
    return tsc;
}
```
Or, in inline assembly:
```c
#include <stdint.h>

static inline uint64_t
__attribute__((always_inline)) rdtsc() {
    asm volatile("lfence"); // wait for earlier insns to finish
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    const uint64_t res = ((uint64_t)hi << 32) | lo;
    // asm volatile("lfence"); // optional: order later insns after rdtsc
    return res;
}
```
From what I understand, if the 2nd `lfence` is removed, the instructions after `rdtsc` can start executing before `rdtsc` itself has finished. This is because of the reordering of instructions that happens at the CPU level. (If you want to know more, search for terms such as re-order buffers (ROB), register renaming, and out-of-order execution.) Depending on what you’re doing, this may or may not be ideal.
Instruction Reordering Done by the Compiler
The compiler itself will also reorder some of the instructions within `rdtsc_fenced()`. As an example, take a look at the generated assembly of the `unroll_me` function on godbolt [9]. For the sake of brevity, let’s just look at the last iteration of the loop. We can see the `F(fdata)` call being sandwiched between `lfence`s:
```asm
rdtsc
lfence
mov edi, DWORD PTR fdata[rip] ; load value of fdata into edi
sal rdx, 32                   ; rdtsc() logic: shift high 32 bits of rdx
or rax, rdx                   ; rdtsc() logic: set the high 32 bits of rax to rdx's
mov r12, rax                  ; back up rax set by previous rdtsc to r12
call F(int)
lfence
rdtsc
; Rest of the code: subtract new TSC from old value in r12 and
; store it into memory.
```
As you can see, there are three instructions (`sal`, `or`, and `mov`) that should have happened before the first `lfence` according to our C code. But the compiler has reordered the instructions and put them after the `lfence`. As I said, this may or may not be the proper behavior for your use case. Read this [10] if you care about it. In my use case, a couple of instructions here and there don’t make a difference, so I’ve decided not to care about this. Also, adding a store fence (= `sfence`) after the store to the `tscs` array didn’t help. The only solution that I can think of that avoids compiler instruction reordering is to bypass the compiler completely: handwritten assembly.
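For completeness, here’s a sketch of what that could look like: the whole read, including the pack, is done inside one `asm` block between the fences, so the compiler can’t hoist anything into the measured region. The `"memory"` clobber is my own addition and is a heavier hammer than strictly needed, since it also stops the compiler from caching values across the read:

```c
#include <stdint.h>

static inline uint64_t
__attribute__((always_inline)) rdtsc_asm() {
    uint64_t res;
    asm volatile("lfence\n\t"
                 "rdtsc\n\t"
                 "shl $32, %%rdx\n\t"   // move the high half into place
                 "or  %%rdx, %%rax\n\t" // pack EDX:EAX into RAX
                 "lfence"
                 : "=a"(res)
                 : /* no inputs */
                 : "rdx", "cc", "memory");
    return res;
}
```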
Tying Up Loose Ends
There’s the option of using `rdtscp`. I opted to use `rdtsc` with `lfence`s instead, since using the two together provides better ordering compared to `rdtscp` [11]. So, I’ll skip `rdtscp`.
Unfortunately, I haven’t found any straightforward way of telling the compiler not to reorder those instructions, other than writing the whole `rdtsc_fenced()` function in inline assembly. I had a discussion about this with my supervisor, and we reached these two conclusions:
- The noise that the reordering of instructions produces is negligible. For my use case, absolute cycle accuracy isn’t a goal. We care about accuracy from the hundreds digit onwards.
- The compiler itself knows that the target CPU does out-of-order execution, so apparently having the instructions swapped doesn’t really matter: the CPU will see those instructions all at the same time anyway.
In [12], Intel mentions that a `cpuid` instruction will completely serialize the whole CPU. They use it with `rdtsc` for their timings. But I’m not going to use it, because `cpuid` costs a lot of cycles and halts the entire CPU. Plus, the document is from 2010. On Intel 12th Gen (Alder Lake) and newer, there’s a new instruction called `serialize` that serializes instruction execution.
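In case you do want the fully serialized variant from [12], it looks roughly like this sketch; note that `cpuid` also clobbers EBX and ECX, and its cost can vary from run to run, which is part of why I avoid it:

```c
#include <stdint.h>

static inline uint64_t
__attribute__((always_inline)) rdtsc_cpuid() {
    uint32_t lo, hi;
    asm volatile("cpuid\n\t" // full serialization: all earlier insns complete first
                 "rdtsc"
                 : "=a"(lo), "=d"(hi)
                 : "a"(0)    // cpuid leaf 0
                 : "rbx", "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}
```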
In section 18.17 (19.17 as of June 2025) of Intel’s Architectures Software Developer’s Manual, Volume 3, they suggest using `rdtsc` with `lfence`, just like we did in `rdtsc_fenced()`.
There’s also `rdpmc`, but I got a segfault every time I tried to use it, so I gave up on it. John D. McCalpin from TACC uses it for measuring latencies at the cache level [13].
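For reference, reading a counter with `rdpmc` looks like the sketch below. My understanding is that it faults unless the kernel has enabled user-space counter access (the CR4.PCE bit, exposed on Linux via /sys/bus/event_source/devices/cpu/rdpmc), which would explain the segfaults I was seeing; treat that explanation as an educated guess:

```c
#include <stdint.h>

// Read performance counter `counter`. From user space this faults
// (shows up as a segfault) unless CR4.PCE allows user-mode rdpmc.
static inline uint64_t
rdpmc(uint32_t counter) {
    uint32_t lo, hi;
    asm volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}
```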
1. `gettimeofday` manpage: https://man7.org/linux/man-pages/man2/settimeofday.2.html archive ↩︎ ↩︎
2. cppreference docs on `std::chrono::steady_clock`: https://en.cppreference.com/w/cpp/chrono/steady_clock.html archive ↩︎
3. cppreference note on `std::chrono::high_resolution_clock`: https://en.cppreference.com/w/cpp/chrono/high_resolution_clock.html#Notes archive ↩︎
4. `clock_getres` manpage: https://man7.org/linux/man-pages/man2/clock_gettime.2.html archive ↩︎
5. TSC on AMD Ryzen CPUs is weird: https://www.agner.org/optimize/blog/read.php?i=838 archive ↩︎
6. Volume 2, section 4.3 of [7]: `rdtsc` archive ↩︎ ↩︎
7. Intel Architectures Software Developer’s Manual - March 2024 ↩︎ ↩︎ ↩︎ ↩︎
8. Volume 2, section 4.3 of [7]: `lfence` archive ↩︎ ↩︎
10. TACC: Comments on timing short code sections on Intel processors: https://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/ archive ↩︎
11. Stack Overflow: Is there any difference in between (rdtsc + lfence + rdtsc) and (rdtsc + rdtscp) in measuring execution time? archive ↩︎
12. How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures - September 2010 (324264-001) pdf ↩︎
13. Very low-overhead timer/counter interfaces for C on Intel 64 processors: https://github.com/jdmccalpin/low-overhead-timers ↩︎