Microarchitecture: What Happens Beneath

Matt Godbolt goes deep into the actual implementation details of modern Intel microarchitecture. We’ll explore how the execution engine really works - register files, schedulers, and memory disambiguation - and how branch predictors track history and correlate patterns. Through microbenchmarks and performance counter analysis, we’ll demonstrate surprising behaviours and show how to investigate them when they appear in your code.

Transcript

Jasper Chapman Black (00:00:00)

Hi everyone. Welcome to this Tech Talk. Really glad everyone could make it. I’m here to introduce Matt Godbolt. Matt Godbolt is a C++ developer who loves looking under the hood of compilers, operating systems and silicon. By day, he writes software for finance like many of us here; by night, he emulates vintage computers and maintains Compiler Explorer. Well, Compiler Explorer is the official name, but everybody knows it as Godbolt. So naturally, when someone set up the Jane Street internal instance, they named it Godbolt, making Matt the first external speaker who already has an internal domain name named after him. Please join me in welcoming Matt Godbolt.

Matt Godbolt (00:00:45)

Thank you, Jasper. Well, thank you for having me. This is fantastic. I’m looking forward to talking to you about the kind of things that I love digging into. So this is Microarchitecture: What Happens Beneath, which is not my best title. So yeah, as Jasper says, Compiler Explorer is why you’ve probably heard of me before. I love making emulators for old computers. That’s what I’ve been doing for the last year as I’ve been on a non-compete. My non-compete ended today, which is why I’m able to be here talking to you. But next week I’ll be working at HRT, so we’ll be neighbors and I’ll be up the street, friendly competition.

(00:01:14)

I also love trying to reverse engineer what the heck is going on inside your CPU, right? Intel famously doesn’t give a lot of information about what’s going on inside the computer, and that’s for obvious reasons to do with, you know, intellectual property, whatever. But there’s a sort of small but dedicated community of folks who are trying to work out what actually is going on inside. This is my contribution to it, where I was trying to work out how the branch target buffer worked inside some of the older chips. We’re gonna be talking about some old chips today, unfortunately, because I don’t have access to some very, very new stuff because I’ve been on non-compete for a year and it’s very expensive to rent them. But anyway, I’ve done some research before and my sort of claim to fame on this is that I was cited in the Meltdown and Spectre paper, which was a kind of nice moment of recognition for this kind of stuff.

(00:02:08)

You’ve probably seen a CPU pipeline before in textbooks and I know I’ve been speaking to some folks who are now sitting very threateningly in the front row and they have a very deep understanding of everything we’re gonna cover here. But for the rest of y’all, hopefully you’ll forgive me doing a quick sort of introduction reminder of what a CPU pipeline looks like. So when I was growing up, this is what a CPU pipeline looked like. You fetched an instruction, you decoded it, you executed it, and you wrote it back, in sequence. And then, you know, you could do this as a sort of production line, as a pipeline, one step at a time. So one instruction was being fetched while the previous one was being decoded, the one before that was being executed, and the one before that was being written back. And therefore, although it takes four cycles to get through here, we’re still doing one useful piece of work every clock cycle. And that was a big win.

(00:02:50)

Modern CPUs, though, realize that we can do better than just this sort of single step at a time, and we can break the system into sort of three broad sections. We’ve got a front end where we’ve got a branch prediction unit. I wanted to talk to you a lot about branch prediction. I actually did a whole bunch of reverse engineering work for it, but I don’t have time, so we’ll have to talk about it afterwards or in the pub or something like that.

(00:03:33)

But anyway, there’s a fetch and a decoder as before, and we have a rename step, which is sort of the gateway into a new world. And then after that, this middle section here, this is the backend, where things can happen out of order, which sounds counterintuitive. But in programs very often we write code which has lots of little parts that aren’t totally dependent on each other. You know, you read a value, add 10 to it, you store it somewhere else; you read another value, you add 20 to it, you store it somewhere else. It’s like, well, those two things could happen at the same time. And so if you can pick up enough instructions in one go, you can actually do more than one piece of work per clock cycle. And that’s what this sort of stack is here in the Execute. That’s what the out of order system is doing: it’s doing some clever dependency tracking, and then everything happens whenever execution units are available and inputs are ready. And then finally we come to the retirement stage, and that’s where everything gets put back in the order that the program was written in so that you don’t notice the monkey business that’s been going on in the middle bit.
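
A minimal illustration of the kind of independence being described here (the function and variable names are made up for the example): the two statements below don’t depend on each other, so nothing stops the hardware from working on both at once.

    // Two independent read-modify-store sequences: an out-of-order core can overlap them.
    void independent_work(const int* a, const int* b, int* out_a, int* out_b) {
        *out_a = *a + 10;  // read a value, add 10, store it somewhere else
        *out_b = *b + 20;  // read another value, add 20, store it somewhere else
    }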

(00:04:17)

Okay. But we’re gonna be talking specifically about Skylake, which is old, 2019, 2020-ish era is when it was out. But again, my laptop is a Skylake, my desktop machine is a Comet Lake or Skylake X. Actually, yeah, the laptop’s Comet Lake. So these are representative of the kinds of things that actually happen in a modern CPU, but they have changed in the kind of server-class CPUs that you’ll be using in your day-to-day job. Where I remember them, I will try and call out the differences, but again, I couldn’t test things on more modern machines. So out of order on Skylake is a little bit more complicated than that nice textbook diagram. We have a front end that looks like this. Again, the branch prediction unit we’re not gonna be talking about. We have sort of two parallel lines: the fetch, pre-decode, instruction queue, and then many decoders. Beneath that, we’ve got a micro op cache; to the right of that, we’ve got a queue of micro operations and the amusingly named LSD. Then we hit the renamer. We’re gonna talk about all of these things in a second. I just wanna give you a big picture of where we’re going with all of this.

(00:05:16)

What was I gonna say about this? There was something I meant to say, it’ll come to me. We’ll cover it anyway. So anyway, this is the front end, and at this point here, by the time… oh, that’s right, micro ops. I’ve written micro op here. So instructions are, specifically on Intel, broken into smaller operations, micro operations. So unlike most RISC processors where you’re just adding two numbers together or you’re reading or you’re doing a branch, some of the Intel instructions are crazy and you can do some crazy address calculation, which is one number plus another one times four plus some offset. And that’s the address that you’re gonna read from. Then you read from that address, then you’re gonna add to it and then you’re gonna store back to it. And that’s one instruction. So it makes sense for the poor front end here, which deals in terms of those crazy x86 instructions, to break them down into a sensible RISC-like form for the back end. But we never get to see that. We just have to infer its presence from all the measurements we do on the outside. So that’s what a micro op is.

(00:06:05)

Yeah, so by the time we get to this point here, we have a stream of micro operations that are in program order. They get to the renamer and then they reach the backend, which is a little bit smaller and simpler. And we won’t be covering as much of this if I’ve got my timings right; we’ve probably only got five minutes of this at the end. But we’ve got a scheduler, we’ve got multiple execution units, not the four that are here, there are actually many more than that. And there’s a writeback stage, then we have the retirement, and in post-retirement we have this store buffer commit phase. And again, each of these steps is probably a simplification, but this is mostly built out of measurements that we can make. And by we, I don’t mean me actually, although I have contributed to this in the past; there’s a community of people that reverse engineer all this kind of stuff, and I’m relying very much on their results that I’m presenting here. These things are not necessarily found in the Intel manuals; you have to go and dig them out yourself. So I have some references at the end if you’re interested in this and how you can measure it yourself. Knock yourself out, it’s a really, really interesting rabbit hole to go down.

(00:06:44)

This selection of TLAs, this soup of letters over here are all the sort of shared data structures that each of the bits of the system can see. We’ll talk about those separately, but I wanted to call them out to begin with. We’ve got physical register storage. So while you tend to write programs or see programs in assembly that refer to, you know, EAX and ECX and EDX, whatever the 16 registers that we have access to, 16, what a luxurious number. I know we were talking about earlier that we probably could do with even more, but it wasn’t that long ago we only had eight of them, of which two of them were used for other purposes. But that is small fry compared to the number of register slots that are actually available on the CPU. And so behind the scenes, it’s doing all sorts of allocation and renaming and mapping to take advantage of the extra space it has on the chip. So the physical register file contains all of the actual values of the registers and there are probably hundreds of those.

(00:07:40)

The RAT is the register alias table, and that’s the structure that maps the current EAX to where on earth it currently is on the chip, and other things. There’s a reorder buffer which stores essentially a ledger of all of the micro operations that have flowed through the system, and is essentially the sort of record keeping of: this is the point at which we know an instruction has been issued, and later on we will retire instructions in the order that they came through into this reorder buffer.

(00:08:32)

Then there’s a reservation station, which is also called the scheduler in some literature; that’s where we actually store the things that have yet to be processed, the actual operations themselves. So it’s like, I need to do an add of this number and that other number, or this number and a number that we’ll get when the read that it depends upon has completed. Those live in the reservation station. The branch order buffer is a sort of checkpointing system for branch misprediction. Again, I wanted to talk more about this, but sadly we don’t have time. But it’s fascinating if you sit down and think about what the thing must be doing when it’s predicting, you know, 20 branches into the future, and then it gets one of them wrong and it doesn’t wanna throw all of the work away. And then there’s a memory order buffer, which is responsible for making sure that loads and stores to memory still make sense even though we’ve reordered everything.

(00:09:25)

So we’re gonna start with the front end. This is gonna be the majority of the talk and I forgot to start my timer, so I’m gonna have to rely on looking at the clock behind you to see where I am. I’m gonna use this example and I’ve sort of, it’s thematic that I have to do it in C++, although we did attempt to write this in OCaml earlier, it did not generate the same code. I will leave that to your imagination. So it’s a bit of a made up thing. It’s taking an array of integers, it’s squaring them and summing them up. So it’s just doing like a rolling sum of all of the integers, sum of squares. And it suits my purpose because it compiles to six instructions and exactly 16 bytes of machine code.

(00:10:11)

So I’m sure you’re reasonably familiar with this, but on the right hand side we have this text representation, which is the assembly code, and on the left hand side is the bytes that the CPU actually reads out. That’s the machine code. It used to catch me out all the time. I used to use the two interchangeably, which was foolish and you know, you don’t really need to know, but it’s like we read the array pointer, we square it, we add it to our running total, we move the array pointer to the next element, we compare it to the end one and then we jump back if it isn’t at the end yet. Pretty straightforward stuff. 16 bytes worth.
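
A minimal sketch of the example being described, reconstructed from the description rather than copied from the slide, so treat the exact source and the register names in the comments as approximate. The comments map each step onto the rough six-instruction loop he walks through.

    // Sum of squares over an array of ints: the running example used throughout the talk.
    int sum_of_squares(const int* p, const int* end) {
        int total = 0;
        while (p != end) {    // cmp rdi, <end> / jne: compare with the end pointer, jump back
            int v = *p;       // mov eax, [rdi]: read through the array pointer
            total += v * v;   // imul eax, eax then add edx, eax: square it, add to the total
            ++p;              // add rdi, 4: move the array pointer to the next element
        }
        return total;
    }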

(00:11:02)

What you’ll notice though immediately is that each of these is a different length. We’ve got a two byte instruction, a three byte one. There’s even a four byte one here. In general, x86 instructions can be anything from one byte to 15 bytes long. And the encoding is not sensibly designed, because it’s kind of the sort of thing that happens when you take a design from the 1970s and incrementally add things to it while maintaining backwards compatibility, which means that sort of everything harks back to these ancient ways and there’s like bytes that say, hey, the next byte is now interpreted differently unless it’s Tuesday and the moon is rising, in which case… So it’s really quite complicated to disassemble x86. Anyone who uses ARM or MIPS or RISC-V or whatever, it’s lovely, all the instructions are the same length. So it’s very simple, but that’s kind of why we need to do this complicated decode.

(00:11:51)

Another thing I’m gonna quickly show you is an instruction trace of a sort. This comes through a somewhat heavily modified tool called uiCA, which is the code analyzer that Andreas Abel, who is now at Google, has written and open sourced, and this sort of lets us see an individual instruction’s journey through the pipeline. So what we’re seeing is one instruction and then all the stages it goes through from left to right, time based. So this instruction takes 16 cycles: on the first cycle here it’s being pre-decoded, it’s hitting the instruction queue here, it’s being issued, dispatched, executed, and then retired. In this particular example I just wanna show an example; we’ll have a few more of these up and I’ll explain them a bit more as we hit them, but I don’t want it to be surprising when it pops up.

(00:12:36)

All right, first stage, fetching. This is probably the easiest stage, although I’m gonna gloss over a whole bunch of really complicated stuff that it does. So we have the predicted instruction pointer. So the fetch unit doesn’t wait for the program to definitely go somewhere. It’s being told where to go by the branch predictor at all times. And so that’s kind of a really interesting aspect that I had never even considered, because you might pick up a chunk of 16 bytes and until you’ve decoded it, which is like four or five steps down, you don’t even know if there’s a branch in there. And by that time you’ve already pulled up two more 16 byte chunks of memory and you’re like, well, I went the wrong way, right? And that would be an unconditional branch, right? If it’s conditional you have to actually execute it to work out whether or not you’re gonna take it. So the branch predictor has got a big job, and it does, by and large, a good job.

(00:13:14)

So the fetch unit picks up 16 bytes at a time at a 16 byte boundary, which means that if you branch into the middle of a 16 byte block, then you’re immediately wasting a bit of bandwidth there, because you’re missing out on some of the bytes that you could have picked up in one cycle. The other thing to note here is that I’m not going to talk about caches at all, because that’s another two hour talk, but this obviously has to liaise with the TLB to go and work out where the heck these instructions actually come from, and with the L1, L2, L3 caches to actually fetch the bytes into a 64 byte cache line and then pull 16 bytes of it down. So there’s a lot of work hiding here, but the result of this fetch stage is some kind of pipeline of chunks of 16 bytes, presumably with some kind of address tagged on them so we know where they came from and it can be checked later on.

(00:14:03)

All right, easy one done. The x86 instruction set is complicated, and what we want to do is unlock this parallelism where we can run more than one instruction at a time. If the most that we could do in a single clock cycle was decode one instruction, then we’d be missing a trick down the line. And so what we want to try and do is decode as many instructions as we possibly can out of this block of 16 bytes in one clock cycle. So it’s broken down into stages.

(00:14:47)

The first stage here is called a pre decode stage. And the pre decoders have a set of kind of flaky heuristics about what these bytes might mean. They’re flaky because of the only explanation that we can think of, again, “we” being, in this instance, Agner Fog, for how it can possibly decode this: we have no idea where the instructions lie in here, all we know is that they don’t overlap with each other and that each one depends, of course, on having decoded the previous instruction. So how could you possibly do it in one cycle? The only thing we can think of is that it just speculatively tries to decode an instruction at every possible offset, and then it has a sort of filter step where it says, well, these ones overlap with the one that’s come before, so that can’t be a valid decode. And again, it’s a bit of a heuristic hack. It never gets it wrong per se, but it sometimes conservatively tags instructions for a later stage of the pipeline as being more complicated than they actually are. So we’ll get to that in a sec.

(00:15:29)

So these 16 bytes happen to represent exactly that routine that I showed. The pre-decode just breaks it down and says, well, this is the opcode byte, these are the operand bytes, whatever. Again, a lot of this, when I’m pointing at this side here, is my kind of guesstimate about what’s going on. It’s not canonical, there’s no source for exactly what’s flowing here, but hopefully you’ll forgive me. And the interesting thing here is that we’ve been able to discover where the bytes of the first instruction are, which is that MOV, the second one, which is the IMUL, the add EDX, the add RDI here. And then this last one is a compare and a jump. And at this stage in the pipeline it spots an opportunity, because although we treat them as two separate instructions, we have to, that’s what the ISA is, it’s so common, you know, every loop ends in a compare and a jump if not equal or something like that, that Intel have decided to say, well, what if we had an instruction which was just a compare and jump as one unit? Well, we could make a new instruction, we could issue it and tell all the compiler writers to, you know, actually emit this instruction. Or we could just spot it in the instruction stream and replace it here. And that’s what they do. This is called macro instruction fusion, and it’s tagged at least in this pre decode step.

(00:16:12)

So we still don’t know really what these bytes are. And again, sometimes the bit masks or whatever it does are wrong. I’m gonna run outta time if I keep doing this.

(00:16:55)

Okay, yeah, so the limitations of the pre decoder: it can do up to five instructions in a cycle, and it does this macro op fusion. Now, at the moment, and by the moment I mean for Skylake, it may have changed, but I don’t think that this one has: a compare or test on anything that isn’t memory (memory being the more complicated case), followed by any branch, is treated as a single operation. An add, subtract, increment, decrement, or an and, followed by a branch from a more limited set, is treated again as a single unit, as a compare and branch. It’s worth saying that you probably don’t do this, although you write compilers, so you might have the latitude to: don’t put in any length-changing prefixes. These are usually used in like the crazy 32 bit modes to use smaller pointers, but they effectively snarl up the pre decoder. It just goes, oops, I don’t know, because you can make an arbitrarily complicated instruction out of just putting random prefixes on it, and it just throws its hands up. It takes, I think it’s a three cycle penalty if it sees one; it’s like, whoop, I’m out. And yes, the other thing that it does do is it notices whether or not an instruction is simple or complex, and so it steers that instruction to the appropriate decoder. And steering’s not really the right word, but I’ll tell you what I mean here.

(00:18:01)

So these sort of partially decoded instructions, which again are just blocks of bytes and presumably some kind of sequence number, I keep putting “and seq” here. I don’t think it uses addresses internally to track things, because obviously if you’ve got a loop, you’ll see the same address over and over again. It must be some kind of global sequence number that’s used for like, hey, this branch was mispredicted, everything after this is now garbage. So I’m just gonna say “and sequence number”. So we’ve got partially decoded instructions and they’re passed on to the decoders. There are four decoder units, and they’re passed to them in order. The result of which is, and again this is me making up my own syntax over here, the micro operations. In this particular example, every instruction is one micro operation, so each one on that side is one thing on this side here. So one of ‘em is like a 32 bit load through RDI, and a multiply, and an add, and an add, whatever. And then this one is a macro fused one, for which I’ve made up a compare-and-jump instruction, variously, these kinds of things.

(00:18:54)

Okay, there are two kinds of fusion. One is the macro fusion we’ve already talked about. So macro fusion, macro is big, this is how I remember them. That means that two big things, the instructions, have been fused into one internal operation. So we’ve talked about that. Micro fusion on the other hand: well, with this stupid ISA that Intel have come up with, it’s so, so common to have a memory operand on the right hand side. While this is logically two operations, that is, an add rax, [r14] is go calculate the address from R14, go read that from the cache and then come back, and then go to the ALU and add it to rax, right? That’s two distinct operations, practically, most of the time. What if we just made our micro op format a little bit wider and had like an optional memory tag? Then for most instructions, where we’ve got space, we can say, well, this is both an add and a read, and for the simpler cases that can happen. And so a micro fused instruction will later on actually be executed by two different parts of the chip as two micro operations. But at this point we can pack it into one micro op in all the queues and things that are coming up. So we’ve got micro fusion, macro fusion.
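
A small, hedged illustration of the kind of source line micro fusion is aimed at; the function is made up, and the exact instruction a compiler emits will vary, but a read-and-accumulate like this typically becomes an add with a memory operand: one fused micro op in the front end that is later executed as a separate load and add.

    // "total += *p" can compile to something like "add eax, [rdi]": logically a load plus
    // an add, but carried around the front end as a single micro-fused micro-op.
    int accumulate(const int* p, int total) {
        total += *p;
        return total;
    }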

(00:19:41)

There is also a microcode sequencer. The microcode sequencer is effectively a ROM with a sort of simple interpreter put on top of it. And so anything that’s more complicated than four micro operations needs to come from that ROM rather than just being sort of hardwired into the decoder. And that ROM can either just literally puke out a bunch of micro ops, here’s 15 micro ops for a complicated instruction, or it can actually have logic in it itself. It can actually go like, oh, okay, well, while ECX is not equal to zero, keep issuing these micro ops, right? And so it’s used for things that are complicated, like locked operations, which I’m sure anyone who’s done threading has seen, those take more than four operations, or obviously things like rep stos, which is a repeating operation. That is effectively just a microcode sequence: just keep emitting the read and write operations until you hit the end of the thing that you are copying. And obviously complicated things like rdmsr or cpuid, they, you know, effectively run a little program: it’s gonna go off and say, hey, what kind of CPU are you? Oh, I can return that. And so that’s kind of cool.

(00:20:32)

I actually met somebody who worked at Intel. He said, oh yeah, my neighbor writes microcode. Like, what a crazy job that must be, to sit there and just write this stuff that’s right on the other end of this. And then the other thing that’s sort of done by the microcode sequencer are things like divides. I know they’re also complicated, you know, more than four micro operations, but it was kind of an eye-opener to me. I was like, oh, that’s how divides work. It’s got all this acceleration circuitry, but that’s how they can short circuit. It’s like, well, okay, we’ve reached a point where there’s no more bits coming out, we can stop here. Anyway, I’ve got a thing about divides. You’ll learn that in a second.

(00:21:08)

So we’ve got four decoders, but not all decoders are created equal. The first decoder is the real decoder. It can do up to four micro ops. So any instruction that is four micro ops or less can go and be handled by the first decoder. Anything that’s tagged as complex, even if it actually turns out to be a simple operation, gets put to the first decoder and the other ones won’t touch it. Any instruction that requires a switch to the microcode sequencer has to go through the first decoder as well. So for things like that idiv, the first step will be, hey, here’s the setup code, and then now we’re gonna start reading from the microcode sequencer. Decoders two through four can only do a single micro operation. So only simple things, but this is actually counted in fused micro ops. So those adds with memory still count as one in this world, even though later on they’re gonna be broken into two.

(00:22:01)

Are we good so far? I’m seeing a lot of nods from the front row, which I appreciate very much. So you can decode four instructions or five micro ops, which means that if you really need to get lots of micro operations out, you can have a four micro operation instruction followed by a one micro operation instruction, and that can come out in one clock cycle, and then the other two decoders just have to sit and twiddle their thumbs. Or 3-1-1, 2-1-1-1 or 1-1-1-1, right? And luckily most things fit into this world unless you’re doing something silly like dividing. Again, the worst case you can possibly do is to have a sequence of instructions which are exactly two micro ops each, because they can only go to the first decoder and all the other ones are idle, and then the next cycle you generate another two. So that’s the worst case. But although I’ve just banged on about this and it sounds interesting and complicated, and Intel refer to this as the “legacy decode pipeline,” not that we have any other choices, thanks Intel, we’re mostly saved by the next thing we’re gonna talk about, which is the micro op cache and the loop stream detector.

(00:23:01)

I had a sketch of what might be in a micro op. I’ve been sort of trying to construct my own emulator/simulator in my mind so that I kind of exercise the path of where things would need to go, but we don’t need to talk about it ‘cause I don’t have time; it’s just to show you some pretty rainbow diagrams.

(00:23:41)

These are some other more complicated instructions that come out as being two micro ops, three and four, although most of these are still fused in the fused domain. I think this is only two and this is only one; I think this is the only one that’s actually three micro ops. But you can kind of see a push rax is like, well, I have to write rax to the stack pointer and then I have to update the stack pointer. Makes sense. xchg RBX, RAX is probably temp equals RBX, you know, the whole swap thing. It doesn’t do the XOR trick, I’m pretty sure, maybe it does. I mean, who knows. And then an add to memory here is one of those really complicated memory operations where we have to read the memory, we have to then add to it and then we have to write it back. And writing back is always two ops, because the address generation is separate from the data part of the store, for reasons we’ll touch on later.

(00:24:17)

All right. And then, yeah, because I have to put a divide on here. This is a 32 bit divide on Skylake. They’ve got a lot better. I mean, I’m sure Granite Rapids is a lot quicker than this, but the first time I sketched this and got it to print out, I was like, oh my golly. Yeah, it’s like a hundred cycles for the 64 bit version.

(00:24:37)

But what the interesting points are here is like, so we can see that like in the first clock cycle, the decoder zero is able to output these three micro ops, they kind of flow through and then it switches to the microcode sequencer and there’s a two cycle delay here and now it kind of carries on doing all of its whatever magical stuff is going on here. A note about these. These are very empirically derived from like looking at port pressure. So we’re not a hundred percent sure what the individual operations are doing and sometimes these, you know, when things finish, I’m not totally sure it’s right.

(00:25:16)

So yes, we’re saved by the micro op cache. The micro op cache sort of sits logically to the side of this whole complicated pipeline we’ve just been talking about, and we’re in one of two modes: either we are caching or we are reading from the cache. So unlike the regular sort of L1, L2, L3 caches, it’s not like every read goes and says, are you there? Okay, no, I better go and do it the hard way. It’s just that you are in one mode or the other mode, and it’s jumps that determine whether or not that process starts.

(00:25:30)

So yes, in caching mode, as we’re running through our program, the micro ops that the translation engine has been producing, sorry, MITE is the micro instruction translation engine, that’s again Intel loving to come up with complicated names for things, or the legacy pipeline, they get written into this cache. Or, if a jump happens and the part of the branch prediction or whatever part notices that the destination of that jump happens to be in the micro-op cache, then we start streaming instead from the micro op cache, and this is the happy place, you wanna be in this situation. So the result of that is just also micro operations.

(00:26:14)

The micro operation cache is the weirdest cache you’ve ever seen, because it has a really difficult job. If you think about a normal cache like an L1, L2, whatever, you’ve got a one-to-one mapping. It’s like, this byte is in that cache line, and then you just fetch the whole cache line around it and you whack it in the cache and you’re done. But if we jump to an address that’s not on like a cache line boundary, then, well, we decode that instruction, but we can’t go back and say, well, what was the instruction before it then? You just have to kind of start at that point.

(00:26:57)

And so there’s a lot of really weird restrictions on it. This has changed a lot since Skylake, but Skylake has 32 sets by eight ways, and each way (cache line) has up to six micro operations in it, so roughly 1,500 micro operations in total. Some of the micro ops take two slots. So if you’ve got one of the MOVs with a 64 bit value, you need to have two slots in order to store the 64 bit value. It is inclusive with the L1i, which is a clue into how it’s implemented. And then the weird thing here is that no more than three ways can be used by any 32 bytes of code. And this to me smells like they had hit a real snag, because morally this is very similar to the trace cache that the Pentium 4 used, and I don’t know, you all look too young, but the problem with that was that you have as many entries in the cache as there are traces through your code. And so very quickly you blow up as the traces change. So I think they try and limit that by saying if any block of memory, any small area of the program, needs more than three lines, then we’re probably in a case where it’s about to start monopolizing the whole cache, so we’d rather throw it away and prevent it from being cached at all.

(00:27:37)

It’s worth saying that any branch, even a not-taken branch, ends a cache line, and then it starts caching from the next thing. So it kind of tries to find the sensible boundaries. But one of the sort of take-homes from this is that if you are generating code and you have more than one entry point into the same 32 byte block, then you might not be using the DSB as effectively as you’d like. And I know that it’s quite typical in compilers to start with, hey, let’s do some setup code, then we jump to the end of the loop, which does the loop end, and then it jumps to just below the setup code to then run round, and you’re like, ah, now I’ve got two entry points. So you do have to think about it; two is probably fine. And yeah, I forgot to say this before, but you know, this is why we typically align loops on 16 byte boundaries: so that you get the benefit of the fetch system picking up the loop.

(00:28:25)

But yes, in newer machines it is more like you can have six ways in 64 bytes. So it’s a bit of a weirder thing, more closely tied to the L1 cache. All right, I think, yes. And then it can deliver anywhere between four and six micro ops per cycle, which means that even if they’re complicated or simple or whatever instructions, you can get a pretty consistent stream out of it. The literature says six, I could only ever measure four, so I dunno, whatever. All right. Yes, yes. The loop stream detector. So in between this sort of source of micro operations that’s coming from either the legacy system or the cache, there is a queue, obviously, you know, we love queues; queues buffer things up if the renamer or the execution units are sort of backing up a little bit, so we’ve got a bit of space in here. But this queue effectively is a big circular buffer of a couple of hundred entries that we’re writing micro ops into in program order.

(00:29:17)

And what the loop stream detector does underneath is it says, wait a second, you’ve jumped to a location that’s actually already in this buffer. You know, it may have already popped off the end of it, but we haven’t overwritten it yet. Wait a sec, wouldn’t it be better if we just kind of held onto that part and just kept streaming back that same sequence over and over and over again? And so that’s what the loop stream detector does, and that means it can turn off the whole front end: there’s no caching that has to be done, there’s no decoding, it’s just brilliant. But it’s relatively small and it has limitations. So it spots loops that fit in the queue entirely. It can deliver, each cycle, up to four micro operations, but it can’t deliver more than one loop iteration. So if you have a really small loop, or a loop that isn’t a multiple of four, then you might get, you know, if it was five micro ops, four in the first cycle and one in the next one, and then four and then one, and four and one. Except it can unroll: it actually spots the longest thing it can get, up to eight times in Skylake, so that it’ll effectively unroll that loop in hardware and go, well, if I take my five micro ops and I do them four times, my maths is not good, five fours are 20, 20 divided by four, yes, I think we’re good, right? Yeah, something like that. So then it would just fit, and then every cycle it can stamp out four, which is pretty cool, which is great. But why was it in parentheses and why did it have a star? Well, it’s disabled on Skylake, which was reported due to a massive bug.

(00:30:22)

And this is where I’m actually gonna have to bring up my notes. Sorry, you’re gonna have to see Speaker View for a second, because yes, here we go. It was described as a “nightmare level bug” in the Debian fix. Short loops using the AH/BH/CH/DH registers, you know, the high part of the 16-bit register, and the corresponding wide register may result in unpredictable system behavior. I mean, whoa, that is scary stuff. So props to them, I dunno what it was. And there are people in this audience who probably can tell, but something about the OCaml runtime or the OCaml compiler liked to do this: in a loop it would use the high eight bits and the low eight bits, maybe tagging, oh, I’m looking over at you two because you’re my favorite two to look at. But yeah, it was found by some folks in the OCaml community and reported, and then Intel determined it couldn’t be fixed and they just issued a microcode patch which turned the whole thing off. Wow, okay. Yes, yes, OCaml: we did weird things so you didn’t have to, right.

(00:31:37)

Okay, renaming. So now we get to the cool part here. Now again, those of you who’ve done college level courses in this kind of out of order execution, this is probably meat and drink, but I’m gonna go over it anyway ‘cause it’s just really interesting, and the penny dropped for me writing these slides about how important this whole process is. So the renamer has probably the biggest job in the entirety of the front end, because it has about three different jobs. The first thing that happens is that now the micro ops have reached this point, this is the first point where we can say this is definitely gonna happen. I mean, it’s definitely speculatively gonna happen; it might be undone later on when we discover we shouldn’t have gone this way. But we’re gonna write it into the reorder buffer, which just says this instruction happened or will happen. We also then rename them to physical registers. So we take the EAXs and the RDIs and whatever, and we use this vast array of on-chip resources to actually use for registers. And again, I’ll talk about it in a sec. And then it also takes the micro operations and ships them off to the right execution units, to sit in those schedulers and wait until either the execution unit is ready to accept a new operation, maybe, you know, the multiplier is backed up and there are other instructions ahead of it, or maybe this instruction depends on the result of a previous instruction that hasn’t completed yet, and so it’ll sit there and wait until all of its dependencies come in. But this is the point at which we kind of fan out into all the various data structures, and then the out of order system picks it up on the other side.

(00:32:30)

Amazingly, although it’s doing all this complicated work, it can do four micro ops a cycle. This is staggering. Maybe this is the kind of thing that in hardware is a lot easier than it sounds, right? This is the kind of thing you discover.

(00:33:15)

So the first thing we do is the reorder buffer write. This is just to have our ledger for in-order completion later on. There is a process called unlamination that happens at this point. So at this point, sometimes a micro op that came through is determined to be actually more complicated than can be done as a single unit, and so it’ll be broken up at that point into multiple pieces. Some of the more complicated addressing modes sometimes fall into this category, for operations that could otherwise go to like the adder; it’s like, well, the adder can’t do this. I’m a little vague on that, I’ll be honest with you. But unlamination is a thing that can generate extra micro ops at this point.

(00:33:57)

There are 224 entries in the reorder buffer; that is the sort of top end of how many things can possibly be in flight in the system at once. That’s more like 500 on, I’m looking at you two in the front again for this, I think it’s about 500 in Granite Rapids and above. So that’s a place that they have changed a lot and expanded.

(00:34:48)

It kind of looks something like this. You know, we have some kind of sequence number for each of the instructions, in sequence. The information it needs to track is where we are writing to, and then what rename happened. We’ll get to why we needed that information. It’s sort of weird that we need both the new mapping and the old version: partly for undo, you know, if we were to talk about branch stuff, but partly for another thing that I’d never considered before. And then we’ve got some kind of type information. There’s something to do with, if it’s a branch, where its sort of restore state might be stored. And then the important thing over here is that while all of this is written to in this ROB write, this part over here, the execution status and exception, is gonna be written to once the instruction has actually finished. So it’s gonna go through the rest of the pipeline and come back here.
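
A guess at what a reorder buffer entry might hold, built only from the fields just described; the real layout isn’t documented, and the PhysReg numbering mirrors the made-up “P45”-style syntax on the slides.

    #include <cstdint>

    using PhysReg = std::uint16_t;   // made-up physical register id, like "P45"

    struct RobEntry {
        std::uint64_t seq;        // sequence number: entries retire strictly in this order
        std::uint8_t  arch_dest;  // which architectural register (EAX, RDI, ...) is written, if any
        PhysReg new_mapping;      // physical register the renamer allocated for the result
        PhysReg old_mapping;      // the previous mapping, kept for undo and freed at retirement
        std::uint8_t  kind;       // some type information (branch, load, store, ...)
        std::uint16_t restore;    // for branches, a reference to checkpoint/restore state (a guess)
        bool completed;           // written back later, once execution has actually finished
        bool faulted;             // likewise for exceptions, checked at retirement
    };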

(00:35:23)

And so in my sort of example I’ve got up here, this first instruction, this load, has completed, which means it’s gone through the system. It’s finished, it’s done with, and it could be retired at any point; the retirement unit could finish it off. And you can see, I can’t remember what I did, oh yeah, it’s got a source down here of P22. Yeah, there’s sort of an implied dependency and it’s not really that important. And again, we don’t really know what this looks like; this is the minimum set of information I think it needs. The issuing step is where we actually take the live, yet-to-be-calculated micro op and hand it off to these execution units. The execution units are in two big blocks that have many ports on them. It’s a weird terminology, but there’s an ALU block that has all of the things to do with, you know, arithmetic and logic, as you might imagine, and then there’s a sort of memory system which only does loads and address generation, and they have different numbers of entries. They are presumably balanced by some kind of internal pressure inside both the layout and the complexity of those things, and also, you know, the experiments that Intel do in terms of how many of each type of instruction they typically have live at once. The issuing also determines which port within each of these. So these are further subdivided, but this is the point at which it’s decided where it’s going to ultimately go. So even if it’s an add and there are like three different locations on the chip where that add could happen, we decide now, at rename time, which of the adders it’s gonna go to. Which is a little bit myopic, ‘cause you know, by the time it actually gets ready to run, maybe that’s not the best choice of adder; maybe there are some free adders somewhere else. But it’s useful to know, and obviously it’s a simplifying assumption here.

(00:36:54)

The algorithm has been reverse engineered, with high probability that they got it right. And it’s kind of as simple as you might imagine. It’s to do with the balance of how many operations have gone recently, and it tends to pick the higher port numbers if there’s a tie, and then, of the four instructions issued, one and three go to the best place to schedule them and two and four go to the second best place. So it kind of tries to balance them as well. Again, it’s documented; you can go read the Python code that does the same thing. And at this point also, now separate from unlamination (which actually makes two entries in the reorder buffer for an instruction that we noticed later on is a bit more complicated), anything that’s actually micro fused, like that add through RDI, gets separated into the load component, which is gonna go over here into the load unit, and the add part, which depends on the load and is gonna go into the ALU component.

(00:37:45)

So the difference between unlaminated micro ops and micro fusion is that a micro fused op only has one slot in the reorder buffer, whereas the unlaminated stuff actually takes up more resources there, not that it’s a huge problem with 224 of them. All right, good, good. I think really the reason for me putting this up here is to sort of show you what information is being stored in each part, because for the longest time I had the reservation station, which is the scheduler, sorry, the thing that the issue step is writing to, and the reorder buffer kind of mixed up in my mind. So this is more of a symptom of me explaining it to myself so that I can talk to you about it, but hopefully you’ll forgive me.

(00:38:41)

So it looks something like this. We have a bunch of micro operations. These are no longer in any particular order, and we have, you know, maybe three sources; that can happen. And we’ve got a destination that it’s gonna write to, and it’s an add. And this is a guess now: I have been unable to find anything that tells me whether or not the values of known quantities are stored in the scheduler, in this sort of micro operation, or whether it’s just a reference. The way that I’ve chosen to show this is that this one has two resolved sources. So the instructions that generated these two numbers have already completed; maybe this one was actually in the opcode and this one was like reading from a register. These two have already completed, so I’ve written the values in here. And then with these guys here I’ve shown that I’m waiting for the previous instruction. That’s this guy, in fact, who’s gonna write to P24: I’m gonna wait for him, and then when he’s completed, he’ll tell me what the result is and I’ll put it in here.
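
One way to picture a reservation station (scheduler) entry following that description; whether real hardware stores operand values in the entry is, as he says, unknown, so everything here is an assumption and the field names are invented.

    #include <cstdint>

    using PhysReg = std::uint16_t;   // same made-up "P45"-style numbering as before

    struct Source {
        bool ready;            // has the producing micro-op completed yet?
        PhysReg waiting_on;    // who we're waiting for (e.g. "P24") if not ready
        std::uint64_t value;   // the operand value once it is known (if values live here at all)
    };

    struct SchedulerEntry {
        std::uint8_t op;       // what to do: add, multiply, load, ...
        std::uint8_t port;     // execution port chosen back at issue time
        Source src[3];         // up to three operands, resolved or pending
        PhysReg dest;          // physical register the result will be announced as
    };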

(00:39:26)

We don’t know if that actually happens, as in whether this broadcast action causes these slots to get written to. There’s some circumstantial evidence that maybe it does, because there’s a broadcast bus, but it also opens more questions about what happens if you are issuing instructions whose operands are all already known quantities, because then we have to read an arbitrary number of physical registers in one clock cycle. Like, if every instruction had three sources, each of which wasn’t already in flight, then you could imagine that issuing four of them means that we have to do four times three register reads. And that doesn’t seem realistic either. And no one has been able to find a limit to how many you can read, unlike Sandy Bridge, which had a very, very limited number of read ports. So something magic is happening, we dunno what. Anyhow, let’s talk about renaming.

(00:40:10)

So let’s talk about first of all the nuts and bolts of renaming. So what renaming is gonna do is gonna unlock more out of order opportunities down in the rest of the pipeline. And what we’re gonna do is we’re gonna break any dependency that we can. Because many times when we say EAX, we just mean the current value of EAX and we’re gonna do something to it and then we’re gonna throw it away and then we’re gonna again load something new into EAX and we’re gonna just keep doing this process over and over again. And those things are separable from each other.

(00:40:49)

And so one example of something where the values of the registers don’t depend on an earlier value of the register is a loop because you know, we’re gonna read into EAX every time around the loop and we’re gonna do some work with it and then we’re gonna store it into a value or whatever. And then we’re gonna start the loop again at the bottom here. So if we can break that dependency, it means that we can run maybe two iterations of the loop at the same time or even more. I’ll show you what I mean. We’ve got pictures coming for those of you furrowing your brow.

(00:41:32)

So the way that this works is we have a table that tells us where each register currently is: which of the physical registers, and again this is my made up syntax, P whatever, currently contains the most contemporary value of rax. So rax is P39, and whatever. And then there is a free list of registers. These are registers that are currently spare, chip real estate going spare, fabulous. Every time we read from a register, we’re gonna use its current mapping in the RAT. Every time we write to a register and completely replace it, it might as well be a new register. So we’re gonna pick a new one off the free list, use that instead, and then update the table to say this is where the rax register is now.

(00:42:18)

So in this first instruction we’re reading the RDI register, we’re gonna go and fetch memory from there, and then we’re writing into EAX. So it really doesn’t matter what was in EAX before. So P39, don’t care about you. What we’re gonna do is rename this to be P45, and hopefully you saw that that kind of went that way; that’s me popping it off the front of the free list. This is in hardware, remember, by the way; this is like pretty impressive. And then P22. And then this mul here is EAX with itself, right? I’m squaring EAX and I’m putting the result in EAX, and I know the opcode doesn’t have three operands, or it can for some multiplies, but you know, it’s been rewritten into this nice RISC-y form by this point. So when we read the EAXs, we’re gonna use P45, and then when we write, we’re gonna use a new one, which will be, yeah, P46. And this is indeed what happens, and so on and so forth. We can go through the rest of it here. And now we’ve got this whole rewritten set of micro operations that are phrased purely in terms of physical registers only, and we’ve sort of encoded the dependency between instruction values: if you follow the P45 in here, it’s used here and then it’s done with. So you know, that’s helped us.
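
A minimal sketch of that renaming idea, just to make the RAT and free list concrete; the types, numbering, and structure are invented for illustration and are not Intel’s implementation.

    #include <array>
    #include <cstdint>
    #include <deque>

    enum ArchReg { RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI, kNumArchRegs };
    using PhysReg = std::uint16_t;   // made-up "P45"-style physical register id

    struct Renamer {
        std::array<PhysReg, kNumArchRegs> rat{};  // register alias table: architectural -> physical
        std::deque<PhysReg> free_list;            // physical registers currently going spare

        // Reads just use whatever the RAT currently points at.
        PhysReg read(ArchReg r) const { return rat[r]; }

        // A full-width write might as well be a brand-new register: pop one off the free
        // list and repoint the RAT. The old mapping would be remembered in the reorder
        // buffer so it can be freed at retirement (or restored on a mispredict).
        PhysReg write(ArchReg r) {
            PhysReg p = free_list.front();
            free_list.pop_front();
            rat[r] = p;
            return p;
        }
    };

    // e.g. for "mov eax, [rdi]": address = renamer.read(RDI); dest = renamer.write(RAX);
    // and for "imul eax, eax":   src = renamer.read(RAX);     dest = renamer.write(RAX);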

(00:43:07)

So this is what that loop looks like, or rather a loop iteration and a half. I’ve got one and a half ‘cause that’s what fits on the slide. If we didn’t have a rename, this is what it would look like. So the first iteration of the loop is from here to about here, and you can see that the next iteration of the loop can’t even start because, well, I can’t read through RDI until I’ve finished adding to RDI from the previous iteration, right? Because it’s like, I have to wait for it, right? And it couldn’t start, we couldn’t add four to it to move it forward, until we’d finished reading the memory address at RDI, because presumably I still need the RDI register or whatever. Again, I had to hack the code so badly to make it do this, and I then went through it and sort of manually fixed it up.

(00:43:47)

But you get the idea. If we were to sort of show you the zoomed out view, it looks like this. And if you see these kinds of diagrams, you can squint and just look at the Rs really, and you kind of look down there, and this is, I mean, mostly throughput rather than latency, but it’s 10 cycles per iteration, which doesn’t seem totally unreasonable. I mean, obviously you’d use SIMD if you actually tried to do this for real, right, and you’d get more per iteration, but 10 cycles per iteration. If we rename it, on the other hand, it looks like this. So on the first cycle, we’ve already queued five of our instructions and we’ve issued four of them. The first and the fourth one can already complete on cycle six, because this add can be running concurrently with the load, as they don’t depend on each other anymore, and so on and so forth. And you can see every time there’s an E, there’s usually a D underneath it, because the E means it’s finished executing and the result becomes ready and can immediately be used in the instruction that depends upon it. And this is evidence for the broadcasting theory of this: as it kind of completes, it just goes, anyone who cares, P45 is now 27, and they’re like, oh cool, I’ve got it, and then it doesn’t have to wait another cycle.

(00:44:34)

Okay, I think we’ve got everything. So it looks pretty cool, but if we zoom out, it’s like one and a half cycles per iteration. It really makes a difference. It’s super cool. All right, I’ve labored that enough, right? Let’s talk about my favorite instruction: XOR EAX, EAX. You’ve all seen it, right? Yeah? Why the hell does it do that? Anyone wanna shout out? No, nobody wants to shout out. Sorry, it sets it to zero, but why not just do MOV EAX, 0, come on. Smaller instruction, it is, it’s a smaller instruction, but wait, there’s more. So let’s go through it first of all. So exclusive or-ing a number with itself always leads to zero, right? And because of the stupid syntax of x86, this is EAX XOR-equals EAX, right? So it ends up with zero, fine, the compiler loves to do it.

(00:45:27)

But the cool thing is the CPU knows that you love to do this as well. And so it goes, oh, if you’re setting a value to zero, I don’t need to do anything at all. All I’m gonna do is I’m gonna arrange to have a physical register that’s always the value zero and I will just rename it. Okay, rax is now P00, and anything that reads from it later on is just gonna get the value zero, right? We’re done here. It doesn’t issue at all. So although it’s written to the reorder buffer, it doesn’t go to any execution unit, it doesn’t take up any resources at all. It’s magic. We can also do the all-ones idioms. There are some parallel all-ones-generating idioms, so presumably there’s some other magical register that handles that.
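
Extending the renamer sketch from a moment ago (again purely illustrative: the always-zero register id and the function are made up), the zeroing idiom can be handled entirely in the RAT, with nothing issued to an execution unit.

    // Reserve one physical register that permanently holds zero, like the "P00" on the slide.
    constexpr PhysReg kAlwaysZero = 0;

    // Called when the front end recognises something like "xor eax, eax".
    void rename_zeroing_idiom(Renamer& renamer, ArchReg dst) {
        renamer.rat[dst] = kAlwaysZero;  // no free-list entry consumed, no micro-op issued
    }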

(00:46:12)

Okay. We can also do move elimination, because if you think about it, if I move RBX into RAX, all I’ve done now is I’ve just got two names for the same register. So we just have to update the RAT and say, well, you know, rax is now P88, which is incidentally where RBX is at the moment. This is no issue. That doesn’t mean there’s no problem, ‘cause there is a problem; it means nothing issues, there’s nothing written, there’s no actual work to do, there’s no actual ALU operation that needs to be done. But there is an issue, because this actually does introduce a complication. It’s surprising: there are limits. If I’ve got a slide, oh, notes here, oh, I’ve gone ahead by one. So there are limits, right? Because now we’re allowing a single physical register to be aliased by more than one architectural register, which means that suddenly we either have to do reference counting to determine when we can free that thing back up again, or we need another trick to determine when everyone’s finished with that particular physical register. What it seems to use is that it has four slots for aliases. And once one of those slots has been used, by having two registers pointed at the same thing, then until all of those registers have been overwritten, that P88 is locked, it can’t be freed, and also that slot can’t be used by anything more. So you can do four MOVs in a row and they’re all free, and then after that it becomes really expensive to move between registers until you overwrite both the source and the destination of the alias. That has gotten better.

(00:47:06)

And I think they now use a bitmask to track things in a much more clever way, because on Alder Lake and onwards they have another trick. And I know that I said I was sticking to Skylake; this is just too cool not to talk about. Because, you know, adds and subtracts are so straightforward. What if the renamer goes, well, hang on, an add with one: why don’t I just rename rax as P88 and just remember it’s got a one added to it? How cool. So again, small increments and adds don’t take up any resources at all. They just get recorded as sort of a housekeeping note inside the RAT.

(00:47:47)

We have to pay the price at some point. Two reasons. One, the range is limited, only about minus 1,024 to plus 1,023; it’s got 11 bits of range. And if it hits that limit, it actually does issue the micro operation that says, oh fine, add the thing, and then we know where we are, right? And then of course after that it can be renaming from the new value. So it’s pretty cool.
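
Another illustrative extension of the same sketch, showing how a RAT entry might carry a small offset alongside the physical register; the exact range and representation are assumptions based only on the 11-bit figure mentioned above.

    // Reusing the PhysReg alias from the renamer sketch.
    struct OffsetRatEntry {
        PhysReg reg;
        std::int16_t offset;   // assumed range of roughly -1024..+1023
    };

    // Returns true if the add was absorbed at rename time as pure bookkeeping; false means
    // the limit was hit and a real add micro-op has to be issued after all.
    bool try_rename_add_immediate(OffsetRatEntry& entry, int imm) {
        int next = entry.offset + imm;
        if (next < -1024 || next > 1023) return false;    // pay the price: issue the add
        entry.offset = static_cast<std::int16_t>(next);   // nothing goes to an execution unit
        return true;
    }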

(00:48:51)

There’s also, interestingly, if you think about what this really means: when P88 becomes ready, anything that depends on P88 has been given the burden of having to now actually do this damn add, right? And for many of the instructions, that’s fine, right? It’s like, okay, I’m reading from P88, and there’s an adder that kind of runs in parallel, and then by the time the result gets to me, it’s already had that offset that needed to be applied to it, applied to it. But the shift, like a logical shift, for whatever reason, takes an extra cycle to do it every time. And this is how we discovered this is going on, ‘cause this is not in the literature anywhere, right? If you do MOV RAX, 10 and a shift, you’ll measure the shift takes one cycle. If you do MOV RAX, 0, INC RAX, and then you do the shift, the shift now takes two cycles, because it pays for doing the add in a single cycle beforehand. And we don’t know why it does that. But the guess that we have is that for things like variable shifts, where you’re shifting by the value of another register, because the barrel shifter needs a lot of setup time and it’s a lot of circuitry to do that kind of arbitrary 64 bit shift, we figured that it probably needs the result a lot earlier in the cycle, and so it hasn’t got time to do the add and then set up this thing. And rather than just singling out the variable shifts, they just said all shifts; or it’s a mistake. I mean, mistakes have been made before, as we’ve seen. So this is super cool. It’s really interesting. There’s a whole blog post, which I think I’ve got at the end, that’s fascinating digging into this.

(00:50:02)

Okay, we are 50-odd minutes into the presentation. I don't think I've lost too many people; most people's eyes are still open, or they're awake, or both. This is the summary, this is where we are, and just getting to this stage is the meat of the talk. We've got a predicted program order, which gives us a sequence of instructions, or instruction pointers, which are fetched and decoded into micro-operations; those micro-operations are renamed, the reorder buffer is written to in program order, and the micro-operations that are pending execution get written into the scheduler. Great.

(00:50:49)

Okay. The backend looks really simple in comparison, but it hides the thing we actually care about, actually doing the work, right? Everything else is just preamble. I've already mentioned the scheduler and the reservation station, two names that Intel, or certainly Agner Fog, uses interchangeably, and his writing is where I'm getting a lot of my information; then there's execution and then the writeback. So it looks something like this. We've gone from that nice pipeline to a big soup of things that can be done at some point. We've got the two reservation stations, two schedulers, whatever we call them, with their 39 entries that are just sat there waiting for their operands to be ready and for an execution unit that they've been tagged to go to to become free. That's why they're sitting there in these queues. But again, the queues are not really in any order; it's just whatever's ready.

(00:51:37)

We've got the reorder buffer, and the writeback for the reorder buffer is just to say this instruction is now done, we've completed, no other values need to be carried. Now, way back when, in, oh crap, what was the one before Sandy Bridge? I'm forgetting now, they have such stupid names, I really wish they would come up with a good naming system. Anyway, earlier on the reorder buffer actually contained the physical registers; it was effectively used as the physical register store, and the results of everything were written back into the ROB. That's no longer the case. The registers live down here separately now, because we added vector operations, and if every ROB slot had to be 512 bits wide that would be a lot of waste, unless you're writing AVX-512 all the time, which maybe you are, I know you can't tell me, maybe you can, no.

(00:52:27)

So we've got the reservation station here, and various ports attached to it. Broadly, the big picture is that every clock cycle the scheduler decides which of these micro-operations are ready to run and issues them to one of these ports, each of which has several actual logic units behind it. You'd think a port is just an adder or a multiplier or whatever, but they each have different capabilities, and those units are pipelined themselves, fixed-length pipelines. When the result comes out, five cycles later, three cycles later, whatever, it comes back on this writeback path and says, yes, P47 is now 127. It's definitely written to the physical register file, and maybe it's also snooped on by some of the other entries that are waiting for that result, which then become ready to run, and the whole process just keeps going. Never stops, right? Fabulous.

(00:53:14)

Okay, so looking at the ports that we have, this again is Skylake; there are a whole bunch more on the more up-to-date Granite Rapids type things. But by and large you can see where all of the engineering effort went, onto this poor thing that gets all the work here. There are three units that can do integer and vector operations on simple integers, two that can do shifts, one that does the permute, some string operations and whatnot, and the number here is the latency, incidentally, so you're doing it wrong if you're still using x87, for what it's worth. The execution ports over here are doing all the arithmetic-type stuff, except for the one kid that doesn't look the same as the others, which is the store unit, and it's a bit weird that it doesn't live with the load/store units, but whatever.
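
As an aside on those latency numbers, the practical difference between a unit's latency and its per-cycle throughput is easy to see with a sketch like this (my example, not a slide from the talk): a dependent chain of 64-bit multiplies runs at roughly the multiplier's three-cycle latency per multiply, while four independent chains keep the port's own pipeline full and approach one per cycle.

    #include <cstdint>
    #include <cstdio>
    #include <x86intrin.h>

    // One long dependency chain: each multiply must wait for the previous result.
    __attribute__((noinline))
    static double dependent(uint64_t n, uint64_t m) {
        uint64_t x = 3, t0 = __rdtsc();
        for (uint64_t i = 0; i < n; ++i) x *= m;
        uint64_t t = __rdtsc() - t0;
        asm volatile("" :: "r"(x));          // keep the result alive
        return double(t) / n;
    }

    // Four independent chains: the multiplies overlap in the execution port's pipeline.
    __attribute__((noinline))
    static double independent(uint64_t n, uint64_t m) {
        uint64_t a = 3, b = 5, c = 7, d = 11, t0 = __rdtsc();
        for (uint64_t i = 0; i < n; i += 4) { a *= m; b *= m; c *= m; d *= m; }
        uint64_t t = __rdtsc() - t0;
        asm volatile("" :: "r"(a), "r"(b), "r"(c), "r"(d));
        return double(t) / n;
    }

    int main(int argc, char**) {
        const uint64_t n = 400000000;
        const uint64_t m = 0x9E3779B97F4A7C15ull + argc;  // opaque enough to defeat constant folding
        std::printf("dependent multiplies  : %.2f ticks each\n", dependent(n, m));
        std::printf("independent multiplies: %.2f ticks each\n", independent(n, m));
    }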

(00:53:48)

Over on this side, we've got a bunch of load ports that can do work and can also generate the address of a store, and then one kid that can only do simple addresses. I dunno what happened there; maybe they copy-pasted it twice, didn't have enough room, edited it down a bit and said, well, we can't do the complicated addressing here, we'll just have that kid do the simple ones. Cool, right.

(00:54:18)

I forget where I was going with this, never mind. I think you get the idea: there are a lot of things that go to different ports. Typically, if there's a tie, it picks the highest-numbered of the available ports, because the higher ports can do fewer things, so if it can do a vector shift on this one it frees up ports zero and one for whatever weirder thing might be coming down the line. All right, so the address generation/load unit generates addresses, surprise, and it does the loads. But what does that really mean? Well, we're gonna have to introduce something called the MOB, the memory order buffer. This is probably the most complicated part and we only have like four minutes to do it, so there are necessarily going to be simplifications here. Some of the simplification is in my understanding, some of it is that no one has come up with a way of measuring exactly how this works yet, and some of it is user error on my part, I dunno.

(00:55:02)

So there are three kinds of parts to this. There's a load buffer and a store buffer, and there are some predictors which are absolutely bonkers; we talked about prediction earlier, and I don't think I have time to go into the depth we went into at the table, but there are predictions all over the place. The store buffer holds all of the stores that have not yet gone out to real memory. Remember, everything we're doing is speculative at this point, until the instruction retires. We haven't got a cast-iron guarantee that something upstream of it, something that happened in program order before it, hasn't gone, oh, I just divided by zero, or oh, I mispredicted a branch, or anything like that. So until it gets to the retirement unit we can't let the store go out to real memory, right? We have to hold it in this sort of holding pattern.

(00:55:48)

But of course, if I store to memory and then the next instruction reads from that memory, I might reasonably expect to see the number I just stored, right? It would be quite bad if I didn't, so we have to do something about this. So this store buffer wears many hats. One is to hold the stores until they go out to memory. Another is to handle the loads that come after them and might need to read from things that haven't yet been committed.

(00:56:29)

So it's broadly broken into these two parts, with the address calculation and the data calculation handled separately from each other. And this is why that store data unit lives over with the ALUs: where do the values that are going to be stored come from? Probably the ALU, while the address is generated by the address generation units. The point of separating them is that most of the time, maybe not all the time, let's just go with some of the time, you know what address you're going to be storing to earlier than you know the data that's going to be stored there. Say you square root something 20 times in a row, so you've got this huge long dependency chain of square roots, and then you store the result to address 20. I know it's going to store to address 20 pretty early on, but I'm going to have to wait 500 cycles before those square roots have all completed. Why do I care about the address? Well, if I've got a load coming along and it's not to address 20, I can do that load. So for the purposes of letting other instructions later in the flow get their results, I need the address more than I care about the data.

(00:57:10)

So each store buffer entry has the address, the data, how wide the store is, whether it's completed or not, and whether or not it caused a fault. Again, if it caused a fault but it was on a path we never actually reach, because we're going to divide by zero before we get there, Spectre and Meltdown notwithstanding, then we don't want that fault to actually take effect, so when it comes to retirement we'll check whether it faulted or not. So for stores we have to wait for both the address calculation and the data operation, then the store sits in that buffer serving incoming reads, and eventually we actually commit, which happens much later, after the store retires. The load buffer, on the other hand, holds the loads that are in flight and some status about the prediction stuff I alluded to but am not going to go into. It's used for tracking guesses: it makes some intelligent guesses about whether a load will never alias a pending store, in which case it can issue it really, really early. The problem is that if it gets that wrong it's incredibly expensive: it has to do a whole bunch of state clearing to undo the fact that it let a load happen that was actually dependent on an earlier store that hadn't run yet. There's no fine-grained undo mechanism for that, so it just throws everything away.

(00:57:58)

So each load entry has some kind of reference to its place in the ROB, how wide it is and what address it's being served from, and again, entries here are kept in order. Then there's the store color, which is a reference; it's called that in the literature. There's a patent that talks about store color, but it doesn't go into any detail about what it really means. If you look at a RISC-V implementation of this, it uses a complicated bitmask, but effectively it's some way of saying which stores could potentially have affected this load, as in, where is it in the sequence of stores, so that I've got some way of tying things together and knowing that once all previous stores have been accounted for, I can do this load.

(00:58:40)

All right, so loading is really, really complicated, which is one of the other reasons why it's slow. It's not just the L1 cache, hey, it's three or four cycles to get to L1; it's got to do all of this as well. First of all, we do some prediction about the ambiguity, again out of scope, unfortunately. We have to wait for the address to be calculated, hopefully quickly. Then we have to check the store buffer. If we get a hit in the store buffer and there are no unresolved stores between that hit and this load, then we know that no earlier store can possibly affect the value other than the one we found, and we can go, hooray, we don't even have to go to L1. It's like an L0 cache: here you go, here's your number. If instead we miss, as in we don't find anything in there, but there are also no unresolved stores earlier than this load, then we know there's nothing in the queue that hasn't yet been flushed out to memory, so we can just issue the load now, and it will finally go to the L1 and start the whole process of reading from memory, and eventually the load completes and comes back and we send the result off to the program. It's a miracle that any of this stuff works, right?
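
One way you might watch the store buffer doing this (my sketch; exact penalties are microarchitecture-dependent): a load whose bytes are exactly covered by one earlier store gets forwarded straight from the store buffer, while a load that needs data from two separate narrower stores defeats forwarding and has to wait much longer.

    #include <cstdint>
    #include <cstdio>
    #include <x86intrin.h>

    // GCC/Clang-style type punning through a union, so we can store the same
    // eight bytes either as one 64-bit store or as two 32-bit halves.
    union Slot { uint64_t u64; uint32_t u32[2]; };

    __attribute__((noinline))
    static double forwarded(uint64_t n) {
        volatile Slot slot{};
        uint64_t x = 1, t0 = __rdtsc();
        for (uint64_t i = 0; i < n; ++i) {
            slot.u64 = x;            // one 64-bit store...
            x = slot.u64 + 1;        // ...matching 64-bit load: forwarded from the store buffer
        }
        uint64_t t = __rdtsc() - t0;
        asm volatile("" :: "r"(x));
        return double(t) / n;
    }

    __attribute__((noinline))
    static double not_forwarded(uint64_t n) {
        volatile Slot slot{};
        uint64_t x = 1, t0 = __rdtsc();
        for (uint64_t i = 0; i < n; ++i) {
            slot.u32[0] = uint32_t(x);         // low half
            slot.u32[1] = uint32_t(x >> 32);   // high half
            x = slot.u64 + 1;                  // load spans two stores: forwarding fails
        }
        uint64_t t = __rdtsc() - t0;
        asm volatile("" :: "r"(x));
        return double(t) / n;
    }

    int main() {
        const uint64_t n = 100000000;
        std::printf("matching store and load: %.1f ticks/iter\n", forwarded(n));
        std::printf("split stores, wide load: %.1f ticks/iter\n", not_forwarded(n));
    }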

(01:00:06)

And if not, if we're still in an ambiguous state where there are unresolved stores, we just have to wait: the load sits in the queue until all of those stores have been resolved. Which is why we try to predict whether aliasing is going to happen, and sometimes that's worth it. Okay, and there's a patent number here, which I'm sure you're not allowed to click on because of all the legal things, but let's not go there.

(01:00:44)

All right. The other thing the memory order buffer is involved in is all of those horrible locked instructions you might want to do, and fences. Store fences and load fences are operations that cause these buffers to drain before any further loads or stores can issue, that kind of stuff. And it's sort of vaguely alluded to, I'm going to put just the words total store order here, but my confidence in this has been irredeemably shaken by our conversations earlier. Where's Andrew? He was here earlier, I'm sure. Yeah, there he is. So there's definitely something that goes on here, but it happens later on.
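
To make the fence point concrete, here's a classic litmus-test sketch (mine, not the talk's) of the one reordering the store buffer does let you see on x86: each thread stores to its own flag and then loads the other's. With plain stores and loads, both threads can occasionally read zero, because their loads overtook their still-buffered stores; adding a full fence (or seq_cst operations) makes that outcome vanish.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    std::atomic<int> r0{0}, r1{0};
    std::atomic<bool> go{false};

    int main(int argc, char**) {
        const bool fenced = argc > 1;   // run with any argument to insert the fences
        int witnessed = 0;
        for (int round = 0; round < 100000; ++round) {
            x.store(0); y.store(0); go.store(false);
            std::thread t0([&] {
                while (!go.load()) {}                        // spin until both threads are ready
                x.store(1, std::memory_order_relaxed);       // plain store: sits in the store buffer
                if (fenced) std::atomic_thread_fence(std::memory_order_seq_cst);
                r0.store(y.load(std::memory_order_relaxed)); // plain load: may overtake the store
            });
            std::thread t1([&] {
                while (!go.load()) {}
                y.store(1, std::memory_order_relaxed);
                if (fenced) std::atomic_thread_fence(std::memory_order_seq_cst);
                r1.store(x.load(std::memory_order_relaxed));
            });
            go.store(true);
            t0.join(); t1.join();
            if (r0.load() == 0 && r1.load() == 0) ++witnessed;   // both loads beat both stores
        }
        std::printf("store/load reordering seen in %d of 100000 rounds\n", witnessed);
    }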

(01:01:14)

So finally we get to talk about execution. We are now exactly an hour since I started talking to you all, and we're going to talk about execution, which is luckily the most boring stage, because it does what the old CPUs used to do, right? It's pipelined, there are machine traps. Interestingly, this is where floating point assists happen, if anyone has ever hit the hideous performance cliff of denormals; there are some nods in here.

(01:01:51)

To the best of my understanding, what happens here is that the floating point unit can't handle numbers that aren't in the normal one-point-something times two-to-the-whatever form. So for very, very small numbers, below FLT_MIN or whatever the constant is, there has to be special casing, and the hardware can't do it, but it also can't determine that ahead of time. So by the time it gets to this point, the multiplier, the adder or the divider goes, oh crap, right, what do we do? We've got a number I can't deal with. It essentially has to flush the pipeline and go back to the microcode sequencer and say, can you just generate me the code that can handle this, please? You're like, what? It's dynamically undoing things, a deoptimization kind of thing, like a legit deopt. Yeah, it's crazy.
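
The denormal cliff is easy to reproduce with something like this sketch (mine; the exact penalty varies by core): keep a value pinned down in the subnormal range so every multiply needs an assist, then set the FTZ and DAZ bits in MXCSR via the SSE intrinsics and watch the cliff disappear.

    #include <cstdint>
    #include <cstdio>
    #include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
    #include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE
    #include <x86intrin.h>   // __rdtsc

    __attribute__((noinline))
    static double ticks_per_op(long iters) {
        volatile float x = 1e-38f;           // start just above FLT_MIN
        uint64_t t0 = __rdtsc();
        for (long i = 0; i < iters; ++i)
            x = x * 0.5f + 1e-44f;           // settles at a subnormal value, so the
                                             // multiply sees subnormal inputs every time
        uint64_t t = __rdtsc() - t0;
        (void)x;
        return double(t) / iters;
    }

    int main() {
        const long n = 20000000;
        std::printf("subnormals honoured: %.1f ticks/iter\n", ticks_per_op(n));
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);          // flush subnormal results to zero
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);  // treat subnormal inputs as zero
        std::printf("FTZ/DAZ enabled    : %.1f ticks/iter\n", ticks_per_op(n));
    }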

(01:02:43)

Anyway, the execution ports: we hope that they're balanced in terms of latencies. If I went back to the earlier slide you could sort of see it, all right, I won't do it now, but they try to make it so that a port doesn't have too many different instruction latencies. One of the things that can happen is that two instructions on the same execution port want to complete in the same cycle, say a three-cycle instruction and a five-cycle instruction that was issued two cycles earlier, and that would delay one of them. So they try to avoid that by keeping the latencies sensible, everything's a multiple of two over here and that kind of thing; there are some tricks to it. Gosh, where were we? So that's what I mean by that. And then, yeah, we talked about this: in the writeback stage the results are broadcast, they're written to the physical register file, which is their permanent home, and the instruction is marked complete in the reorder buffer.

(01:03:33)

All right, retirement. Retirement is the easy bit. I mean, yeah, we're all trying to retire, I know, earlier rather than later ideally, but here it means everything is finally done. The reorder buffer is essentially that ledger of all instructions flowing through the system, and the retirement system tries to pick up as many completed instructions as it can each cycle and say, okay, these are now done and their resources can be freed. And that's the other thing it does: the freeing. This is where renamed registers are given back to the system, now that the instruction has made it all the way through. But note it's not the register we just wrote to that gets freed; it's whatever physical register the architectural register pointed to before the rename, right back at the beginning of the pipeline. Once this instruction retires, its result replaces EAX completely: EAX used to be P88 and is now P50, so at the point it retires, P88 is free and gets chained back onto the free list. And that's also why this is simple, provided there's no aliasing and no one else has also got hold of P88 at that point.

(01:04:15)

This is also where exceptions actually get handled; we talked about them earlier as just a flag, and this is when the whole thing goes horribly wrong. Hopefully you're not taking too many exceptions in your code, and I don't mean C++ exceptions, I mean exceptions like machine checks, divide by zero and memory faults. The stores are also marked as senior in the memory order buffer at this point, because they're going to continue to live in there until they actually get flushed out to real memory. And the retirement system can do four micro-ops a cycle, which is very rarely a limitation, though you can prove it exists with very, very pathological code.

(01:05:02)

And then later, the last stage, which isn't really a stage at this point: the store buffer retires stores, in order, as they become senior. Senior means they can go out; it means the instruction that produced them has completed, so they're okay to be committed, and then some magic happens, something something total store order, some kind of liaison with the other cores on the die and/or the other subsystems. As far as I can tell, this is where the RFO, the read-for-ownership, happens; when we talk about cache line ping-ponging and false sharing, this seems to be the point where it bites. I'm honestly very vague on this, I was trying to work out where it could possibly happen, and I should probably talk to y'all afterwards and see if you have opinions, but it seems about right.
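
False sharing is the easiest of these to demo yourself; a sketch along these lines (names mine): two threads bump adjacent counters that share a cache line and spend their time stealing the line from each other via those RFOs, then pad the counters onto separate lines and the same work runs far faster.

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    struct alignas(64) SameLine { std::atomic<long> a{0}, b{0}; };   // adjacent: one cache line
    struct Padded { alignas(64) std::atomic<long> a{0};
                    alignas(64) std::atomic<long> b{0}; };           // 64-byte apart: separate lines

    template <typename Counters>
    static double seconds(long iters) {
        Counters c;
        auto t0 = std::chrono::steady_clock::now();
        std::thread t1([&] { for (long i = 0; i < iters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
        std::thread t2([&] { for (long i = 0; i < iters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
        t1.join(); t2.join();
        return std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        const long n = 50000000;
        std::printf("counters sharing a line: %.3f s\n", seconds<SameLine>(n));
        std::printf("counters padded apart  : %.3f s\n", seconds<Padded>(n));
    }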

(01:06:03)

Right, well, that's it. It's pretty straightforward, right? And to think I was going to talk to you about branch predictors as well. Where's the next slide? There it is, yes, a conclusion. There's not really a conclusion to this; the whole talk is the conclusion. This thing is an amazing edifice that's been built, and it's amazingly complicated. As for advice, and I feel like I'm teaching your grandmother to suck eggs here, you all know this: do things simply and straightforwardly. Don't do divides, integer divides that is; floating point divides are fine, I don't care about those. Small, aligned loops are good. Let the renamer do its work. Again, I wrote this before I had the conversation with some of your experts, and the call for more architectural registers was really interesting to me because I hadn't really thought about the implications, but my small loops seemed to work okay, so I'll stick with those. The renamer is really clever at doing its work, so you shouldn't need to worry about wanting more registers, with now a massive footnote: speak to these people.

(01:06:52)

Yeah, I'm sure you all have questions. I've got a whole bunch of references here that you can go and look at, and I'll make sure these slides are available to you all so you can look through them. Thank you very much indeed. All right, I've got the cube, is there anyone brave enough? I have swag as well: you can come and get stickers before you go, or I have one or two special coasters, which are not very exciting because they're made of cardboard, but I'm trying to bribe you to talk to me. Somebody speak, come on. Oh, there is a question. Hello. All right, I'm going to try not to hit anyone, but I'm sorry, I don't think I signed anything saying I wouldn't cause physical damage.

(01:07:20)

Audience: Are the physical registers like equally as wide?

Matt Godbolt: Good question. No. The physical register file is broken into different pieces; I thought I had the numbers on one of the slides, maybe I did, but I didn't talk about it. There are different numbers of physical registers for different types. There are several hundred flags registers, because they're small and pretty much every instruction affects the flags, yet almost every subsequent instruction doesn't give a shit what those flags were, so you want to rename them away and make sure there are no false dependencies. For the AVX vector registers, there seems to be a batch of full-width, 512-bit physical registers, and the renamer seems to want to use those only when it knows it needs them; otherwise it tries to use 256-bit ones. Again, it's trying to save real estate. I'm waving my hands a bit here because I haven't experimented with this personally, but from the literature there's some kind of cleverness about the allocation of those. And then obviously there are the 64-bit general purpose registers, of which there are just a couple of hundred.

Audience: But they’re at least 64 bits wide?

(01:08:25)

Matt Godbolt: They are at least 64 bits, yes, for each of the general purpose types. Another thing to know: I was very cavalier about saying RAX and then showing things that had EAX on them and all that kind of stuff. We all know they are, quote, the same register. It used to be the case, again around the Sandy Bridge era, that if you wrote to, say, AX, the renamer allocated you a new physical register for just that part, and then it also emitted a micro-op that effectively AND/OR-ed it together with the remaining part of the old version of that register, kind of combining them. That seems not to be the case anymore; it just seems to handle it, magic happens, and if you write to the bottom 16 bits you don't have to do any extra work. But yeah, they're all 64 bits wide, is my understanding. Good question. Okay, do you wanna try and pass the mic over rather than... excellent.

Audience: Do you have much of a sense of how much variety there is in the design between like these Intel CPUs and AMD or even like other high performance CPUs?

(01:09:28)

Matt Godbolt: Not really, actually, no. My sort of childhood experience was all ARM, but that was very, very old and not very sophisticated, and since then I've only really been concentrating on the Intel CPUs. That having been said, one thing I know is different in AMD is the branch predictor; that's something I've done my own research on, and it has a very different ethos. But I don't know about these other aspects of it. Maybe someone else in here will, but no, sorry for the non-answer. And there's a question there.

Audience: I've heard two perspectives on whether the ISA matters: is ARM faster because of all the non-ISA things, or is it faster because of the ISA? What's your take?

(01:10:10)

Matt Godbolt: So I think, again, in those halcyon days when ARM was actually a RISC processor and I could disassemble it by looking at the instruction bits because they were that straightforward, it probably had an edge against the contemporary x86 designs. But I think that ship has sailed. With all of the new whizz-bang features that ARM can do now, they have just as complicated a set of decode steps and prediction and pipelines and out-of-order and all that kind of stuff. So I don't know that it makes that much difference; it's more of an implementation detail of the front end, how the actual adds and subtracts make their way into the ALU. I guess the one argument could be that the x86 ISA is sort of a poor man's compression, so maybe you get a little bit better I-cache usage, but we've all seen 15-byte instructions as well, and you're like, I'm not sure that's better. So yeah, another non-answer for you there, I'm afraid. Question at the front, I think, if you wanna pass it carefully, here we are, lovely.

Audience: How did anyone figure any of this out? What kind of crazy experiments are people running to work all this out?

(01:11:20)

Matt Godbolt: Yeah, huge shout out. The OG for all this is Agner Fog, who is a Danish, or, as I found out last week, now retired, anthropologist, which is exactly the kind of person you'd imagine; he seems to be more of a social anthropologist than a technology-focused one. He did a whole bunch of experiments to reverse engineer these things. The short answer is that there are performance counters, those performance counters have revealing names even if Intel doesn't really want you to know what's going on, and you can use them and infer things from how they behave. My own research earlier was about answering the question, you know the XKCD where someone is wrong on the internet and you've got to go? It was one of those. Someone made some sweeping statement that backwards branches are always predicted taken. I'm like, really? Always? What if you don't take it? What if it's a conditional branch that's backwards? The long and short of it is, yes it is, and some unconditional ones behave differently from conditional ones, and so on. But the measurements relied on two things in the Intel documentation about early resteers, and they don't tell you what a resteer is; it's just, oh yeah, the number of times the pipeline was resteered early, and you're like, okay, and then there's a late resteer counter and some other thing. My inference, backed up by the evidence I got and the conclusions I drew, was what I said at the beginning: you don't even know if there's a branch in a 16-byte block of memory, right? So the BPU and the fetcher are just gonna go 16 bytes, 16 bytes, 16 bytes. But as soon as a branch is detected, really early, in that decode stage, it can send a message to say, whatever you are doing is probably wrong, because you didn't tell me there was a branch here and there is one and it should be taken. That's an early resteer, so it's only about three cycles of latency and then you're already back on the right path.

(01:13:27)

And so by playing around with that, plus by branching forwards and backwards, timing it and seeing how far apart instructions were, that's what those color pictures were, you can infer some of the patterns of what's going on. Then for the much more complicated stuff, like the micro-operations, there are counters for how many micro-ops have been executed at each stage of the pipeline and on each port. So you can run experiments where you're just adding over and over and over again, and say, oh yeah, the counts are going up on ports zero, one and seven, so those are the three that can do this operation, and so on and so forth. And then you build huge dependency chains, try to do some things behind the back of those dependency chains, and see which other counters go up: oh, these show we hit the renamer but it wasn't able to rename any more, things like that.
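
For a taste of the plumbing behind those experiments, here's a sketch (mine) using Linux's perf_event_open with the generic retired-instructions event, which every core exposes; the real investigations program the model-specific port and uop events by raw event code instead, which I won't guess at here. Error handling is kept minimal.

    #include <cstdint>
    #include <cstdio>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    static long perf_open(uint32_t type, uint64_t config) {
        perf_event_attr attr{};                 // zero everything, then fill in what we need
        attr.size = sizeof(attr);
        attr.type = type;
        attr.config = config;
        attr.disabled = 1;                      // start stopped; we enable it explicitly
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;
        return syscall(SYS_perf_event_open, &attr, 0 /*this thread*/, -1 /*any cpu*/, -1, 0);
    }

    int main() {
        long fd = perf_open(PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS);
        if (fd < 0) { std::perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        volatile uint64_t x = 1;                // the code under test: a simple dependent chain
        for (int i = 0; i < 1000000; ++i) x = x * 3 + 1;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        uint64_t count = 0;
        if (read(fd, &count, sizeof(count)) != sizeof(count)) return 1;
        std::printf("instructions retired: %llu\n", (unsigned long long)count);
        close(fd);
    }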

(01:14:08)

So there's a whole bunch of experiments, and Travis Downs has probably got the best body of work on this. This is his blog, and he also writes something called uarch-bench, if I'm getting that right, which is a whole suite of microarchitectural benchmarks. Agner Fog has his own stuff as well. I've got my own branch of that, which is what I used for the BTB stuff, and which I've now made compile again; it was broken as hell.

(01:14:58)

So yeah, and then there's Andreas Abel and, I can't think of the other person on his paper, but I think it was his master's thesis. They spent a whole bunch of time trying to work out the algorithms behind the renamer, which ports get allocated at what time, and how long things take, and they found a whole bunch of things that no one else had ever noticed before, which is just bonkers. And I love the fact that we can experiment on something I'm carrying around with me all the time; you wonder how it works, and you can go and find out. So the short answer is all those weird counters and very careful timings. The sadness about this, and again we talked about it earlier, is that most of these things only work if you've actually got a physical machine in front of you, which obviously I do in the case of a laptop, and I do have my desktop at home, but I tried to get a Granite Rapids machine, and it's expensive enough to hire them, and then I get to my Amazon account and I've got it running and I'm like, sudo this, and it says, oh, there are only three hardware counters on this. I'm sure there are more than three hardware counters. Ah, it's virtual, isn't it? Ah! So I couldn't do the measurements I wanted to on those. And like I say with Magic Trace, right, you need a real, actual, honest-to-goodness computer to run it on in order to get an answer. Any other questions? Thank you, by the way. Ah, you've got one next to you and then we've got one behind you.

Audience: Do you have a favorite x86 instruction?

(01:15:58)

Matt Godbolt: Do I have... oh no, I'm very simple actually. I quite like, there's one of the parallel comparison things that I always forget, but when I see it, it makes me chuckle, and I can't think which one it is, it's like, what, PCLMEQ-something-something? No. Yeah, good question. It's just the kind of thing I should have a good answer for, but I do not know.

Audience: We’ll accept least favorite too.

(01:16:20)

Matt Godbolt: Least favorite, ah, least favorite. No, I wish I had a pithy and amusing answer for you, but you've put me on the spot. They took one instruction away that used to be used for binary coded decimal and didn't bother porting it to 64-bit, and I'm like, come on, you have so much legacy crap and you didn't bring this one over? And it's got a funny name that I can't remember now. But anyway, yes sir.

Audience: Presumably we can use this to make our code fast, and Intel wants their CPUs to look very fast. So why is this all so hidden and impossible to describe? Why don't they just tell us more?

Matt Godbolt: I don't know. I mean, again, it may well be that within this building there are people who have access to information that I don't. I'm certainly aware, from my time at Google, that if you buy a certain volume of chips from Intel, there's a sliding scale of how friendly they'll be with you, going from, we can send somebody to come and help you debug your problems, through to, we can fabricate your own specification of chip and here are the designs. They had the pink books, I think, and the yellow books; they were different names for different levels of classified information.

(01:17:56)

And certainly when I had some very off-the-record conversations with some contemporaries around the Spectre and Meltdown times, they were like, well, yeah, the pink book talks about that. And I'm like, whoa, what are you talking about? And they're like, have you not got those? I'm like, no. And they're like, oh, we shouldn't have said anything. I'm like, oh my god, even knowing that they exist. So I think some people do have access to this information. But to answer your actual question, I think they don't want to promise things about how it works because, what is it, Hyrum's Law: anything that can be relied upon will ultimately end up being relied upon. And of course that's why we still have instructions from the 1970s, although not my friend, the binary coded decimal adjustment. So they don't want to paint themselves into a corner, and presumably there's an intellectual property aspect to this as well, even though it gets reverse engineered. The branch predictor in particular, I think, is one such thing, because I do know the AMD one is very different from the Intel one. And the folks who reverse engineered the Skylake branch predictor found a really interesting case. Again, I was going to do some more slides I could flick to, but unfortunately I didn't get around to it.

(01:19:12)

But effectively, the predictor does various smooshing and hashing to come up with a fingerprint for this particular branch and the sequence of operations that led to it, so that it can have a table that says, hey, the last time we were in this state at this branch, we took the branch, that kind of thing; that's what's going on in a branch predictor, right? These researchers discovered that bit five of the branch address was essentially a direct map: if it was clear you were using one half of the entire branch prediction system, and if it was set you were using the other half. Effectively the hash didn't mix in bit five. And they used that to build a sandbox, Spectre and Meltdown style: if you are a JIT compiler, you can contrive the code that you generate so that branches only ever sit at addresses with bit five set, and then make sure your supervisor, and all the checks about whether you're in bounds or not, has bit five clear. Now you effectively have two branch prediction domains, and you're not subject to Spectre and Meltdown. How cool is that?

(01:20:01)

And I'm like, yeah, if they'd told us that, we'd have used it, but then of course they'd be like, oh shit, now we have to keep it this way, and we probably didn't even know ourselves that bit five was leaking out like that. So I think it's a bit of everything. What is it, Hanlon's razor, never ascribe to malice that which is adequately explained by incompetence? No, I don't think Hanlon's razor applies here; I think very careful decisions have to be made about this kind of stuff. We are obviously in a very sensitive intellectual property environment ourselves, so walking the line between the things they think are useful to share and the things most people don't need to know but could be a competitive advantage would be my biggest guess. Plus, just documenting it would be a pain. Any more questions? Okay, I'm being waved at from the back, as in, we're out of time. So once again, thank you all for inviting me out here. I've had a wonderful time.
