February 26, 2020
A Taste of GPU Compute
The GPU in a modern computer (desktop or mobile) has about an order of magnitude more raw general purpose computing throughput than the CPU, but is barely used except for games, machine learning, and cryptocurrency mining, largely because it’s hard to program. Getting good performance entails massive parallelism (thousands of threads) and careful attention to an explicit memory hierarchy. While GPU compute programming may seem to many like a black art, techniques from functional programming, including immutable data structures and an emphasis on associative operations (monoids), adapt very well to this space. This talk will present techniques used in piet-metal, a new 2D graphics rendering engine optimized for modern GPU compute, and hopefully will serve as an invitation for the curious to explore this strange but powerful computing environment.
Raph Levien is currently an independent researcher exploring 2D graphics, font technology, and GUI, doing most of his work in Rust. Previously he was at Google for 11 years, including on the Android UI toolkit team, mostly working on fonts and text, and before that was the maintainer of Ghostscript. He has a PhD in Computer Science from UC Berkeley, on the topic of curve design tools.
Thank you so much, and thanks to Jane Street for hosting me. Yeah. So this talk is really a taste of GPU compute because the topic of how you program for GPUs I’ve discovered to be quite deep and technical and complex, but I want to motivate why it’s interesting and a little bit of how to do it and how this landscape is evolving.
So, I’m going to start by talking about the GPU hardware, that GPU hardware is pretty different. There’s some ways in which it’s similar to ordinary computers that you’re familiar… CPUs you’re familiar with, but there’s some ways in which they’re very different. And in order to get good performance out of GPUs you really have to understand the hardware architecture of how they’re built. Then I’m going to propose that a good approach to programming GPUs is functional programming, that you break your problem down into things that really are functional and the dependencies between the state are explicit.
And I’ll give an example. My interest in this is really motivated by trying to build a very high performance, high quality 2D graphics renderer that runs on the GPU. I won’t go too deep into that application, but I’ll definitely borrow from that when I’m explaining how to exploit the compute capabilities of the hardware. And then I’m going to go a little bit into, once you’ve written some code, how do you actually get that to run on the hardware? And that is, unfortunately, still a little bit trickier problem. It’s not a completely solved problem. And so I’m going to talk a little bit about the way that space is evolving as well.
So why do you even want to run code on a GPU? And the reason is that we are seeing changes in the way that chip manufacturers deliver performance, that we’ve always been able to rely on Moore’s law, the increasing number of transistors on a chip, but around 15 years ago Dennard scaling, which basically meant that chips were also getting faster at an exponential rate, stalled out. And so basically that means that single-core performance is done, it’s not improving anymore, that the only way to get really high amounts of throughput on modern computer chips is to increase the parallelism. And obviously GPUs increase the parallelism considerably more than CPUs.
So this is a chart of basically the raw throughput in billions of floating point operations per second that are the… The blue line is the high end Intel Xeon server chips, and the green and red lines are Nvidia and AMD GPUs. And they show that it’s a little bit bumpy, but fairly consistently there’s about an order of magnitude, more raw floating point operations per second that you get out of GPU hardware compared with CPU hardware.
And I think this also really tells you the story, that the throughput per dollar is also in the ballpark of 10 times more for a GPU than for a CPU, and that’s very consistent, whether you’re on mobile or desktop or high-end servers. And also the story about throughput per watt is pretty similar because watts and dollars correlate very strongly.
This slide is from a Stanford AI talk, and they’re motivating the use of GPUs for doing machine learning workloads. And I think that those have really gone hand in hand, that machine learning requires huge amounts of compute bandwidth. And if you were limited to what you could do on CPU then you would really be much further behind than using GPUs.
So, I like looking at the hardware to understand how this stuff works, and one way to understand the relative importance of different functions on a CPU chip is how much chip area is given over to those. And so this is a… not necessarily a mainstream chip. This is a chip that would be on a laptop. It only has two CPU cores, but it has a pretty beefy GPU capability. So if you’re going to play games on a laptop, this is the chip you want.
And you can see that the vast majority of the area of this chip, especially when you look at… I’ve blanked out the ones that are not available to the GPU, but all of the memory and caches and those kinds of interconnects to the rest of the system, all of that is available to the GPU as well. So this really is a GPU with a little bit of computing added on the side as an afterthought.
And then here’s another picture. This is the latest. This is the Ice Lake chip, which is the 10 nanometer generation out of Intel. And again, you have a very large fraction of the area of the chip given over to GPUs. I’m going to tour this a little bit, and I’m going to talk about, first, the execution units. You can see a regular grid of these, and on this chip there’s 64 of them.
And then, well, I’ll show you the execution unit, but as I go through the hardware, a little bit of a disclaimer, that the variations from chip to chip trying to understand all GPUs is really complicated and really messy. So what I’m going to be presenting today is a somewhat fictionalized, somewhat simplified view. So I’m going to use round numbers, I’m not going to talk about all of the details, and I’m also explicitly not talking about mobile GPUs, because those are sufficiently different than the desktop. There’s integrated and discrete desktop. They’re not as different, but mobile is pretty different.
So I’m going to start with the ALU cores, which are those execution units. And then I’m going to work out from there into the way those access memory and organize into larger computational units. So this is one of those execution units on an Intel GPU. So there’s going to be between 24 and 64 of these execution units on a chip.
And at the top of the chip you have the state of the execution threads. So each execution unit has seven of these. It’s an odd number, but there you go. And each execution thread has a program counter, where you are in the program, and some other state related to program execution. And then it has this register file. And a register file is 128 registers per lane. There’s eight lanes to an execution thread. So each one of these has actually got 4k or four kilobytes of register file attached to it.
And the way this works, I think it’s really… To me this was an aha moment of how do GPUs execute large numbers of threads differently than the way that a CPU would? Because a CPU is really focused on one thread and then having a huge amount of infrastructure of a pipeline and register renaming and superscalar stuff that’s trying to really slice and dice your computation to get as much single thread performance as possible. And that’s not what GPUs do. GPUs are all about getting throughput, all about exploiting parallelism to hide the latency.
So the way this works is that there’s this thread arbiter, and this thread arbiter matches the execution threads with the units of work that can be done. So at every turn through the… every cycle or every two cycles actually, then you’ve got these units at the bottom that can actually do work. So you’ve got two ALU units, so those are adders, multipliers, and other things like integer instructions. And they’re SIMD, so those are working on multiple values at a time. And the thread arbiter basically says, okay, which of those threads are ready to run, which ones are waiting for an ALU unit? And it matches them up. You can think of it as a crossbar in that way.
And then there’s also a branch unit which is going to decide the next program counter. And then there’s also a load/store unit, and that’s also a little bit different than the way it works in a CPU, that when you want to access memory it actually assembles a message and sends that message out, and then it goes into a wait… If you’re loading memory it goes into a waiting state and waits for that message to go out to the rest of the world where the memory lives, and then the response comes back and said here’s the memory you want it to load. And that gets put back up to that thread and makes that thread able to advance to the next thing.
So there’s lots of threads. This thing is not… it’s not like a CPU where there’s one thread and then there’s a lot of cores, that there’s actually a very large number of threads that are able to run. And the job of the thread arbiter is to keep those units as busy as possible, particularly the ALU units, that you want to keep those busy and keep those able to do useful work. So everything that happens inside an execution unit has a latency of about two nanoseconds, round numbers of course, but it’s pretty fast that you send that request to the ALU, it does its work, it comes back, gets stored in that register file in about two nanoseconds.
So now when we go out from the load store… Oh, sorry. Wrong slide. So one of the things I want to talk about here is that these are SIMD, that this execution thread is actually evaluated something like 8, 16, 32, or 64 lanes wide. And that number really depends on which hardware you’re on. Intel is 8 or 16, Nvidia is 32, and AMD is 64 lanes wide. And so when you run through the program it isn’t doing scalar values, it’s doing a vector that is a lot of lanes wide. But when you write your shader code you don’t write it in this explicit SIMD style, you write it in a style that says what happens on each thread.
And the terminology is so confusing. You usually talk about threads in shader language, and those are different than execution threads in the hardware. So each thread might have code that says evaluate some predicate, compare, if that’s true do A, else do B, and on a CPU it would basically branch, do either one or the other, depending on the value of that predicate.
But when you compile this code down to hardware, what happens on a GPU is that you’re executing in these wide vectors, so some of those predicates might be true and some of those predicates might be false. If they’re all true, then it just does the A branch, if they’re all false, then it just does the B branch, and if some of them are true and some of them are false, then it does both in sequence. It does the A branch but masking only the active lanes within that execution, and then it does the B branch, but again, only the ones that are active because the predicate was false.
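To make that masking behavior concrete, here is a minimal Python sketch of how a branch executes across SIMD lanes; all the names here are illustrative, not from any real shader compiler.

```python
# Sketch of how a GPU executes a per-thread branch across SIMD lanes.
# Both sides of the branch run if the lanes diverge, each under a mask.

def simd_if_else(predicate, do_a, do_b, values):
    """Execute `if predicate: A else: B` the way SIMD hardware does:
    run each side only for the lanes where its mask is active."""
    mask = [predicate(v) for v in values]
    out = list(values)
    if any(mask):                 # A side runs if any lane took it
        for i, active in enumerate(mask):
            if active:
                out[i] = do_a(values[i])
    if not all(mask):             # B side runs if any lane did not
        for i, active in enumerate(mask):
            if not active:
                out[i] = do_b(values[i])
    return out

lanes = [1, -2, 3, -4, 5, -6, 7, -8]     # one 8-wide execution thread
result = simd_if_else(lambda x: x > 0,
                      lambda x: x * 10,  # A branch
                      lambda x: -x,      # B branch
                      lanes)
print(result)  # [10, 2, 30, 4, 50, 6, 70, 8]
```

If every lane takes the same side, only that side's loop body does work; when lanes diverge, both sides run in sequence, which is exactly the cost the next paragraph calls divergence.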
And when they do not all take the same direction of a branch, that is called divergence. And if the divergence gets bad then your what’s called work factor also gets bad, where work factor is the ratio of the amount of work the hardware is doing versus how much work is inherent in the problem. So all of those lanes that are not running because the predicate is non-active are idle, they’re not doing real work.
So this is definitely one of the tricky things, because if you use Shadertoy you’ll write what’s on the left side of the screen and you won’t really be aware of what’s going on in the hardware, but when you’re trying to get performance out of this thing you start really having to be aware and really trying to make that divergence as low as possible. So, going back to the hardware, that you take these execution units… So one of the themes here is that there’s a hierarchy, that you can really think like lane, SIMD, execution unit, and then Subslice, and then there’s going to be a couple of other stages of hierarchy beyond this.
And so, on Intel you get eight execution units into what’s called a Subslice, and then that Subslice is sharing a lot of resources like the instruction cache and the local thread dispatcher, and it’s also sharing access to memory. You’ve also got a texture sampler, which is something that might be interesting, because one of the things that GPUs do really well that CPUs have been kind…
When I was a kid, when you want to do a computation it’s always faster to just use a lookup table and read that out of memory than to do the computation, but over time memory has gotten relatively slower compared with computation, or maybe you’d say computation has gotten faster at a faster rate than memory look up. But on GPU we’re going back to those days a little bit, that if you want to do a memory access there’s this texture sampler. And it also does linear or bilinear interpolation almost for free. It’s special purpose hardware that’s doing that computation, and it has its own L1 and L2 caches to make that really fast.
But in any case, this Subslice is also what you do to go out to memory, so all these things are sharing that resource. And all of the operations that you’re seeing here have a latency in the maybe 100 nanosecond range. So, again, when we talk about that hierarchy, that each time you go out in the hierarchy you’re going to have more latency.
So, continuing on that theme, you’ve got those Subslices, now you arrange those. So there’s maybe three Subslices in a slice if you’re talking about the chip in this computer here. And those are going to share now your fixed function units, which is your thing that draws triangles and all those kinds of GPU things, which, for the purpose of this talk, we’re not going to care about at all. But what is relevant is now we start talking about, how do you access main memory? And that goes through a main memory bus.
And there is also a thread local storage memory at this point, and I’ll be talking quite a bit more about that as the time goes on. And everything at this level of the hierarchy has a latency in the ballpark range of 300 nanoseconds or so. And then there’s slice, there’s actually two slices on a chip on this particular configuration, but that’s variable. This is one of the things that you can have small, cheap, low power configurations, and then you can have big, massive ones. And it’s really just a question of taking these resources and piling more and more of them into the system.
So this hardware maps to different workloads. There’s different workloads that you can do, like rasterization, drawing triangles, and that maps the shaders in those to the compute units. But we’re going to be focusing almost entirely on compute dispatches. And the key thing about compute dispatches is that the computation is organized into grids, and it’s organized into a grid of two levels of hierarchy which correspond to the chip.
So first you have a grid, and this can be a 1D, 2D, or 3D grid. That is fixed. If you wanted to do a 4D grid you would need to fit it into that, that the hardware is actually doing X, Y, Z axes for each thread or thread group in the dispatch. So you really have to take your computation and say I’m going to slice it into a grid. And then each thread group in that grid is going to be some number of threads, or you might call them SIMD lanes. And ideally that’s going to be between 32 and 1,024 to get the most parallelism out of the system.
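As a CPU-side sketch of that two-level structure, here is how a 1D compute dispatch enumerates its threads; the function and parameter names are illustrative, not from any real GPU API.

```python
# A sequential simulation of a 1D compute dispatch: a grid of thread
# groups, each containing `group_size` threads (SIMD lanes).

def dispatch(grid_size, group_size, kernel):
    """Run `kernel(group_id, local_id, global_id)` for every thread
    in the grid, the way a compute dispatch enumerates its work."""
    for group_id in range(grid_size):
        for local_id in range(group_size):
            global_id = group_id * group_size + local_id
            kernel(group_id, local_id, global_id)

# Example: double every element of a buffer, 4 threads per group.
src = list(range(8))
dst = [0] * 8
dispatch(grid_size=2, group_size=4,
         kernel=lambda g, l, i: dst.__setitem__(i, src[i] * 2))
print(dst)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

On real hardware the groups run concurrently and the lanes within a group run in SIMD lockstep; only the indexing scheme is the same as this loop.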
So, when you program shaders it looks like C, it looks like a very imperative program, and it looks like all of the accesses that you have to memory are very imperative, store this, load that. But in trying to wrap my head around how do you program these things effectively, that I have found that treating it as a functional program is really useful. And I think when the socialist revolution arrives and they decide that financial technology is not really providing value to society, that it could be, you could take the skills of OCaml functional programming and adapt them to programming GPUs.
So, you’re still taking your program, the task that you’re trying to do, and breaking that down into a grid of dispatches. And then inside each thread group you’re reading something from an input buffer or from multiple input buffers, you’re doing computations within that thread group, and within that thread group the threads can communicate with each other. They have that thread group local memory, so they can communicate with each other and synchronize. And then when you’re done with that you write that to an output buffer. And so you really want to think of that as a functional program, that each output buffer that gets written by one of these thread groups is a function of the inputs.
And then you stitch these together into a graph, and that graph is then… You present that graph to the GPU and the GPU steps through the graph and schedules all of these thread groups. So it’s interesting that when you look at this kind of structure of functions of larger units of buffers of computation, that really matches the machine learning workloads very closely. I think that really motivates why machine learning is able to use this GPU power more effectively; it has a lot to do with the fact that you can take those machine learning problems and map them to these grids and these graphs of dispatches in a very easy way.
And I think that to the degree that we can be successful using GPU compute power in other problem domains is like, can you do that, can you take your problem and break it down into threads and dispatches? So when you’re trying to figure out how to do this performantly, that you’re completely ruled by the memory hierarchy.
And, in particular, when you go back to this, if you’re analyzing how much performance can I get out of this problem, one of the main things you want to look at is that every write from one of those compute dispatches, from one of those thread groups into a buffer, that has a cost. All of your global memory write bandwidth from the compute unit into memory, you can just add that up, and that tells you a lot about what the performance is going to be.
And so you really want to structure it so that as much of the work is done in the top of this chart as possible, that you’ve got this hierarchy that close to the execution units you’ve got incredibly high bandwidth, you’ve got very low latency, and you also have fairly small amount of total capacity, that each individual lane only has access to 512 bytes, and that’s pretty limited. You have to really focus in pretty deep to some small component of the problem. And then on the other side of that you have global memory, and relative to the amount of bandwidth and relative to the amount of latency it’s a lot lower bandwidth, a lot higher latency, but of course now you have access to gigabytes of storage.
So to really use these things effectively you also need that middle level, and you have to address that explicitly, the thread group shared memory, which is halfway between, that you’ve got latency that’s a lot better than global memory and bandwidth that’s somewhere between ALU operations and global memory. And then, similarly, the amount of storage that you have access to, it’s pretty limited. In the chip design these are static RAMs and you want them to be close to the computation, so you really only get 64 kilobytes at most per thread group for the thread group shared memory.
So, as I said… Oh, yes. Oh, I think they want us to use the microphones, if you want to step up to that, for the questions.
I can ask later.
Ask later. That’s fine. Thank you. All right. So, as I said, my motivation for getting involved in this at all is that I want to do 2D rendering, and I want to do 2D rendering better than the existing solutions that are out there.
And there’s a lot of 2D renderers, for example, Direct2D on Windows or Skia, which is used in both Android and Chrome, that the way that these work is that you use the fixed function pipeline, that you map the scene that you’re trying to draw into triangles and those kinds of things. You do the blending in the blend unit, which is another one of those fixed function things. And I understand why you do this, that it works well on older GPUs. These compute capabilities are relatively new, I’ll talk about that a little bit.
But there’s trade-offs, that if you’re using the hardware for antialiasing it might be very limited, it might not give you in all cases very good results. And the performance is also trade-offs, that if you’re doing a lot of blending and soft masking, that’s expensive because the way that works in this old style fixed function pipeline approach is that if you need to blend something, you make a buffer, you draw into that buffer, and then you do another operation where you read from that buffer and you blend into the main buffer that you’re drawing into. So if there’s a lot of blending and overdraw, then it can be quite expensive.
So what I think of as being the new style, and we still need to validate whether this really works better, but the experiments so far are encouraging, is that you do it all in the compute unit, that basically the way you think about this, if you’re doing a lot of complex operations with blending, that you do all that work in compute, and then you just write that one pixel out at the end, no matter how much intermediate blending needed to be done.
But of course it doesn’t come for free. The downside of that is that you have to do the work on the GPU of understanding the data structures to represent the scene, that the CPU is going to upload the scene in some way, and the GPU needs to traverse it and get those data structures to do its computation.
So, I have a prototype. It’s called piet-metal. And I’m going to go into a little bit more detail in the next few slides about one particular problem which I think of as the core or the heart, and I think it explains the flavor of, how do you do this on a GPU differently than the way you would do it on a CPU?
So you can think of a super simplified model of 2D rendering as just having a list of objects, actually, that’s going to be a tree, but we can ignore that, or a digraph, because you can have sharing, and you can think of each object as having a render method, which is just a way of saying at this X, Y coordinate, what is the color? And then when you blend all of the objects in the scene together you use the Porter-Duff over operator, which is just a way of blending a semi-transparent color on top of some other semi-transparent color. It’s a nice associative mathematical operator.
And so you can think of this as a relatively simple functional program that you map over all the pixels in a width times height rectangular grid. And for each one of those pixels you do a fold where you map over all of the objects in your scene, and for each one of those objects you render a pixel color, and then the fold is going to composite those pixels together. And that’s the result pixel which you’re going to write into a buffer, and that’s what’s going to get displayed on the screen.
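That map-and-fold structure can be written out directly; here is a toy Python sketch of it, using premultiplied-alpha colors and illustrative names (objects are just closures from coordinates to a color).

```python
# A toy version of the functional rendering model: for each pixel,
# fold the Porter-Duff `over` operator across every object's color.

def over(src, dst):
    """Porter-Duff `over` for premultiplied (r, g, b, a) colors:
    src composited on top of dst."""
    k = 1.0 - src[3]
    return tuple(s + d * k for s, d in zip(src, dst))

def render_image(width, height, objects):
    clear = (0.0, 0.0, 0.0, 0.0)
    image = []
    for y in range(height):
        row = []
        for x in range(width):
            px = clear
            for obj in objects:          # fold over all objects
                px = over(obj(x, y), px)
            row.append(px)
        image.append(row)
    return image

# One opaque red object covering only pixel (0, 0).
red_dot = lambda x, y: (1.0, 0.0, 0.0, 1.0) if (x, y) == (0, 0) else (0.0,) * 4
img = render_image(2, 1, [red_dot])
print(img[0][0])  # (1.0, 0.0, 0.0, 1.0)
print(img[0][1])  # (0.0, 0.0, 0.0, 0.0)
```

Because `over` is associative, the inner fold could itself be evaluated as a parallel reduction, which is the property the next slides lean on.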
So, that looks promising. Let’s just run this on our hardware and see what happens. And we find that the parallelism of this functional program is excellent, that you’ve got a massive amount of parallelism, that every pixel can be computed completely independently of all other pixels, and also all of your objects can be rendered. And you do this fold, and even the fold can be done in parallel because it’s an associative operator. So the parallelism is incredible.
And the write bandwidth is also excellent because this thing, we’re going to run this thing, and it’s going to generate a pixel, and we’re going to write that to the buffer. And that is the absolute minimum amount of write traffic that you could possibly be doing. But it’s still not going to perform well, and the reason why is that for every pixel in the scene we are reading all of the objects in the source scene and we’re computing, basically, whether it’s contributing to the final rendered image. We’re still reading the object, and in most cases you’re going to get a blank pixel and the over operator is going to be identity.
So the first thing that we want to do is we want to be aware of bounding boxes, that most objects don’t touch most pixels. And so from this mathematical functional approach to this we’re going to say that there’s an object bbox, and if an X, Y coordinate is not inside that object bbox, then that’s just telling us that the render of that pixel is going to be zero, is going to be blank. So you can see individual objects from this thing have bounding boxes that are much smaller, hopefully, than the scene as a whole.
So, the easiest way to make use of those bounding boxes is as a purely sequential program. So we can look at this. And again, like in this simplified version, that we start with a buffer that’s clear, a rectangular buffer of width times height pixels, and then we’re going to go over the objects sequentially. And for each object we are going to then run over all of the X, Y coordinates inside that object’s bounding box, and for each one of those we’re going to call the render function on that object. And then we’re going to do the pixel blending operation. We’re going to read the previous state of the buffer, do the pixel blending, and then write the composited pixel into that.
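This sequential bounding-box renderer is short enough to write out as a Python sketch; the data layout (bbox tuples paired with render closures) is illustrative, not piet-metal's actual representation.

```python
# The sequential, bbox-aware renderer: one pass over objects, touching
# only the pixels inside each object's bounding box.

def render_sequential(width, height, objects):
    """`objects` is a list of (bbox, render) pairs, where bbox is
    (x0, y0, x1, y1) and render(x, y) returns a premultiplied color."""
    buf = [[(0.0, 0.0, 0.0, 0.0)] * width for _ in range(height)]
    for (x0, y0, x1, y1), render in objects:
        for y in range(y0, y1):
            for x in range(x0, x1):
                src = render(x, y)
                dst = buf[y][x]                 # read previous state
                k = 1.0 - src[3]               # Porter-Duff over
                buf[y][x] = tuple(s + d * k for s, d in zip(src, dst))
    return buf

square = ((1, 1, 3, 3), lambda x, y: (0.0, 1.0, 0.0, 1.0))
buf = render_sequential(4, 4, [square])
print(buf[1][1])  # (0.0, 1.0, 0.0, 1.0) -- inside the bbox
print(buf[0][0])  # (0.0, 0.0, 0.0, 0.0) -- never touched
```

Note how each object is read once for the whole pass, which is what makes the read bandwidth in the next slide's scorecard so good.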
So how are we doing? This is almost the opposite of the highly parallel functional version. The parallelism is terrible because we’ve written this as a purely sequential single-threaded program, but the work factor now is fantastically good, and the read bandwidth is fantastically good because we are only reading each object out of that buffer once for the entire computation. And then, generally this is going to perform well. If you’re running this on a single threaded CPU, this is pretty much the way you do it.
And then there’s this one thing where the write bandwidth… I’ve put it as not great. And that depends on the scene. If you have a scene that has a lot of objects that are not overlapping and do not have a lot of blending, then this is actually pretty good, because you’re generally going to touch those pixels once and that’s very efficient. But if you have a lot of objects on top of each other or doing blending with each other, then you’re going to be reading from global memory, doing the render, doing the blend, and then writing the global memory, and that’s not going to be great, even on a single threaded CPU.
So, let’s try and use this insight of bounding boxes. In the sequential case it was pretty straightforward, we get the bounding box, and then we only run over the pixels in the bounding box. Let’s do that in the functional program. So this is very similar to the program we had before except that instead of running over all objects we’re going to do a filter where we go over the objects, we determine whether the pixel rendering is inside the bounding box.
So now we’re going to get basically a list of objects that intersect that one pixel, we’re going to render those and blend those together. And it doesn’t help. And the reason it doesn’t help is that doing that filter operation is pretty much just as expensive as rendering, you still have to read the object, every single pixel has to be reading that object.
So let’s try to do better than that. And the approach that we’re going to use is that we’re going to chunk the problem into tiles. So we’re going to have a two-stage pipeline, and the first stage is going to be at this tile level of granularity, and we’re going to do that filtering of which objects are relevant at this tile level of granularity where there’s… If these tiles are 16 by 16 pixels, then there’s 1/256 as many of them, which is a pretty big difference.
And then we’re going to tune this tile size so that it matches almost exactly the optimum thread group size for scheduling on the GPU. And then we’re going to have a second stage which takes those objects within those tiles, does the rendering, and writes those out.
I’m trying to present this in a functional mathematical notation, hopefully still pretty simple, and I’m just going to introduce some notation about taking a rectangle of pixel locations and just breaking those into tiles and then having this stitch function, which the only thing it does is just take the results of the rendered tiles and stitch them together so that this is going to fulfill this identity, that if we run some function over all pixels in the whole scene, that’s the same as breaking it into tiles, running that function for all the pixels in each tile, and then stitching that back together.
So this is the functional program that results from using those primitives, and it’s not too bad actually. We’re getting closer to something that we might want to run for real. So walking through the program, that it’s going to map those tiles, basically generate a tile for each tile in the buffer that we’re trying to render, and then for each tile we’re going to do this filter operation where we say which objects have a bounding box that intersect that tile. We’re going to collect those objects, and then we’re going to do this two level map where for each tile we are going to do all of the pixels within the tile, call the render operation, and then walk those back up the tree.
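Here is a miniature Python sketch of that two-stage pipeline: stage 1 bins objects into per-tile lists using bbox intersection, stage 2 renders each tile from its (much shorter) list. The tile size, data layout, and names are all illustrative, and blending is elided to keep it short.

```python
# The tiled pipeline in miniature: per-tile binning, then per-tile
# rendering from the binned command lists.

TILE = 4  # tile size in pixels (a real renderer might use 16x16)

def intersects(bbox, tx, ty):
    """Does this bounding box overlap tile (tx, ty)?"""
    x0, y0, x1, y1 = bbox
    return (x0 < (tx + 1) * TILE and x1 > tx * TILE and
            y0 < (ty + 1) * TILE and y1 > ty * TILE)

def render_tiled(width, height, objects):
    tiles_x, tiles_y = width // TILE, height // TILE
    # Stage 1: one "thread" per tile writes its command list.
    cmd_lists = {(tx, ty): [obj for obj in objects
                            if intersects(obj[0], tx, ty)]
                 for ty in range(tiles_y) for tx in range(tiles_x)}
    # Stage 2: each tile interprets its own command list independently.
    buf = [[0] * width for _ in range(height)]
    for (tx, ty), cmds in cmd_lists.items():
        for (x0, y0, x1, y1), color in cmds:
            for y in range(max(y0, ty * TILE), min(y1, (ty + 1) * TILE)):
                for x in range(max(x0, tx * TILE), min(x1, (tx + 1) * TILE)):
                    buf[y][x] = color   # blending elided for brevity
    return buf, cmd_lists

obj = ((0, 0, 2, 2), 7)                 # small object in the top-left tile
buf, cmds = render_tiled(8, 8, [obj])
print(len(cmds[(0, 0)]), len(cmds[(1, 1)]))  # 1 0
print(buf[0][0], buf[4][4])                  # 7 0
```

The per-tile lists in `cmd_lists` play the role of the per-tile command buffers the next slides describe: stage 2 never looks at an object that stage 1 filtered out.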
So, the scorecard is starting to look better. The parallelism is still very good. There is a lot of parallelism in here that we can exploit, that we’re running over all the tiles, we’re running completely in parallel for all the pixels within the tile, which is… you really want to do that. And your write bandwidth you still have this property that you’re computing a pixel and writing out just that pixel at the end. And then your work factor and your read bandwidth you’ve improved it by a factor of 256 approximately. So it’s certainly a lot better than it was.
Let’s see. So, yeah, so this is the very high level abstract functional view of it. And I just want to go back to how that maps closer to what you would do on the GPU in hardware, that it’s really two dispatches and so you still have that grid structure, so this is actually matching the hardware capabilities pretty well. And in the first dispatch you have one thread per tile that you’re generating, and that’s basically going to write out a per tile command list. It’s more or less a list of the objects within that tile, and it does a little bit of processing. And then your second dispatch is reading that command list and basically just interpreting it. It’s almost like a byte code interpreter.
One way of thinking about this is that the scene is a program in some programming language, and that your first pass is doing a specialization of that program into bytecode, and then the second pass is interpreting that bytecode. And that pattern works well. And I think there are other problem domains in addition to 2D rendering where you can do that, take the problem, break it up, do some specialization on it before it goes to the next stage in the pipeline.
So let’s look at this question of why our work factor is still not great, what can we do to make it better? And so what I’m going to do here is that I’m going to skip a lot of the bureaucracy around this and say that we’re just working on one 32 by 32 chunk of the problem. So we have 32 objects and we have 32 tiles that we’re generating, and we just want to run that filter operation so that each one of those tiles ends up with just the list of objects that intersect it.
And so if we do this in imperative code, this is the code for one lane, one thread, in GPU-speak. And so we're going to step j through all 32 objects in our input list, and we're just going to do a very simple comparison test: does that bounding box intersect our tile? There's an implicit index, in that each thread has its own identity, and so it's like, does the tile that I'm responsible for intersect with the bounding box of the object? And if so, then I'm going to do a memory write of pushing that into an output list.
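A CPU model of that per-lane loop might look like the following. The names are hypothetical; on the GPU, `my_tile` would be derived from the thread's invocation id, and the append would be a write into a per-tile output buffer.

```python
def intersects(bbox, tile):
    # Axis-aligned box overlap test; bbox and tile are (x0, y0, x1, y1).
    return bbox[0] < tile[2] and tile[0] < bbox[2] and \
           bbox[1] < tile[3] and tile[1] < bbox[3]

def filter_tile(objects, my_tile):
    out = []                       # this tile's output command list
    for j in range(len(objects)):  # step j through all objects in the chunk
        if intersects(objects[j]["bbox"], my_tile):
            out.append(j)          # memory write only when the test hits
    return out
```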
So when we run this on hardware, this thing that I said of the work factor not being great, we need to be analyzing that in terms of what’s really running on the hardware. And so this is the same transformation that I showed you before, that we’re basically seeing there’s an implicit i of the 32 lanes in the SIMD. And so you’re going to do a predicate in parallel for those 32 lanes. And each one of those lanes is going to be testing, does the bounding box intersect? And then we have another step that says if any of those hit, this is that kind of divergence business, if there’s any lane for which this is true, then we’re going to do a memory write in that lane if the predicate was true in that lane.
So we can look at this graphically, the execution graphically, and just imagine time progressing from left to right here. So the first thing that we’re going to do is that all lanes are going to read the same object. And this is called a uniform read. This is reasonably efficient. It’s not utilizing the read memory bandwidth as much as you might, but it’s also not wasteful, it’s not doing 32 separate reads from the same memory location. In the hardware it’s going to be one physical read, and then all of the lanes are going to use that.
And then we’ve got a bounding box test, and all the lanes are going to do that. And then we’ve got a write, and then it’s going to step through the 32 objects in the chunk. So, you can see just looking at this that during this write time in the common case that we can expect where most objects don’t intersect most tiles, that a lot of those are going to be empty, that we’re going to be spending a lot of time going through this and not actually writing anything out to memory. So the work factor, the utilization is not great.
And so here is where we're going to go into much trickier programming, where what I showed you here is essentially logic that is per thread or per lane, where they're independent of each other. And to really bring this up to the next level we're going to need to do something where we're going to do things in parallel and those lanes are going to be cooperating with each other. There's going to be communication between the lanes that is going to let us get more parallelism and more performance out of this.
So this is a little tricky. I'm going to do my best to explain it, and if it's not immediately clear, then you're not alone. But I'll do my best here. So this is the code, again, this is sort of what you would write in the shader language. Oh, sorry, this is actually in the explicit SIMD style. If you look at what's in the shader language, it's pretty complex, because you have to do these subgroup shuffle operations. But what it's going to do is that in parallel it's going to read 32 bounding boxes. So each lane is going to read a different bounding box from the source scene graph.
And then you’re going to do the bounding box testing in parallel so that each one of those lanes is now going to produce a bitmask of which bounding box is intersected with it. And then the magic is the second line, that you’re going to take those bitmasks and do a transpose operation. So you can imagine that there’s a 32 by 32 bitmask array and you do a transpose operation, and that transpose operation is where you need the interaction between the different threads, that you end up with the threads getting values from other threads.
So your transposed array is now basically telling you, so each lane is holding on to a bitmask that’s telling you which object intersected. So you’re going from which tiles did this object intersect, do the transpose that’s taking you into which objects does my tile intersect, and then you just iterate those. And I think hopefully it’ll be a little bit clearer when we go through this graphically, the timeline of it.
So, again, the first operation, we're going to be reading, each lane reading its own bounding box. So you're doing this highly parallel, you're using vastly higher memory bandwidth at that point. And then you do the bounding box test in parallel. And this can be done using bit operations, you don't have to do a separate calculation per tile: you can really take my starting X and my ending X and do various shift and mask operations. So you're really getting parallelism at the bit level to get all of those bounding box tests computed in parallel at that point.
So, again, those blue rectangles indicate the result of that bounding box test. So you're computing a vector of those, and for each lane in that vector you've got a hit when that tile intersects with that object in that lane. So you do this transpose operation so that now each lane is corresponding, instead of to an object on the input, to an output tile. And so that bitmask is basically a list of which objects it needs to write out. So you only need to iterate up to the maximum population count, in this case it's two, and so we only need two cycles and all of those can be writing out in parallel.
And your work factor is much higher. You’ve got many fewer of those cycles just saying, oh, the bounding box didn’t hit, so I don’t need to write anything out this time. And then you’re able to go onto the next chunk. Of course, I said 32 by 32 and this is only eight, but it would have been really small writing if I had shown 32. So you go to the next cycle, you read another parallel chunk, and if none of those intersect at all, which is actually going to happen, then you just immediately go, you can just immediately tell, oh, nothing intersected here, I’m just going to go on and read the next thing. So you’re able to chunk through that source input at a vastly higher rate.
So I’ll go back over this code and show the steps, that you’re reading in parallel, doing a matrix transpose, and then iterating over just the work that’s needed to be done. So we haven’t actually implemented this yet. This is a design. My expectation is that the performance is going to be dramatically higher than the current implementation of this. And in order to find that out we need to start writing a shader code and deploying it on real hardware. And unfortunately, as I touched on before, that’s still a little tricky. This is evolving.
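Since this is a design rather than a finished implementation, here is a rough CPU model of the idea: each "lane" computes a coverage bitmask with shifts and masks, the bit matrix is transposed so lanes correspond to tiles instead of objects, and each lane then iterates only its set bits. All names are invented, the subgroup is 8 lanes wide instead of 32 for readability, and real shader code would do the transpose with subgroup shuffles rather than a nested loop.

```python
N = 8  # lanes per subgroup (8 here for readability; 32 on typical hardware)

def coverage_mask(x0, x1):
    # Bit-level "bounding box test" along one axis: set bits x0..x1-1
    # in one shot using shifts and masks, no per-tile comparison loop.
    return ((1 << (x1 - x0)) - 1) << x0

def transpose_bits(masks):
    # Bit-matrix transpose: out[t] bit j == masks[j] bit t.
    # (A shader would do this with subgroup shuffles; this shows the semantics.)
    out = [0] * N
    for j in range(N):
        for t in range(N):
            if masks[j] >> t & 1:
                out[t] |= 1 << j
    return out

# Lane j reads object j (here just an x-span) and computes which tiles it covers.
objects = [(0, 2), (1, 4), (5, 6), (0, 0), (0, 0), (7, 8), (0, 0), (3, 5)]
masks = [coverage_mask(x0, x1) for x0, x1 in objects]

# After the transpose, lane t holds a bitmask of which objects hit tile t,
# and only does popcount-many writes instead of idling through misses.
per_tile = transpose_bits(masks)
hits = {t: [j for j in range(N) if per_tile[t] >> j & 1] for t in range(N)}
```

If an entire chunk's masks come back zero, the whole subgroup can skip ahead immediately, which is the "read the next thing" fast path described above.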
So, certainly CUDA has been at the forefront of people running compute workloads on GPUs. I think CUDA was really the earliest practical way to do this. And then parallel to CUDA was this thing called OpenCL which was going to be the industry standard common version of it. And OpenCL seems to be dying. It’s not really supported in an active way by most of the mainstream vendors that are out there.
So what's happening today, you can certainly still use CUDA, and for machine learning workloads very few people are even trying to do anything other than CUDA, but you are seeing a lot more compute workloads being run through basically gaming-oriented 3D graphics APIs. So if you use DirectX 12 or Metal, these now have the capability of running compute workloads, and they're getting fairly sophisticated. I think as we evolve towards the future, certainly CUDA is going to continue to be a great choice, but I think we're also going to see Vulkan and WebGPU, and I'm going to talk about those a little bit more.
And one of the things that I think is also a really exciting development is that we're finally standardizing on a common intermediate language called SPIR-V. And so this is making it so that you don't have to have one monolithic tool that goes directly from your source language to your GPU hardware: you can compile to this intermediate language using whatever open source tool you like, and then the driver's responsibility is just to go from that intermediate language to the hardware.
So the question is, if you use some layer, these are all layers, these are all abstraction layers, and when you use that layer, how much of the real compute power are you being able to access, and how much of it is hidden from you? And this is a little tricky, because the hardware is advancing.
And so what you generally find as a pattern is that every time there’s new Nvidia hardware there’s a new version of CUDA which does a really good job of tracking that. And if you want to access that capability through DirectX 12 or Vulkan it’s going to lag a little bit. So I think that’s still the question, how much are you missing? And I think ideally it’s going to start to mature and converge a little bit, but I think this is one of the reasons why you haven’t seen as much of this in the past.
So I want to talk a little bit about shader languages. I know there’s a lot of programming language fans here, and there’s a tremendous amount of opportunity to do interesting things in this space. And I think increasingly now that we have SPIR-V as this intermediate format which is going to start working really effectively, that you can go from a shader language into an intermediate language, and then have a pretty good chance of that running efficiently and reliably on the hardware.
So this is a pretty complex space, and I think we're seeing some evolution, that now we have a lot of open source tools, and standards like SPIR-V that are becoming more central, but the picture used to be the vertical arrows, that you'd have a shader language which is specific to some ecosystem. So you have the Metal Shading Language from Apple, and you have HLSL, which is the one from Microsoft for DirectX. And then that compiles down to an intermediate representation, which is also specific to that platform, so you have AIR and metallib on Apple, or DXIL or DXBC for Microsoft.
And increasingly I think that's going to be SPIR-V, and then if you really want to run that SPIR-V on other hardware there are transpilers going the other direction. So if you've got something that can't run Vulkan but can run HLSL or Metal, then you use SPIRV-Cross, and that takes your intermediate language back up to a textual representation, and then you can use your… I think of this as a legacy compilation pass. And so that works, and people are doing that.
And then as a brand new piece of this ecosystem there is the WebGPU Shading Language. And this is breaking news: it has been very contentious what shader language will be used by WebGPU. And in fact, the WGSL proposal just got accepted this Monday. And the way this is defined is that it is a bijection with SPIR-V: it is a textual format, but it is equivalent to SPIR-V, so there's nothing you can express in one that you can't just as easily express in the other.
And so the compiler in both directions for that is Tint, which is a Google initiative. And I think that going forward that is going to decrease the friction for getting these things deployed on real hardware.
So I want to talk about Vulkan and SPIR-V, because again, this is a story where I think today this is starting to become a good foundation to build things on, but that's a relatively recent development. Vulkan and SPIR-V really happened four years ago, and they were able to run compute kernels and gave you access to thread group local memory, roughly equivalent to the previous generation of graphics APIs like DirectX 11 and the first version of Metal.
And then I think Vulkan 1.1 brought a huge amount of additional functionality. So it brings you those SIMD shuffle and reduce operations that work at the ALU level, so they give you that two-nanosecond latency as opposed to having to go through thread group memory. It gives you a restricted form of pointers, which is something that you need in a lot of compute workloads. And so Vulkan 1.1 is roughly at the current generation of GPU APIs like DirectX 12 and Metal 2.
And then I think as a sign of how things are going in the future, there's Vulkan 1.2, and that makes a lot of things that were optional now standard. And one of the things that's most exciting to me is that for the first time there is a written-down formal memory model for GPUs. Before, you'd have an atomic add operation and you would read the documentation and it just says this is atomic, and the reality is a little bit more subtle than that.
And it’s very interesting because obviously this is based on the C++ memory model, but GPUs are a little different, there’s a distinction between visibility and availability, having to do with the way that computations are staged. And so that captures that. And I think that that allows us to do more sophisticated interactions of the different parts of a larger compute workload.
And then there's other things that are really important, like for example, 64-bit atomics, which require some hardware support, and then 16-bit floats are really important for machine learning workloads: you can get a lot more multiply-adds per cycle through that. It's still optional in Vulkan 1.2, but if the hardware can do it, it's exposed to you. And again, that just happened last month.
So I think going forward one of the most exciting things is WebGPU. And this is still prototypes, this is still proposals that are working their way through the standards process. It’s not really something you can use today. Although, you can play with it today. You can grab these prototypes and start trying them out.
It felt like it might be completely blocked, it felt like it might run completely aground, because Google and others wanted SPIR-V, they wanted a binary shader format, and Apple said, "No, we don't want that, so we're going to propose WSL, which is a textual shading language," and everybody else said, "We have to invent a completely new shading language from scratch and figure out its semantics and how to map that to hardware? That's way too big a problem, we can't solve that problem."
And so just in the last really few weeks there’s been this new WGSL proposal which splits the difference. It says we’re semantically the same as SPIR-V so we can piggyback on all of that work to define SPIR-V and pin down what its semantics are, but it is not a binary format, it’s a textual format, it’s native to the web in that way.
So there’s a couple of challenges that when you transfer buffers around you might have an extra copy or a little bit of inefficiency. And I think this really reflects a different way of thinking about safety, that in the gaming world, what is the contents of memory, if you read from memory when you start a thing, it’s like whatever was in there before, and that’s not really good enough if you’re trying to reason about safety.
So there’s a lot of logic that they’re having to do to say, are we in the same trust domain, and if so, we can run it, but if we’re not, then we have to do some maybe explicit zeroing or some additional barrier or synchronization so that we don’t get access to memory that we shouldn’t be. And so this is still work that’s in process.
And these are all open source. And I think that it makes it so that when this stuff becomes real, then you bind those libraries and you call those libraries to get access to your hardware.
So now that we have just emerging this solid foundation, I think it starts to get interesting to compile other languages than these shader languages, which are really primitive C profiles. And so I want to call attention to the five that I think are most interesting. And so in no particular order, Halide is a language that is really designed for imaging, like computational photography, like running your HDR algorithms for cameras. And it is a DSL that’s layered on top of C++, it’s a collaboration between Google and MIT, and it’s really successful. People are running a lot of images through Halide.
Futhark, I don't know how to pronounce it, is a functional language in the ML family that targets GPUs. And I think on this list it's on the more experimental end, but I think it's interesting. I think we want to find out, can you effectively write ML programs and compile them to run efficiently on a GPU? And I think Futhark is the project that's trying to answer that question.
Co-dfns, if you have not seen it, I really recommend it, it will melt your mind. It is based on APL, it is an array language, it has its own compiler, and I mean, it looks to me like a view into a parallel alternative universe. It’s very different than a traditional sequential model. They’re targeting the GPU, and I think it really is… It just expands the space of what kind of things you can do.
Julia is trying to go into the machine learning workload space, and so they're mostly at this point layering on top of CUDA, but there's a lot of good things about it, like for example, you can play with it interactively, and that's hard to do with a lot of these others… like CUDA doesn't really provide interactive REPL tools for that. And so I think that's a niche that Julia can fill very well.
And then what might be the most impactful ultimately is the MLIR initiative from TensorFlow, where you're going to take your TensorFlow graphs of machine learning calculations, and MLIR will be able to convert that into SPIR-V that can be run on Vulkan. And so if that works, that gives you a really good alternative, that you don't have to use CUDA, that you can go through these more standardized interfaces.
So I’m going to make some predictions, that, as you can tell, I believe in SPIR-V and I really feel like that’s a solid foundation. It has a memory model, you actually know what it’s going to do, it has semantics. And now that we have that we can be building tools and languages, and we desperately need tools and languages, because the current state feels really primitive to me. We’re going to be able to deploy this on WebGPU, and I think part of the challenge, part of the difficulty is that every hardware, every API is a little bit different, and I think WebGPU will bring some coherence to this space where everybody is at least going to be able to run WebGPU workloads and they will compete on how well they run.
And it will have a test suite. That is important because to a first approximation GPUs don’t work. I think in my speaker notes I say that people who program for GPUs are extremely religious because when they write code they pray a lot. And so having a test suite that really exercises the capabilities, stress tests these cards, is a new thing that we have not seen before, and I think WebGPU will be the center of activity there.
So then I think that creates the ecological opportunity, that there's all of this incredible bandwidth, that we're letting 90% of the throughput of our computers go idle most of the time unless you're running a game, machine learning, or cryptocurrency mining, and so why not have new languages that can actually exploit that? And I think that we're going to see that.
I do not want something that takes a very high-level program. A lot of the academic literature on this says: take a functional program written at a very high level and have a sufficiently smart compiler that maps that to the hardware. I don't want that, because I don't think that's possible. What I want is something where, if I have my idea of how I map the problem to the hardware, I just want to be able to write that down with a minimum of friction in getting it to run. And I think that, if you're thinking of making a language, that's what I want you to be working on.
And I think we’re going to see applications really start ramping up, that when you need throughput this is the way to get it, because we’re not going to get it through single threaded performance. And the hardware is just universally available now, and the APIs are becoming so. So I think that a lot of these workloads are going to migrate off of just being CUDA and requiring proprietary tools and onto Vulkan and WebGPU.
So, there's a bunch of things, that if this taste is appealing to you, there's a lot of things that I touched on very briefly, and there's much deeper resources: there's papers that do benchmarks and explain architectural things at this link on my blog. And we also have a channel on our Zulip chat instance which is really focused on the 2D renderer, but it's a good hangout place, there's a lot of people that know a lot of different aspects of the problem and some interesting discussions, and everybody is certainly welcome. So with that, anybody have any questions?
Hey, thanks for the chat.
I had a quick question about the algorithm that you proposed. So when you do the bounding box check and then you do the transpose, and then you mentioned in certain cases you don’t have to do a write, so you can skip the write and then you go directly to the next one and that can speed it up, can you sort your objects in a way such that you maximize the chance that you can skip in certain groups?
Yes. That's a complex question, because the real question is, does the speedup that you get from that outweigh the cost of doing that additional sort operation? And I don't have enough insight yet to know. And so my current approach is to do things as simply as possible in a way that seems like it'll give a reasonable speedup. And then that opens up a lot of explorations, like can you sort, because the other thing you can do is you can reorder. If two objects don't intersect, then you can reorder them, and then maybe you want to put them into B-trees so that you're able to skip over them.
I mean, the space of what you might be able to do is really big and rich, and what’s really going to work is something that we just have to explore. But, thanks, that’s a good question.
I was wondering about, earlier you said you break up the computation that you want to do into tiles that are perfectly sized for the thread block.
And you don't want a sufficiently smart compiler, but thread blocks are a programmer concept that don't map directly to hardware, right?
So the hardware can have different sized optimal thread blocks for different GPUs, so how do you reconcile that without rewriting the program?
Yes, it's a very good question. And so, one of the things that I glossed over a little bit when I was talking about this hierarchy is that you have both thread groups, which are under programmer control, and you also have this concept that has different names in different APIs that I'll just call subgroups. That's what it's called in the Vulkan/SPIR-V world, and it's really the SIMD lane width. And that is something that the compiler determines.
In the case of Intel it's very dynamic, in that you might not know until the last possible moment what that width is. And making the algorithm optimal, if it's dependent on the subgroup size, is a hard problem. There's actually a Vulkan extension that allows you to control that, but that extension is not widely implemented. It's not running on my Windows machine.
So that is exactly the kind of detail that is hard. And if you are today trying to get a machine learning workload running on this, then what you are going to do is a lot of experimentation: you're going to have a lot of different versions of your kernel that have been tuned to run effectively on different hardware, and you'll have a lot of heuristics where you're querying the hardware to try and figure out, based on that query, which one to run.
And I don’t have an answer to that. I think there are things you can do to organize it and make it less ad hoc, but I’m still skeptical about something where you just magically put the problem in and it automatically cranks through that and gives you the optimum layout.
I used to work on a different project, an open source one, that had this problem, and we knew what our workload would be far into the future, so we could spend a little bit of time upfront benchmarking different sizes and then go with the best one, but I don’t think we all have that luxury.
Firstly, thank you very much for coming. My question is on, there were times in your presentation where graphics and machine learning applications were unified and sometimes they diverged, to what extent do you think those will continue to either diverge or unify?
Super good question. I mean, certainly one of the ways that they’ll continue to diverge is that you’re going to see more and more special purpose hardware that is just for tensor multiplication of smaller precision values, because that is your machine learning workload, and that is not something that is going to be particularly effective in other domains like graphics or data analytics. I didn’t really talk about it, that was an unspecified assumption that I think… And there are people that are doing GPU databases and GPU analytics, because I think that can work, but a lot of the investment in specialized hardware is going to be focused on the machine learning workloads because that’s where the money is.
And then in terms of convergence, I mean, I think the general answer to that is just tools and deployment of making that smoother and lower friction, that that will help both machine learning and other workloads equally, like WebGPU certainly.
Looks like we’re great.
All right. Thanks for coming.