The Saga of Multicore OCaml

Jane Street is an electronic trading firm that uses low latency trading systems built in OCaml to provide liquidity to financial markets worldwide. In December 2022, after nearly a decade of development, OCaml 5.0 was released with OCaml’s first multi-core capable runtime. This was an exciting milestone, finally making it possible to write shared-memory parallel programs in OCaml. The new runtime was designed to be easy to adopt: it didn’t disturb OCaml’s FFI, and performance was meant to be only a few percentage points slower in single-core mode. Despite those promising beginnings, switching to runtime-5 within Jane Street was harder than we expected. Indeed, we’ve only just switched to it this year, after 2.5 years of research and engineering effort. This talk will give an overview of the problems we ran into, and what we learned from the process, including some new ideas that solved some very old problems in the design of OCaml’s GC.

Transcript

All right. Yeah, this is awesome. This is like really a full house, very exciting. So yeah, so at work, I often have to apologize for having inflicted OCaml on Jane Street many years ago. And here again, I need to apologize for inflicting OCaml on you all.

So this talk is, in some sense, about some particular technical problems that we worked through as part of onboarding the new multi-core garbage collector for OCaml. And in part, it's a story about technology transfer: how ideas move from academic sources into industrial use, and what are some of the things you need to do along the way to make that happen successfully.

So like a little bit about OCaml, which you know, Jane Street is a huge user of it, has been for a long time. OCaml is an old language, it’s like 30 years old, roughly about as old as Java. And for a very long time, it did not have any kind of shared memory parallelism. And this changed in recent years. And like in some sense, you know, Jane Street has kind of built its whole ecosystem around the kind of set of assumptions and properties that the language had. And now, we’re in a world where we have this new capability of being able to write shared memory parallel programs in the language. And this talk is kind of focusing on one aspect of that.

But in terms of where to start about this, I wanna start with a paper which isn't really the beginning of the story, but it's kind of an important part of the process. So back in 2020, there was this very nice paper that landed at ICFP, "Retrofitting Parallelism onto OCaml," that laid out the design process and basic structure of the multi-core garbage collector and the ideas behind how it was built.

The goal was to build a modern, high-performance, multi-core garbage collector for OCaml that would be, importantly, easy to maintain and easy for people to adopt, right? They didn't want to make the process of maintaining OCaml much more complicated. And one of the design decisions that came out of that was one runtime. Lots of systems end up having multiple runtimes for different use cases; they wanted to have a single runtime that would be reasonably well-optimized for both high-performance single-core workloads and high-performance multi-core workloads.

So that meant, to make it easy to adopt, it has to preserve the performance characteristics that you already had coming in. So sequential performance couldn't be compromised. They couldn't really massively change pause times, right? An important part of a garbage collector is not just the net performance of your program all in, but what it does to the responsiveness of it, how long you end up checking out, waiting for the garbage collector to do things. OCaml has a pretty good story in terms of having relatively small pause times. They wanted to preserve that, and they also wanted to preserve the FFI, the foreign function interface, because it's just kind of torture to go through all the C extensions that have been written and adapt them to a new set of invariants. I mean, the C extensions are already broken and have problems, you don't wanna make it worse by having to change them all.

And part of what the paper did is produce lots of benchmarks and evaluations to justify the design. So one thing they did, they looked at a bunch of experiments and showed that the kind of loss of sequential runtime performance for the new GC was like on the order of like 3%-ish, which you know, seemed pretty good. Although, every time someone tells you like the performance of a garbage collector, an important thing to remember is that a garbage collector’s performance is not a point, it’s a curve, right? A garbage collector is a thing that you can tune to have different behaviors, different trade-offs of basically how much work you do collecting versus how much memory you use. But in fact, the memory used was also shown to be relatively similar and the pause times were also very similar.

So that sounds pretty good, like 2020, this paper lands, there’s a new garbage collector and oh, and also had good speedups, so that was nice. It did take a really long time though to get to that paper and to get after that paper to getting it upstreamed. So this is like nine years of work from when the original multi-core project was born to when it actually got upstreamed. There are like, you know, the green things are papers that were published there. There’s a bunch of important points in time. Like you know, in 2020 after the paper was published, the core team had committed to working really hard to getting that implementation actually upstream. And like the ordering of that is like a real thing. I think if you want to convince the OCaml upstream community to accept something, you should probably write a paper or two to get them to accept it.

But it's great, you know: finally in 2022, there's this garbage collector that's easy to adopt, it's been upstreamed. Surely, Jane Street as a big user of OCaml can immediately turn around and use it? No. There's an extra two and a half years from the release in 2022 to us being able to use it internally.

Some of that time was not surprising. So the things marked in blue are just features that got added to the old sequential garbage collector while the new multi-core garbage collector was being worked on. And it was just kind of known that there was extra work to forward port these extra features and like they’re good features, prefetching is actually, I dunno, prefetching is kind of an amazingly good feature for a garbage collector. I was actually just talking to someone who worked on the Python garbage collector, which apparently copied some ideas from that design. It’s like a great thing to add. So anyway, we had to like bring back all those features that took a lot of work.

But the two orange points in the timeline are interesting. So the November 2023 one was like, oh, we need to migrate to a new runtime. And each version of OCaml has just one runtime, and that's bad, because to adopt the new language you also have to immediately adopt the new runtime, and switching runtimes is hard. So we're like, "Okay, let's just backport it." So we'll have the latest version of the language supporting both runtimes. And then once we had that, now we can start doing experiments. And as soon as we started doing experiments, we're like, "Oh, the performance of this thing is actually totally not acceptable." And then there was like a year and a half of working really hard on trying to understand what those performance degradations are and fix them. And that's what this talk is about, okay.

So I'm gonna walk through, I'm gonna kind of describe the garbage collectors to you. I'm not gonna assume that everyone is a garbage collection expert. So I'll talk a little bit of background about how garbage collectors work. And then we'll talk about some of the issues and what we learned about them.

So first, what is runtime four like? What's the classic OCaml garbage collector like? First of all, it's sequential, right? Premise of the talk. It's not a multi-core collector. It's also generational. So a generational collector is built on what's called the generational hypothesis, which is a surprisingly stable and sort of broad observation about the behavior of programs in most programming languages, which is that memory dies young: the vast majority of things that are allocated end up not having very long lifetimes.

And so you can take advantage of this by breaking up your collector into really two garbage collectors that connect to each other. A major heap where the garbage collector uses what's called a mark and sweep collector, and then a minor heap which has a kind of brutally simple allocator: you just have a region of memory, and when you want to allocate, you just bump the pointer forward, and then you do it again and you do it again, and it's like three machine instructions to allocate a thing. And then when you get to the end of the region, you're like, "Okay, I'm outta space, what do I do now?" So then you scan the minor heap to see what's still alive, and if the generational hypothesis indeed holds, most of the things will be dead. You don't have to do any work to collect the dead things, and the live things you promote: you move them into the major heap and they'll get collected there.
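Just to make the minor-heap allocator concrete, here's a minimal OCaml sketch of bump-pointer allocation. It's an illustration with our own names, not the runtime's code; the real allocator is a few machine instructions operating on raw pointers.

    (* A toy model of bump-pointer allocation in the minor heap. *)
    type minor_heap = {
      mutable next : int;  (* allocation pointer, in words *)
      limit : int;         (* end of the minor-heap region *)
    }

    let allocate heap n_words =
      if heap.next + n_words <= heap.limit then begin
        let addr = heap.next in
        heap.next <- heap.next + n_words;   (* just bump the pointer *)
        Some addr
      end else
        None  (* out of space: run a minor collection, then retry *)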

And then there's one other clever trick you need, which is a write barrier. It's not the only way of doing it, there are other tricks, but the trick that OCaml uses is a write barrier, which is basically a way of detecting whether there are pointers back from the major heap to the minor heap. So why does any of this matter? I said this thing of like, "Oh, we're gonna scan the minor heap to see what's alive," but you might wonder, how can I cheaply scan just the minor heap? It's this very small region, and in this other big major heap there's lots of other stuff. Maybe something in the minor heap is alive because there are pointers through the major heap to the minor heap. But if you imagine that you're in a completely functional environment where you're just allocating new stuff, all the new stuff is on the minor heap, and things on the major heap can't point to the minor heap, because if you're not mutating things, you can't create those pointers. And in a functional language, that's close to true. You have lots of allocations and comparatively few mutations, but when you do a mutation, you basically need to check whether you've introduced a back pointer and then do some extra bookkeeping, and that's the write barrier. It's the extra piece of code on every mutation that checks whether you've introduced a back pointer. So that's another part of the design.
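To make the write barrier concrete, here's a small, hedged OCaml illustration. The type and values are ours, and whether a particular value lives in the minor or major heap is up to the runtime, so treat the comments as a plausible scenario rather than a guarantee.

    type cell = { mutable contents : int list }

    (* Imagine this record has survived a minor collection and been promoted
       to the major heap. *)
    let major_cell = { contents = [] }

    let () =
      let fresh = List.init 3 (fun i -> i) in  (* freshly allocated, so on the minor heap *)
      (* This mutation may create a major-to-minor pointer. The compiler routes it
         through the write barrier (caml_modify in the runtime), which checks for
         that case and records the pointer so minor collections stay cheap. *)
      major_cell.contents <- fresh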

And then there’s a mark and sweep collector. So what is a mark and sweep collector? So the marker basically is gonna start from the roots. The roots are the things that you know are alive like values on the stack or values that you have in registers. And then you’re gonna go through and like mark everything that’s reachable. You’re kind of gonna walk the object graph to discover what do you know for sure can still be accessed and then everything else you know can be removed. And then you have a sweep phase where you can walk over the heap and collect all of the unreached objects.

And let's just illustrate this for a second. So here is a little animation that I vibe coded with Claude Code, super fun. So here you have the stack there, which indicates the roots; we just assume all the roots are on the stack. And then you have this object graph. Everything starts out red, unmarked, and then in the mark phase, you start walking through and turning everything green, right? You go through and mark all the things, and now all the things that are red are the things that you know are unreachable, because you couldn't get there. So those are safe to be collected. And then you go to the sweep phase, and the sweep phase doesn't walk the object graph, it walks all of memory. So you just kind of go cell by cell. And when you get to something that's marked, you're like, "Well, okay, that's marked, so it's gonna stay around, and next round, I'm gonna need to scan it again, so let's change its color to unmarked." And then you keep on walking, and every time you see something that's already unmarked you're like, "Well, that's garbage, so I'm gonna free it." And you just kind of keep on walking through. And the idea is, as you go and do this, you get to clear out all the things that you don't need. And if you think about where that point is in the middle, the colors mean a different thing to the left of the sweep than they do to the right of the sweep, right? So it's all about walking through these garbage collection colors.
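Here's the same walk written out as a toy OCaml sketch rather than an animation; it's a model of the idea with our own names, not the runtime's representation.

    type color = Marked | Unmarked

    type obj = {
      mutable color : color;
      mutable children : obj list;
    }

    (* Mark phase: walk the object graph from the roots. *)
    let rec mark (o : obj) =
      if o.color = Unmarked then begin
        o.color <- Marked;
        List.iter mark o.children
      end

    (* Sweep phase: walk all of memory, not the graph. Marked objects survive
       and get reset to Unmarked for the next cycle; Unmarked objects are freed. *)
    let sweep (heap : obj list) : obj list =
      List.filter
        (fun o ->
          match o.color with
          | Marked -> o.color <- Unmarked; true
          | Unmarked -> false)
        heap

    let collect ~roots ~heap =
      List.iter mark roots;
      sweep heap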

So that's just to give you a rough sense of how a kind of classic mark and sweep garbage collector works. Differently from the picture that I just gave you, it's also incremental, right? So what does that mean? It means you don't just run the whole mark and sweep thing and have your program check out for all that period; you can actually instruct it to run for a period to collect a certain number of bytes, in what are called slices. You can use little slices that collect a few bytes at a time. And the idea there is that this makes your program more suitable for kind of soft real time applications. There's some trade off in terms of net performance, it's a little less efficient to do it incrementally for bulk throughput, but you have much more responsive programs, and it's just really important. You kind of can't have a reasonable programming language with multi-gigabyte heaps and a non-incremental garbage collector, it's just kind of madness, right?

All right, that's runtime four. Oh, right, oh, I shouldn't forget this. There's also what's called the snapshot-at-the-beginning invariant. What does that mean? Once you're incremental, sometimes your garbage collector is running and sometimes your user program is running. Very charmingly, garbage collection people like to call the user program the mutator. See, I don't know this thing, I don't know what it does, it mutates stuff. So you have the GC and the mutator running in alternate slices, and then the question is, what are you actually supposed to collect, right? If I am in the middle of the marking phase and new stuff gets allocated, do I just have to keep on marking to catch all the new things that are allocated? So in this kind of collector, the answer is no. The snapshot-at-the-beginning property means that you are going to guarantee that you collect everything that was unreachable at the beginning. But if new things are allocated and become unreachable, you don't need to collect those. You're gonna wait until some future mark to capture them. There are some trade offs of this particular design, but one nice thing about it is, you sort of get a guarantee that the mark phase will eventually end, right? Otherwise, you could imagine that you keep on allocating stuff in the mark phase and the mark phase just gets stretched out indefinitely.

Okay, that was runtime four. Let's talk about runtime five. So first of all, mostly the same: the design is fairly conservative and tries to keep a lot of the properties of the original one. It's not sequential anymore. That's different. It's parallel now. And another thing that's interesting about the parallel design: I'm gonna talk about a lot of interesting aspects of the garbage collection here, but mostly not really about the parallel algorithms. I'm gonna talk not about that interesting part, but about all the other things that went wrong. But an important thing to know about this collector is that it has a couple of different stop-the-world phases, where all of the different threads of execution that are running, and you basically have one kind of thread of execution per core on your system, all have to check in together to do something, and there are different things at different spots where they need to do that. And in order to make sure that happens, you have to actually make sure that the individual threads of execution will pretty frequently be able to stop and safely checkpoint their behavior and check in and maybe do something else. And so as part of that, there's a kind of compiler analysis where you basically insert into the code things that make sure that you will basically always have safe points in any execution. So, morally speaking, every time there's a loop, somewhere in that loop you have to introduce a safe point, right? And that's basically to make sure that the synchronizations are actually efficient. So that's one part of the design.
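Conceptually, the effect of safe-point insertion looks something like the sketch below. The poll is inserted by the compiler, you never write it yourself, and the function names in the comment are hypothetical, just to show where the check lands.

    (* A tight loop with no allocation: without an inserted safe point, this
       thread could keep every other domain waiting at a stop-the-world barrier. *)
    let rec busy_loop n acc =
      (* the compiler effectively inserts a poll here, roughly:
           if gc_interrupt_pending () then enter_safe_point ();
         (hypothetical names) *)
      if n = 0 then acc else busy_loop (n - 1) (acc + n)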

So there's a minor heap. How's the minor heap different? Well, there's not one minor heap, there are many minor heaps. There's one minor heap per core: you have one heavyweight thread per core and one minor heap there. And I won't go into the details of exactly how it works, but you kind of don't allow arbitrary edges between different minor heaps. The key point here is minor heap allocation needs to remain dirt cheap. And to do that, you want to reduce the amount of contention and communication around the minor heaps. So stuff that's really shared is gonna be in the major heap, and stuff that's not really shared can stay in the minor heap, and you can have this incredibly efficient allocation for it. And we're gonna do one stop-the-world collection at the level of the minor heap: when any single core gets to the end of its minor heap, they're all gonna come together and do the minor collections at the same time. You might wonder, what if they have different allocation patterns, different amounts of memory, different amounts of work they need to do, different amounts of promotion and stuff? And the answer is, it's no problem: if one of them finishes early, it will just spend some time doing major collection work while waiting for the other ones to finish up. So you're gonna use all the time, but you're gonna wait until the longest minor collection is finished.

And there's a major heap, one major heap for the entire system. It's shared, and it has, interestingly, a kind of merged mark-sweep design. One of the concerns that people were worried about in the original paper is that these synchronizations are expensive, and this is a design that goes back to an older paper on what's called the Very Concurrent Garbage Collector. It's a cute trick, which I'll talk about more in a second, where you can sort of run the mark and sweep stuff together without having an extra synchronization to explicitly end the mark and the sweep phase. You're just gonna do one thing to end a unified mark-sweep phase. And again, there's a stop-the-world sync at cycle end, and that's to kind of play around with the colors, and I'll talk about the details of how that works in just a sec.

So yeah, let's go back and talk about colors now. So first of all, what happens with the colors in runtime four? It's fairly simple. There's the marking phase, which takes unmarked things and moves them to marked, right? And then there is the sweeping phase, which takes things that are unmarked and moves them to free, and actually also takes marked and moves them to unmarked. So actually, there's an extra transition there which I forgot to put in that slide. But anyway, that's what the transitions have to do.

In runtime five, it's different. There's an extra color, so we're going to have four different states. There's marked, unmarked, garbage, and free. The marking phase is going to move unmarked to marked. The sweeping phase is going to move garbage to free. So there are two things to notice about this. First of all, it's cool: you can clearly run marking and sweeping concurrently because they don't touch the same nodes, right? The colors are disjoint. Also, it's weird, because your things have to get marked and then eventually swept, and here there's no interaction between the two, and that's not okay. So there's an extra color change. Oh, that's sad. Why did that happen? Sorry, there we go. These are the wages of doing a weird terminal-based presentation software.

So there's a trick that happens at the end of a cycle: everyone synchronizes, this is the synchronization point, and then you swap the three colors. Everything that's marked becomes unmarked, unmarked becomes garbage, garbage becomes marked, and that's what links the two things together. But you don't do this by walking the heap. What you do is, you change the interpretation of the bits that you use to mark objects. So you have a constant amount of space that basically represents the mapping between the bit patterns and their meanings, and everyone just syncs together and changes the meaning of the bits and off they go. So it's an extremely cheap synchronization.
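A sketch of what that looks like, with our own encoding: the object headers keep their bits, and only a small interpretation table changes at the synchronization point.

    type meaning = Marked | Unmarked | Garbage | Free

    (* One small global table giving the current meaning of a color field;
       this is the whole "constant amount of space". *)
    let interpretation = [| Marked; Unmarked; Garbage; Free |]

    let color_of_bits bits = interpretation.(bits land 3)

    (* At the stop-the-world point at cycle end: marked -> unmarked,
       unmarked -> garbage, garbage -> marked. No object header is touched. *)
    let rotate_colors () =
      for i = 0 to Array.length interpretation - 1 do
        interpretation.(i) <-
          (match interpretation.(i) with
           | Marked -> Unmarked
           | Unmarked -> Garbage
           | Garbage -> Marked
           | Free -> Free)
      done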

All right, so that's the system, more or less. There are lots of other details of how it works, some of which I may not actually know if you ask at the end, but that's what's mostly relevant to the story I wanna tell today.

So how did it actually work? So we actually ran a bunch of tests and found a bunch of regressions. Remember, I said there's a trade off: sometimes you can be faster in CPU terms and worse in memory, and sometimes you can be worse in memory and faster in CPU terms. This was kind of worse in both. There were some programs that were running 10% to 20% slower. You may have noticed, 10% is larger than 3%. That's not great. And some programs were using 10% to 20% more memory. And actually, as we dug in more, there were worse regressions than that, but we found these pretty early and we were not happy. And then there were a bunch of sources of what caused these that we found over time, and there's another version of this talk where I go into detail about the story of how we discovered all this crazy stuff, but I think that would be like a three hour talk. So we're not gonna do that one.

Suffice it to say, it was enormously painful to figure out and discover all these problems, but a few different things went wrong. There were problems with the pacing of the GC. When you talk about pacing in a GC, it's the speed with which you run the garbage collector versus letting the mutator run, how many cycles you spend on the collector, and the pacing was off in ways that turned out to be important. Another thing is transparent huge pages. You know, it turns out, when God created Unix in the 1970s, page sizes were 4K and that was gonna be good enough for everyone, except it's actually a terrible idea on modern hardware and you need much bigger pages, and there's a whole complicated system in Linux for transparently giving you the huge pages you need, except when it doesn't. Anyway, it's all implicit, and you should always be worried about things that will implicitly do the right thing for you, 'cause then sometimes they don't. So anyway, there was a bunch of complicated stuff: the old version was built before transparent huge pages existed and it happened to work well with them, and the new version was built after we'd had them for a long time and it didn't. So that was a problem to figure out.

There's the old legacy systhreads system, another kind of threading system that wasn't parallel, but was threads, and context switching was painfully slow there, and that turned out to be a problem. There are these stack checks: there's a whole new algebraic effects system for doing fine-grained parallelism, a sort of fine-grained tracking of threadlets essentially, on top of the kind of heavyweight system threads. And that involved a new kind of check for when you're over the edge of your stack, and that turned out to be slow. And then there was just kind of a general lack of optimization in the code. The code just wasn't as beaten to death and optimized as the old version, and that came up in a bunch of ways.

So I do not have time to talk about all of this. So I’m gonna talk about pacing, right? There’s a bunch of stuff we learned about pacing and I’m gonna cover most of that.

So again, let's talk about pacing in runtime four first, how that works. There are two cases: allocations on the heap, the managed heap, and allocations off the heap, which it turns out are managed differently. So anytime you build a garbage collector, you need to decide how users will express how they want the garbage collector to make the trade off between compute and space. The classic thing, if you go back and read a GC paper from enough years ago, is always about Java, 'cause that's what all the papers were about for a long time, and the basic model was a roof line model. You're like, "How should I tell it how hard to work?" Well, here's the upper bound of memory you're allowed to use, and when you get closer to that, you better work harder to collect. This, I think, kind of makes sense if the thing you're doing is enterprise style: I'm gonna run a single big program on a single box and it has a fixed amount of memory. This is not how we live our lives. We have lots of different little programs of different sizes, and the roof line model is kind of insane for that, because you don't know how much memory each one needs; what you really want is for it to not waste too much memory.

And so instead, OCaml is tuned via a parameter called space overhead, which tells you essentially the percentage of space that you're allowed to waste, right? Can I use 30% extra, 60% extra, 120% extra? That's the basic parameter that you're controlling. Which I think is a good design choice. I think that's actually mostly better than the thing that Java did historically, although, you know, lots of GCs have evolved to have many more tuning parameters.
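For what it's worth, this knob is exposed directly through the stdlib's Gc module, and also at program startup via the o parameter of OCAMLRUNPARAM. A minimal example, picking 120% as an arbitrary value:

    (* Allow roughly 120% extra memory in exchange for doing less collection work. *)
    let () =
      let params = Gc.get () in
      Gc.set { params with Gc.space_overhead = 120 }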

And then we did a bunch of analysis, not we, the people who wrote the original garbage collector did a bunch of analysis, based on a steady state assumption. The steady state assumption in a garbage collector is: if you imagine a long-lived program, how much stuff is it collecting? Well, probably about as much as it's allocating, right? Because otherwise the memory usage is gonna keep on growing without bound, if those things aren't in balance. So the steady state assumption basically says, what happens if we just assume that we need to collect as much as we allocate? And what's cool about this is, it lets you basically solve the problem with high school algebra. You just have a bunch of equations about what things need to do and how many bytes you need to collect and whatever, and you can just derive a number. It's just a constant that tells you, for every word that gets promoted to the major heap, there's a certain amount of sweeping and a certain amount of marking that you need to do. So it's saying, you should only clock the major heap based on promotions to the major heap. The minor heap doesn't matter, right? There really are two separate garbage collectors with a kind of connection between them, and the pacing story is very different between them. So that's the on-heap story.
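Roughly, and with our own notation rather than the runtime's: write o for the space overhead as a fraction and L for the live major-heap data. The heap is then allowed to grow to about L(1 + o), so a full cycle frees about L·o words of garbage, marks about L words, and sweeps about L(1 + o) words. At steady state, the words promoted per cycle equal the words freed, L·o, so per promoted word you owe about

    marking per promoted word  ≈ L / (L·o)       = 1/o
    sweeping per promoted word ≈ L(1+o) / (L·o)  = (1+o)/o

and L cancels out, which is the "constant work per promoted word" property described above.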

The off-heap story is very different. So what do I mean by off-heap? I mean, I want to allocate a region of memory. We have a thing called a bigstring in OCaml, which is basically a malloc'd piece of C memory; we want it to be C memory 'cause we want to use it for IO and we don't want it to be moved by the garbage collector. So it could be that kind of thing. It could be that you have some library using PyTorch and it allocates objects. It's anything where some foreign system is allocating memory that you're going to be responsible for. Responsible for in the sense that there's gonna be something on the OCaml heap that has a handle to the off-heap memory. So you really do want to drive the speed of the on-heap collector in a way that is responsive to the amount of off-heap memory, because doing the on-heap collection is what lets you recover the handle to the off-heap memory and eventually collect it back. So if you're sort of ignoring it, oh, I have all this enormous off-heap stuff I'm allocating and it doesn't make you work any harder, then you're just gonna run out of memory.
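The stdlib's Bigarray module gives a concrete example of this pattern, and it's essentially what a bigstring is:

    (* The Array1.t value returned here is a small handle on the OCaml heap;
       the len bytes of storage are allocated outside the heap (malloc'd) and
       are only released once the handle itself is collected. That's why
       off-heap allocation has to feed into GC pacing at all. *)
    let bigstring_create len =
      Bigarray.Array1.create Bigarray.char Bigarray.c_layout len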

So how does the off-heap pacing work? The basic idea is there's a new number called extra heap resources. Think of it as a number that goes from zero to one, and when it gets to one, it means you should do an extra cycle, you should do some extra work to collect those things. And then there's just an equation that tells you how much you should increment the number by every time you do an off-heap allocation. And that's N / (K × H), where N is the amount you just allocated off-heap, H is the size of your ordinary heap, and K is some tuning constant that came, I don't remember, from working out some algebra or something about how much work you need to do for each one of those. And you can get a sense of why this is sensible, kind of heuristically: ah, if I'm allocating stuff that's about as big as my heap, that means I really wanna do more collection of that heap and get to the end of a whole cycle so I can actually collect things. And so it seems kind of plausible. If you wanna know why, what's the deeper reason that it works beyond this kind of vague plausibility, I'm going to have to disappoint you, 'cause it actually works very badly. Turns out, it's not a good heuristic, but it's the thing that people came up with and it was there for a long time.
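Here's that rule written out as a small OCaml sketch, with our own names and a placeholder tuning constant; it's the shape of the mechanism, not the runtime's actual code.

    let extra_heap_resources = ref 0.0   (* runs from 0 to 1 *)
    let k = 1.0                          (* the tuning constant K; placeholder value *)

    let note_off_heap_allocation ~bytes:n ~heap_bytes:h =
      (* increment by N / (K * H) on each off-heap allocation *)
      extra_heap_resources :=
        !extra_heap_resources +. float_of_int n /. (k *. float_of_int h);
      if !extra_heap_resources >= 1.0 then begin
        extra_heap_resources := 0.0;
        (* time to do extra major GC work; sketched here as requesting a slice *)
        ignore (Gc.major_slice 0)
      end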

So that's runtime four. A thing I wanna mention about this is, we now kind of get into control theory land, right? We have these equations that are telling us how our system should respond to its own behavior, and there are two kinds of control systems that you can run into. There are open-loop systems, where your decisions are driven by observations of the world, but those observations are not influenced by your decisions. You're just kind of watching something exogenous and making choices about it. And what's nice about open-loop systems is they tend to be predictable: their behavior is easy to understand and easy to tune. And what's bad about them is, they're not necessarily good at actually hitting the target that you're aiming for, because they can't see the target you're aiming for, because the target you're aiming for is definitely something affected by your decisions, right? Because you're aiming for it.

So sometimes, you end up with what are called closed-loop control systems, and there, with a closed-loop control system, you end up looking at things that you are yourself influencing, and then you can hit the target right on the nose. But, I don't know, for those of you who've done high school robotics or something, you're in PID controller hell: you are trying to tune parameters, and if you are too aggressive about getting to something, you end up oscillating around the result, and if you're not aggressive enough, it takes you too long to converge, and ugh, it's just not a great experience to tune these systems to be just right in a wide variety of cases.

And the end result for this design is that the pacing for things allocated on the heap is open-loop, which is the good kind of thing, at least for a garbage collector. And the off-heap pacing is closed-loop. It's closed-loop because of that H: the capital H is a value whose behavior you are yourself influencing.

All right, so that’s the kind of pacing story. Let’s talk about what the results look like and in particular, so I’m gonna show this graph where on the X axis, we have the space overhead we requested, right? We told the system, you know, give me a space overhead of 30 or a 100 or whatever. And I actually don’t remember even what these numbers exactly mean, but like this is some way of expressing that like percentage overhead that you get. And then on the Y axis, you just measure it, you look at what the program actually does and like the black line in the middle is the thing you would like to approximate, right? You want to get what you requested, right?

So how did it actually work in runtime four? And the answer is like, mixed, depending on the benchmark. So there are like three different benchmarks here. The ones that are like looking pretty close to the black line are programs that are doing a lot of ordinary heap allocations and then the one that looks insane that’s like totally flat, nowhere near the line is one that’s very aggressively doing off-heap, like quickly cycling through lots of off-heap allocations. And that one, you can see what’s happening is the garbage collector is just working insanely hard, right? So that’s how runtime four looks.

Now, let's talk about what's the design of runtime five, how does that work, and then how does that change the behavior? And I guess, just to go back for a second, that line on the bottom is a serious problem. We use a lot of off-heap memory, it totally screws up the tuning of the garbage collector, and we've had to do a lot of annoying work to adapt the behavior of our programs to this misbehavior of the garbage collector. So this is a thing that we have tweaked and adjusted and tried to fix for a long time.

Okay, runtime five, how does that work? So the pacing approach for ordinary allocations is just the same. In general, one of the nice things about the original paper is it's a very conservative design: it tries to, and successfully does, preserve a lot of good decisions from the original collector. But it did fix a bunch of implementation bugs with respect to off-heap memory in particular. For example, one stupid simple thing that was wrong in the old implementation of the off-heap accounting is that it wasn't driven only by promoted memory: even if I allocated onto the minor heap a thing that held a pointer to off-heap memory, and it then got collected and never made it to the major heap, it would still affect the clocking. So that makes it massively more aggressive than it should be. That wasn't the only bug. There were a bunch of bugs, mutually opposing bugs, some bugs in this direction, some bugs in that direction. And when you're in a system like this, where the runtime behavior has various exciting bugs, and then you spend 20 years building lots of programs that operate in that system, the bugs become load-bearing, right? Because everyone has adapted to the behavior of the system, and you should be a little worried when you fix the bugs.

So what happened when we added runtime five? So here's the picture with the runtime five results on top of the runtime four lines. So much better. The lines that are going up super high, these are ones that are using massively more memory. These are the ones doing ordinary heap allocations. And then this kind of way-too-flat red line, that's better, it's closer to the black line, and that's mostly better 'cause of the bug we fixed with allocations on the minor heap: that particular benchmark is very sensitive to it, and so the behavior is much better there. But this is not good, and we didn't quite understand what was going on here. So what was going on?

So it basically turns out that ordinary collection in the new collector is just less aggressive, and in particular, the unified mark and sweep design that we stole from the Very Concurrent Garbage Collector turns out to be a lot less aggressive at capturing all of the garbage you need. And in particular, there's this problem charmingly called floating garbage in the literature, which is stuff that was allocated but can't yet be collected, because of this snapshot-at-the-beginning thing: it's gonna be collected in some future cycle, so it gets delayed till farther off.

So let's visualize a little bit what's going on here. My second vibe coded visualization here. So here's what would happen in runtime four: you have this sequence of sweep, mark, sweep, mark, sweep, mark segments, and of course, like a good garbage collection analysis, I don't even bother putting the program in there. The mutator is not worth talking about. We just like garbage collectors. So what happens when a thing gets allocated in the middle of a sweep? Well, at the end of the next mark phase, you will have proven that it is not reachable, and then in the middle of the next sweep phase somewhere in there, you're gonna get the chance to actually move it to be free. So that's gonna take you about one sweep-mark cycle on average to get to. What happens if I allocate something in the mark phase? Well, because of the snapshot-at-the-beginning thing, it's not gonna get marked by the end of this mark phase. So you really need to wait till the next mark phase, and then you can only collect it in the sweep phase after that. So now, you're gonna spend a cycle and a half doing that.

Okay, so far so good. Let's look at runtime five. Now, we have these merged sweep-mark phases, and it's worth saying that in this design, even though you sort of don't need to synchronize, the sweeping is mostly at the beginning and the marking is mostly at the end. So we can ask the same question: if I do an allocation in the kind of sweepy part of the sweep-mark phase, when does that get collected? Well, I have to wait: it can't get marked until the next big sweep-mark phase, and then it's the sweep after that before I can actually collect it. So I'm gonna wait a full two cycles to get it. On the other hand, if I allocate something during the mark part of the sweep-mark phase, that sort of ends at the same point. So the thing that was 1.5 cycles is still 1.5 cycles, but the thing that was one cycle is now two. And so we go from an average of about one and a quarter cycles to 1.75, and so it's just a lot slower, right?
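To see where those averages come from: if allocations are spread roughly evenly between the sweep part and the mark part of a cycle, the expected delay before something can be freed is (1 + 1.5) / 2 = 1.25 cycles in runtime four, versus (2 + 1.5) / 2 = 1.75 cycles in runtime five, so dead objects hang around roughly 40% longer on average, which shows up as extra floating garbage.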

So that's the problem. This wasn't known to the people who worked on the original paper, and the people who worked on the original paper and the people who worked on fixing it up are the same people, some of them, right? So there's a lot of conservation of thought between the different phases of this work. So anyway, that was a problem that we worked through, and we just fixed it. We came up with this mark delay patch, which didn't totally give up on all the design aspects of the nice VCGC thing, but it was like, actually, we have discovered as we've been doing experiments that these synchronizations we thought were super expensive are actually pretty cheap in practice, and having another one is just kind of not a problem. And so it adds another synchronization into the system, and it recovers the old kind of behavior for what the lifetime of objects is. But as I said, the excessive-collection-with-bigstrings story is still better in this than in runtime four, because the bugs have been fixed. So, as I said, one fix made it less aggressive, others made it more aggressive.

So what do we do next? Well, let's see what the results are of the mark delay patch. So again, here's the picture: the yellow is runtime four, the red is the original version of runtime five, and the green is with mark delay fixed. And it's a mixed story, right? First of all, it's still veering kind of far from the black line for the ones that are doing heap allocations, but it's better. And the off-heap stuff is worse, because in general, fixing the mark delay thing makes collection more aggressive. So that kind of pushes that part in the wrong direction. So that's not awesome.

So what did we try next? We'd spent a bunch of time fixing the implementation bugs, so we thought, let's maybe fix some of the design bugs, right? So one of the design bugs is that N / (K × H) kind of uses the wrong notion of H. The weird pathological case turns out to be where you're doing a lot of off-heap allocations, but your heap is really small, so this ratio gets totally blown out and now you just want to super aggressively collect all the time. It would be better for the H to be the sum of the two, the off-heap plus the on-heap. So we worked on that and we kind of improved it, and it seemed a little bit better, but now we started running into this closed-loop control system problem: we could get the behavior a little better, but we had these oscillation and tuning problems and stuff, so we weren't super happy about it.

So we kind of went back to the drawing board and just thought, okay, can we just design something more unified that's better, that does the whole thing? And we basically rethought the steady state analysis, pulling in both the off-heap and the on-heap stuff into the same set of equations and working through the high school algebra, and kind of surprisingly, it worked. It's one of these things: there's no deep reason that we know of that it had to work, but you throw the equations together and things like capital H, the things you don't want to depend on, still cancel out. So we ended up with a system that basically, again, had a constant number of words collected per word allocated, and you didn't have to look at the state of the heap. The difference between on-heap and off-heap allocations is that when you did an on-heap allocation, you needed one extra word of sweeping work to be done. Which kind of makes sense, right? Because the on-heap stuff you allocated has to get swept, and the off-heap stuff does not have to get swept, right? So that kinda makes sense.
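Here's a loose sketch of that shape of pacing rule, with our own names and placeholder constants; the point is just that the work owed depends only on what was allocated, not on the current heap size.

    (* Work owed to the major GC per word allocated, independent of heap size.
       The constants are placeholders, not the runtime's actual values. *)
    let mark_work_per_word = 0.5
    let sweep_work_per_word = 0.5

    let gc_work_owed ~promoted_words ~off_heap_words =
      let total = float_of_int (promoted_words + off_heap_words) in
      let marking = mark_work_per_word *. total in
      (* on-heap (promoted) words owe one extra word of sweeping each,
         because they will themselves have to be swept later *)
      let sweeping =
        sweep_work_per_word *. total +. float_of_int promoted_words
      in
      marking +. sweeping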

So how does this design work? Well, it turns out, kind of shockingly well. I don't know, the sense of relief I get from looking at the blue lines on here is hard to convey. It's just so much better and simpler and easier to reason about, and it just gets rid of this problem that we have spent years fighting, of trying to deal with how to handle off-heap memory in a better way.

So that's mostly the end of the things that we learned about pacing. There are a couple of other interesting things coming out of it. There are some new experiments about trying to tune the thing a little bit away from a pure steady state analysis. It turns out programs have a property where often, early on in the program, you're allocating lots of stuff that isn't going to go away soon: you're doing setup of your initial program, and the things you're allocating are mostly gonna stick around, and doing a lot of garbage collection then is kind of a waste of time 'cause you're not gonna collect anything. And so there are some new ideas about tuning that behavior so that you do less collection there and those programs are more efficient. But the big result was this thing, kind of getting all this stuff in line.

So I wanna talk a little bit about like what we learned from this whole process. Not just the process of like disentangling the pacing stuff, but in general, working through this like long and complicated set of problems that we ran into in getting the runtime to work.

So one of them is: benchmarking is hard, and I think the original paper leaned too much into the thing that papers often do of, what benchmarks should we run? You know, some of the standard benchmarks, the benchmarks that are easy to run. And that ended up pushing them in the direction of testing a bunch of relatively small programs that just didn't embody a wide enough set of the possible program behaviors you needed to see when things went wrong. And so being able to see more of those use cases, I think, was important and exposed different problems in the system. I think one interesting corner of why they did this is, they had a good idea that caused them to do a bad thing. The good idea is: long-term, there should just be one runtime. I think that was a good idea. The bad idea was to start off with OCaml 5 only having the one runtime and OCaml 4 only having the other runtime, because it meant it was really hard to benchmark the two runtimes against each other. And one of the first things that we did was to backport, so we had both runtimes with the same language, so we had the first opportunity to do like-for-like comparisons on real world programs. And I think they should have prioritized that, and again, when I say us and they, it's the same people to a large degree. And also not me, you know, I'm just good at talking about what other people do. But I think it would've been really valuable for more of that early evaluation to be on larger scale programs. I think more of these problems would've been discovered early.

Another thing is, once you benchmark and find out there are problems, debugging the problems is hard. I showed you one issue; there were like a dozen issues, and the bugs cover each other: the worst performance bug makes it hard to see all the other ones. But you know, Pokemon, you gotta catch 'em all. You gotta figure out how to get all of these different things. And then the other thing that we did is, we would find a problem and be like, "Ah, this is the biggest problem, let's fix it." And then we would develop a good solution to the problem, and that was a mistake. We should have developed bad solutions to the problems, because you can develop a bad prototype solution much more quickly and discover whether this problem is actually the important one. We were doing too much high-quality software engineering and not enough experimental poking at things. And we also should have spent more time dealing with the problems-covering-each-other thing by creating small, focused benchmarks that exposed in a clear way the different misbehaviors of the system, so that we could spend more time focusing and iterating on fixing those individual misbehaviors.

Another thing you learn is, sometimes you should go back to the drawing board, right? I think it's very tempting, and often the right thing to do, to just incrementally try and fix the thing that you're working on. It's like, the system pretty much works; yeah, runtime four was broken, it's been broken for a long time, and somehow here we are, still a running company, things are working, so maybe we don't have to fix that problem, maybe we can just make it kind of work. And in this case, it was like, no, actually, don't make it kind of work. It turns out sometimes you need to go and do a first-principles rethink of the whole thing, and maybe there's something fundamentally better available. And in this case, we got one, and that was very satisfying.

Also, backwards compatibility is really hard, right? Every time you change something, somebody's workflow gets broken. The load-bearing bug problem is a really serious problem: you have all these things that have been tweaked to behave in an okay way based on the system as it works, and then rolling out the fixes may break lots of production systems. So it's just a problem that you have to think really hard about.

So that was a long story about a nice paper with some good research and good ideas, and the painful set of realizations as you try and bring that into production. We're doing that again. So there's a nice paper called Data Race Freedom a la Mode. It won a distinguished paper award at POPL in 2025. And this is, in some sense, the second half of the parallelism problem. Part one is a runtime capability: I can run parallel shared memory programs, but without a programming model, right? You just get mutex, semaphore, condition variable hell, like, good luck, right? And we were not particularly satisfied with that.

And so a lot of work went into finding a bunch of type level guarantees, a kind of new set of type system extensions, that would allow us to write programs where you get a proof from the type system that your program is data race free. And then we've done a lot of work to make that happen. One piece of the work is propagating that type information through our libraries. One example of a new thing you can decorate a function with is that it's portable, which means you're allowed to move it to another core and run it on that other core. So you have to do a bunch of big-refactor stuff to write these annotations in the right places. You'd think that was the hard work. But the actual hard work is fixing the data races. Turns out, if you have a type level discipline for discovering data races and an old code base that does a lot of stuff, even in what is supposedly a nice functional programming language like OCaml, you'll discover a lot of dirty tricks and mutable state and all sorts of stuff where you didn't expect or remember that it was there. And the type discipline works: we discovered a lot of these bugs. And by the way, as a small aside, I have no idea what Python is going to do, right? There's this new free threaded Python thing coming, and it's, you know, a runtime capability, and I don't know what their programming model is gonna be and how they will survive as they try and do more. I was mentioning to someone earlier, I think they'll probably just YOLO it and be like, "Yeah, there are some data races." You know, Python's great for lots of reasons, but I don't know, it seems hard.

We're building lots of high-performance data structures. We're doing all sorts of interesting work on user-level scheduling. I like to say there are four or five things that are terrible about writing parallel programs. One is, it's hard to avoid data races. One is, it's hard to pick the granularity of the job that you want to parallelize: you pick it too small, you get swallowed by overhead; you pick it too big, you can't use all your parallelism. So that's annoying. One is, it's hard to actually pick the amount of concurrency you want, how many concurrent threads you want. So we're taking some ideas from the literature, stuff called heartbeat scheduling and various other kind of control system things, about how to dynamically adapt the amount of parallelism, to build smart schedulers so that users can just describe things in a natural way and stuff will just work pretty well.

Another bad thing about parallel programming is, it's hard to know why your program isn't fast. You got it to work, but why is it slow? So we're trying to build profiling and analysis tools, because parallel systems are really unintuitive: you have a shared memory system, but you don't really, there are cache lines getting shared left and right, and it's actually much harder to understand than you think, and we're trying to build tools that help people understand it. And then also building new abstractions for efficiently and ergonomically combining parallelism and concurrency. Parallelism is where you offer up to the system, here are a bunch of things, you should feel free to run them in parallel if it'll make it go faster. Concurrency is where you're like, here's a bunch of things, these can't block each other, I'm demanding that they run in parallel. These are kind of duals to each other, and they're complicated to integrate into a single system, a single scheduler. And we're trying to build good systems that I think do that in a way that's better than anything that we've seen out there. And then of course, just teaching people how to use all this stuff, right? There's a lot of new ideas and new stuff going on, and there's a lot of teaching that will need to happen.

So stay tuned. Maybe I'll come back in a few years and write a paper about how this one was dramatically harder than we expected, but we are working actively on this stuff. And actually, a lot of this is in the process of becoming more public. So all of this work on the OCaml compiler, both stuff we do directly upstream and stuff that we do on our own branch, all of it is intended long-term to be upstreamed. But long-term, I think, means a process that takes years, and we wanted to take the stuff we've built and make it more available for people to look at.

So we are publishing, actually on Friday: you can go to OxCaml.org. If you want people to know what a thing is, you gotta give it a name, so we've called it OxCaml, for oxidized OCaml. And it is going to officially show up on Friday. There'll be a website and, believe it or not, actual documentation, and the ability to install a compiler with an LSP that works, and the formatter, and, you know, the tools all fitting together.

OxCaml has been in sort of an interesting state. It is totally a production-ready compiler, it's our production compiler, we use it all the time, but it's not really meant for other people to use as their production compiler for real stuff, because we are going to break your stuff mercilessly all the time. The OCaml parts are gonna stay compatible and all of that, but the extensions, we're just iterating really fast and taking advantage of the fact that we control our own internal system, and we can automatically refactor stuff and get it lined up. For the outside world, I think it's great to experiment with if you're a researcher or an enthusiast and the code base you're working on is relatively small, but it's not a real stable language for a community, you know, a practical community, to build itself around. But we think it's exciting, and we invite people to check it out when it lands, you know, at 9:30 PM on Friday, I am told.

And that’s all I’ve got. So thank you guys very much.
