Safe at Any Speed: Building a Performant, Safe, Maintainable Packet Processor

Sebastian Funk

Jane Street

At Jane Street, we’ve been building systems to trade electronically for over a decade. As technology advances and the scale of the markets grows, we need our systems to be able to process ever growing amounts of data in ever shorter time windows.

In this talk, we’re going to explore how to build a highly optimized single-core packet processing system that is capable of processing millions of messages per second. We will see how to bridge the gap between the high-level abstractions we’ve come to love when structuring code, and efficient machine-level execution necessary to process messages at line-rate.

Transcript

Today I want to talk about performance. And I want to share some stories and talk about some experiences that we here at Jane Street, came across while trying to build high performance, low latency applications. And these are the kind of systems to try to kind of get as close as possible to machine level capabilities as you can, and try to squeeze as much out of a machine as you can. This talk is kind of a survey talk of a bunch of different techniques in various areas that we found helpful. It’s worth saying that I’m not the expert in every detail of this talk, so this is a presentation of a collection of a work done by multiple people here at Jane Street. It is also worth talking a little bit about how Jane Street thinks about performance before going into the talk about performance.

Jane Street doesn’t really think that our primary value ad or core business is around optimizing purely for speed or trying to be the fastest out there. But, as the world gets faster and kind of thicker, meaning, you get sent more data on bigger pipes in less time, often the only way to stay competitive operational is to become faster as well. Let’s talk about performance, and there is a lot you can say when you talk about performance. I want to frame this talk a little bit more, and I want to start up by saying what this talk is not about. This talk isn’t about choosing efficient algorithms or, picking or optimizing your hardware that you run your program on. Obviously, these two things are important if you are trying to build real world high-performance application, but I’m not going to into much details about these two.

Similarly, I’m not going to talk much about exploiting parallelism. Again, if you’re building a system with performance requirements like this, you have to think carefully about the distribution of work between different aspects or different parts of your system, but again I don’t want to talk much about it today. Finally, maybe somewhat surprisingly, I don’t want to talk much about the comparison of languages. It is clear that OCaml is the best program language, and this talk is also going to use OCaml as the language to explore or discuss certain topics. But I hope that this talk is more broader than that and sort of has applicable lessons in any programming language.

And I’m not particularly interested in arguing that OCaml is the best language for this particular thing, I kind of more think of it as, this talk is, given a set of constraints, how do we achieve the thing, and the thing as being as fast as you can. One of the constraints that we have to deal with is, we are going to use OCaml. I don’t only want to, at the end, convince you and say, “Whenever you write performance sensitive code, you have to do it in OCaml”. I mostly want to convince you that, you can do it in OCaml, and if you start using OCaml then you can make this work. And, it is not a reason not to use OCaml, but I’m not trying to defend OCaml over any performance language.

Okay. Now that I have said what the talk is not about, let me briefly state what the talk is about. This talk is a case-study of single-core performance for a packet processing system in OCaml. I think of this talk as given multiple constraints for one, we’re be operating on a single-core, there’s no parallelism. We are trying to do packet processing, which if you don’t know, just means feeding packets, network packets into a system doing some non-trivial but not highly complex thing with those packets, with that data, and then shipping that data out of the box again.

And finally, the constraint is that we’re going to use OCaml. The talk is about while given these constraints, how can we make it work, how can we be as fast as possible, how can we can squeeze out every little bit of performance that the machine might give us. It’s worth saying that this is a highly specialized application and for many, many systems, including the many systems that we build at Jane Street, the kind of, “We want performance over anything” approach is often wrong, but for this particular talk, in this particular case study, we want to say, “No, we actually do want to get as fast as we can and that performance is the thing we care about most.”

To briefly outline the talk, I want to start out by talking about a motivation for the problem of like, what are trying to solve? And, also doing a bit of analysis to figure out what does fast mean? How fast is fast enough and how fast would be too slow? What kind of performance and what kind of latencies are we talking about? Then I very briefly want to talk about OCaml, and some of the choices it makes on the programming spectrum. Again, I hope that this doesn’t anchor you too much to OCaml and in fact most of these things also apply to other programming languages. And I think on the spectrum, OCaml lands on a very similar spot than other languages, at least some other languages.

And then, I want to talk a bit about some coding paradigms that you come across that have a performance impact, and you have to change how you think about them when you are trying to do this high performance low latency stuff. And finally, this time I want to talk about slightly bigger scoped examples. Okay. To start out by motivating the problem, we here at Jane Street, have built and are continuously building systems that deal with market data distribution. You might say, “Oh well, what’s market data?” Here’s a brief overview of what market data is. In the world of finance, which Jane Street is in … I hope that’s no surprise … the main core thing is, there’s an exchange.

An exchange allows you to trade things, and the main function of an exchange is to keep what’s called, a book plus stock. This is a trivial visualization of a book. And here sits the exchange, and we are interested in the newly IPO’d and hip and cool LYFT company there. It IPO’d a few weeks ago. The exchanger’s job is to maintain a book or a list of people’s willingness to buy and sell the stock. So, here there’s a couple of people willing to currently buy, LYFT Jane Street is here with a quantity of a 1000 of this price and maybe Merrill Lynch and Goldman Sachs are here too. And, on the other hand, people are willing to sell, similarly on this side. So we are going to pick… Exchanger’s primary job is maintaining a data structure, to keep this information and that people interact with this information. All right.

On the left-hand side we have, what’s called, the order entry side, which lets people to market participants, like Jane Street, send orders into the exchange. And, on the other hand we have what’s called, market distribution, and this is where we want to focus on. This is the idea that, “Well if Jane Street sends orders into the exchange, the exchange keeps it to it, adds it to it’s book.” But then, it also sort of broadcasts to the world saying, “Someone is willing to sell LYFT at this price”. From an important finance aspect is that, this part of the stream is anonymized, the world gets to know that someone is willing to sell but, the world doesn’t get to know who. Doesn’t even matter from a technical aspect for us though.

Similarly, if there’s another order that comes in, that happens to cross the book, the world gets to know a trade happened. Again, the world doesn’t get to know exactly who traded with whom. But, the world get to know, the trade happened. Okay. This is all for the market data, and the problem you are concerned with is building systems that can efficiently handle market data distribution. So, market data is going to be the fire hose of activity, of everything everyone is doing around the world on any exchange.

So, that’s a lot of data, and it’s clear that you have clients who very much care about this data. Primarily, you might have automated trading systems who make trading decisions in an automated way, and send orders to exchangers based on this information. Right? They get this feed of what’s happening in the world and they will base their actions upon that. It is bad if that feed is very, very slow or inefficient because your automated trading systems will make bad decisions. There’s lots of other systems, like risk systems, or monitoring systems or studies historical tools, but it’s clear that all of these care about accurate and very efficient representation of this data.

The problem we are faced with is, how do we deal with this fire hose and how do we distribute this fire hose efficiently to clients internally? I want to talk a bit more about some numbers, like what does this fire hose mean? How big is it? And how much data is there? If you take NASDAQ, one of the biggest US equity exchangers, there’s roughly a billion messages of these market data updates of here’s a new order, here’s a new trade a day. And, it peaks at three or four million messages at the close, which is at 4:00pm every day, which is traditionally the busiest time of trading. There’s not just NASDAQ, there’s about now dozens of equity exchangers in the US. There’s lots more worldwide. And, there’s not just equity exchangers, but also lots of other derivative exchangers.

So, it’s clear that there is a lot of data to consume and take in and feed to your systems. Now we got to go a little bit further and look at some actual numbers to try and figure out what does this mean for us, how fast do we actually have to be. But, before I do so, I want to, very briefly, remind people, in case some have forgotten on some terminologies on latency. Here are some latency numbers, we start at the very top with one picoseconds which is 10 to the -12 per second which, according to Wikipedia, is the fastest switching time of a transistor. And then it goes from picoseconds to nanoseconds, to microseconds and to milliseconds. And, at the bottom here, we have 300ms, a blink of an eye.

The area we are mostly going to go concern ourself with is nanoseconds and microseconds. This is going to be the area we will be talking about most. Now, I want to think again about well, how fast do we have to be? So, let’s look at a bit of data. Let’s look at the following, again we are going to focus on NADAQ to speak US equity exchange, and again we’re going to focus at the close, which is 4:00pm, which is the busiest time. Now, we are going to run the following experiment, let’s say we assume we process every message strictly sequentially, there’s no parallelism, and let’s assume that it takes us any of these times up here, to process a single message. The question is, well, how long, how big does our queue ever get? What’s the maximum queues that we see. In a 10 seconds window, from this time backwards.

If we look at the first row, well, if you look at this particular row here, this tells us that, if you can process messages at 750ns per message, and we process all of them sequentially, the longest the queue ever gets from this timestamp 10 seconds backwards is 22 messages. Well, that’s some information. Now, if we look at the full data set, which includes this line, which is the most interesting, this is the actual close, we can see two interesting phenomenons, one is, there sort of non-linear growth here. You can see that if it takes 750ns to process these messages, maybe things are still, kind of, reasonable.

But, as soon as it takes you longer, microseconds or more, you’re starting to buffer up a lot of messages. The point is that, if it takes you 10ms to process each message, you’re going to be buffering a 1 million messages in your queue, which might mean that you are buffering the full messages that you are get at the close, while you are still busy churning through the very first few that you saw. So this is a very non-linear growth that you see here.

The other interesting factor that you see, is the effect of falling behind, which is here. You can see at this time, which is then after the close, which is usually a very, very quiet time of trading, not very many things trade after 4:00pm. Most of these queue sizes have normalized again, except this one. Because you’ve queued all your messages at the exchange, that the exchange hand you at this time, you’re still busy churning through half of that thing that happened at 4:00. Now, any message that comes in afterwards, gets delayed by these in backlogged.

This is the idea of falling behind when you are trying to build something that is real-time, keeps up the messages but kind of falls behind, new messages will experience very bad latencies. If you look at the status slightly differently, and again think about, per message processing time. But now, look at the maximum delay that any message can experience in our system from the time it enters our system to the time it gets delivered to the client. We see a very similar and confirming thing, that if it takes you below 750ns, things are fine, but if it takes you over a microsecond, you are kind of dead. You can’t do that.

The interesting bit is here. If you scroll this up to the log scale, you can’t see this fact very much. Which points out and concludes for us that, “Well, if you can build the system to, in the end, process things below 750ns, you don’t have to worry about it. Anything above a microsecond is not acceptable.” This gives us a bound of where we have to land. Now that we have this in mind, we know well, this is the requirement we have, we have to be this fast. But, I briefly want to mention two other requirements that, I’m sure every software developer somewhere in his head has, but I think I want to motivate that, especially here, which is, well, we have to be fast and scalable. But specifically, we want to be flexible and maintainable.

And this comes in, in this world, from the fact that, if you connect a lot of exchangers, it turns out they all speak different protocols and they all have their own argumentative way of doing things. Some of them make sense and some of them make a little less sense, but you can’t have to build against that protocol. What you want is, you want to be able to keep any exchange weirdness or specifics at the edges of your systems, and the complexity of these things at the edges of your systems. While you want to build this very fast and scalable system, you also want to build it in a way that you can reuse a lot of code and then share things, and only do specific things at the edges.

And similarly, you want it to be maintainable because exchangers change their protocol all the time, sometimes there’s a new exchange, sometimes there’s a regulation that changes things, and you want to be able to act quickly on this, as well. You want to be able to go in and make small surgical changes, quickly and safely. The rest of the talk is about, well, we have to be this, but we also want these two things, will you be that one to give the two things up? How can we do that? What kind of techniques do we have that we can achieve this with? By the way, if you feel that you have any questions, feel free to interrupt, I’m happy to answer them.

Is it an implicit requirement in your earlier analysis that the processing time is constant, correct?

Right. The question was, is the processing kept constant? And in my early analysis is, yes it was. And I think we will see a bit more later of where that assumption comes from. But I agree. That assumption was there.

You mentioned earlier that, process available messages from NASDAQ, you do all one billion or you cut down on some of them?

Oh, the question is, if you cut down messages… You can, you can try to, but often the answer is, at the input of your system, there has to be a system that processes everything, and then depending on the transfer mechanisms you use to shuffle out data declines, hopefully the client only subscribed to the stuff that they care about, but you probably have to have some system within your firm, that can keep the flow fire hose. Cool. Okay. Now that we can understand the system requirements, and we understand what we are trying to do, and how fast we are trying to be, I want to talk a bit about OCaml and what OCaml means in this world.

Probably as most of you know, OCaml is a functional language with a strong static type system, which means the usual function of programming things, you have higher order functions, you have immutability, and you have type influence. We really like these factors and these features. Higher order functions are like a very useful and cool way of structuring control flow in a way that makes more sense to you. OCaml also has some support for imperative primitives, so you can do some procedural like coding, you can have some side effects, you can have immutability. In fact, I have heard rumors that the author of the OCaml language once called OCaml a imperative language with strong support for functional paradigms. So OCaml sits a little bit between the two. On the fundamental level, on the one time, OCaml is, kind of two things, there are things called immediates and everything else which is a boxed value.

And roughly distinction is, immediates are fast and you can store them on the stack or in registers, and they usually they’re just hints or everything else that doesn’t fit this criteria is a boxed value which is stored in the heap. The cortem is just, immediates are fast and boxed values are quite slow. Similarly, OCaml has a generational incremental garbage collection, which maybe it means something to you, maybe it doesn’t, it indicates that, it’s a with the memory manage language, so there’s actually some garbage that takes care of it for you, you don’t have to free and allocate memory yourself. It has it generationally, meaning it has a first generation, if things survive that generation it gets promoted, which is similar to things, for example, Java.

And incremental means that it can do sliced bits. But, it is a stop-the-world garbage collector, by the way, if you know what that means. If you have to do garbage collection, it stops your program, it does some, and then it resumes. There’s no parallelism between the garbage collector and whatever you are trying to do. Right. And finally, OCaml has a way to interact with non-OCaml code through what’s called the foreign function interface, especially C-code, in a fast and very low overhead way. And we’ll see that in a bit why that might matter.

Again I want to stress that these things, some of them are OCaml specific, but I think some of these choices are quite similar to other languages. And I think, specifically, the code that we will see, that we have to write, as a result of this, is quite similar between OCaml and, for example, a language like Java. In fact, I think NASDAQ itself is written in Java, and probably in a Java that is not that dissimilar to the OCaml that we are going to be looking at. Okay. Let’s look at some of these program paradigms and think about performance in these ways. There’s two things I want to talk about, the first one is the problem of garbage. Garbage is going to define as you create a temporary storage for a calculation, you store that calculation there, and then you maybe use it again, but afterwards you don’t need it anymore, so if that’s memory allocated, you don’t need it anymore.

That’s garbage. And at some point the garbage can is going to come along and take it away, and sort of free it up. If you look at a piece of code, some OCaml code, I hope no one is too afraid of some OCaml. I think, it is pretty straight forward. You have some function that takes two arguments and it maps over a list of this, one of the elements of the list, it maps over it. And the function, the price that we’re mapping is multiplying every element by this multiplier and then it maps again and then it [inaudible 00:19:22] this list. Some very straight forward function, nothing super exciting.

If you think about the sort of allocation behavior of this program, though, you can see some obvious and maybe some subtle problems. I want to stress that for most of the applications who write, writing this code is just totally fine. And you don’t have to think about allocation at all, but if you’re trying to do this really, really low latency sensitive stuff, you have to kind of carefully think about this. So one obvious source of garbage of kind of unnecessary allocation for temporary storage should be thrown away. Is this list up map the way this works OCaml like it takes your list, which is a link list usually, and then just kind of copies it. Now you have a list of length, and now you’re just going to create two copies here, which obviously creates only to be used once and then thrown away so that creates a bunch of garbage.

There’s also more subtle introductions of garbage here. One is this function F, I cleverly constructed to be, a closure because this function, this value multiplier is not bound here, which means you kind of, the OCaml run time has to allocate a bit of memory to represent us, right. This is kind of OCaml specific, but it kind of makes obvious that it’s not always easy to spot where your program kind of allocates in this way.

And another kind of subtle way of allocation is the fact that these are floats to them. Stalling in this list might mean that if you reason about the allocation of this program, you might be surprised because floats have to be kind of stored and they all kind of run time in a specific, in kind of expensive way. So that goes to say, “Well, it’s hard to reason about the allocation. Sometimes it’s obvious. Sometimes it’s hard to find source of allocation.” But you might sort of ask, “Well, why is this even a problem? Why is this slow? Why should I care about allocations?

So the first thing is, well, the allocation itself, isn’t slow. Usually the allocation is pretty fast. In fact, in OCaml in the common case, the allocation is two or three at 70 instructions. It’s just an increment and a check for overflow. That’s it. So the allocation, isn’t the problem. It’s fast. Sort of the obvious problems, of course. Well, the garbage collection is slow. It’s kind of the obvious trade-off you make, you don’t have to worry about managing member yourself, but every once in a while you have to be pay a sort of a cost of like, “Oh, now the garbage truck has to run and do a bunch of stuff. I can’t do anything.”

And that is true, but it’s not the only thing here. If you assume that garbage collection is the only thing that’s problematic for performance in this case, if you don’t think about performance, kind of as a graph or as a distribution, you would assume that your average or your median is kind of unchanged, but every once in a while, your garbage collection tricks in which causes the tails to look bad, right? The idea being like on average, do you process messages find no garbage collection, great. But every once in a while you hit a garbage collection, halfway through posting a message and now the performance of data is really terrible sort of tails of your performance distribution looked bad.

It turns out this is true, but not the only fact. In fact, the media in itself is also affected by this kind of garbage-y programming. And the reason for that, is a little more subtle, but it’s to do with the fact how caches operate at the operating systems level. If you remember, there’s multiple, usually as multiple layers of caches level one, level two, level three cache. And the operating system is clever about pulling memory references into those caches to hide or mask latency going out to the main memory system from you.

But, well, if you allocate a whole lot of stuff, it puts all it in cache, but if you never use this thing again, only use it once you can have locality of these caches is very poor. If you allocate a lot and sort of throw it away quickly afterwards, you start to see a lot of cache misses and you kind of pay for this price twice because you once wants to pay for it when you allocate, because when you allocate something that’s quickly going to come garbage, it kicks something else out of the cache that might be useful.

And then you pay again when the garbage collector runs over the live sets to sort of to figure out what to allocate, what to collect. Because again, it’s going to kick out a bunch of stuff that you cared about in the cache and kind of made your casual [inaudible 00:23:20] bad. So we kind of have these two main reasons the arbitration itself and the effect on cache, cache as being problems. So now the question is, well, what do you do about it? Well, the answer is simple. If allocation is a problem, well, and you just rewrite your code to not allocate, and then the problem kind of goes away. And this is usually what’s called a zero alloc coding paradigm. And the idea here is you just write your code in a way that it doesn’t allocate on the critical path.

You can allocate anywhere else on startup on kind of error cases or corner cases. You can allocate those, you shouldn’t allocate under sort of the critical path of the path of your system. So the way you do this, is you keep flat and simple representations. You don’t use abstractions like you bind on sync deferred and a monad and all kinds of fancy stuff. You just, usually you have some IO buffer, you read some memory into this thing you see over the IO buffer in the right places, adjust some data structures and you write something out and that’s it. So it’s usually a very flattened, simple hierarchy and representation of things. Similarly, you use sometimes they use an external pool where you do your own explicit memory management, instead of letting the Java script to do it for you. This way you can allocate objects during the, on the critical path, but you don’t have to pay this price of the garbage collector.

And this is why I mentioned the function for interface earlier of calling C code from OCaml, because the way you did it, as it turns out, as you call it some C code in your program, and now this C code access memory that is outside of the control of the OCaml garbage collector. And then if you’re clever about it, you can sort of lay out the C structures you have to look like actual OCaml values. So should the OCaml runtime can read them and pretend they’re OCaml values, but a garbage collector kind of never sees them. You also adopt usually a synchronous workflow, because it sort of makes things easy and there’s no lock contention overhead. And you also use a technique called pre-allocation if you don’t want to use this explicit Marion management.

So if you step back, you might say, “Well, that sounds a little hacky, and that sounds a little, well, low level, perhaps.” And that’s true, but I kind of want to stress that you really only have to do this in this small subset of your code and the critical path of your code. The vast majority of your code still can use the garbage collector and sort of which is in fact, nicest way as you can.

You might also say, “Well, if you do like stuff like this, why don’t you just write the whole thing and see.” And I know I didn’t… I said, I don’t want to compare languages that much, but here I go anyway. I think you still get a bunch of nice benefits from using OCaml here. For example, you get the type system and a very expressive way that lets you kind of encode a lot of safety properties, that you can’t otherwise encode, I think as easily. And you, I think also maintain a bunch more memory safety, than you do in other languages.

So while this might feel a little low level, I think it sort of strikes a balance between low level, but also gives you still a bunch of tools that you can use to maintain kind of code you like to maintain and look at. We see a bit more of this in the examples afterwards. Okay. One kind of thing I talked about is garbage and why is the problem and how do you avoid it? So the second kind of paradigm I want to talk a little bit about is the idea of function calls. So a function called obviously if you sort of compile it down, whether it’s just an assembly, a jumper instruction to some other pet peeves in your code.

So you might say, “Well, why a function called slow? What’s that? Why is that a problem?” And then you might remember some operating system class of like, “Oh, actually, you have to preserve some registers and, you might have hit an instruction cache miss, right?” Depending on your calling conversion, maybe you have to move some things around or maybe you like instruction cache isn’t perfect. Maybe you get some misses there or maybe you have too many functions arguments, and then you spill over registers and there was a whole like problems and things to think about, but it also turns out that the operating systems architects are usually very smart people.

So these problems have kind of gone away or at least become a lot less noticeable. For example, brash prediction is usually pretty good. So you usually don’t have to worry about it that much. And in fact, it’s probably, if you have sort of a function call in a tight full loop, you might be able to start to measure these and start to notice these being a problem. But in general, these things are not going to dominate your performance probably. But as two sort of bigger points of why function calls can be a problematic thing. One is there’s problem of a closure allocation I mentioned earlier where the one-time has to allocate extra memory to be able to represent your function call. And that kind of brings us back to the first problem of having too much garbage and too much allocation.

But then it’s also this problem of having fewer compiler optimizations. So you open compiler by default is fairly simple in terms of optimizations that does, and having a lot of function calls makes it even harder for the compiler to do anything clever. Now that we’ve kind of seen function calls and why they are problem, what do we do about it? Well, we could resolve to do the thing we did in the other case, and just never use a function call, just have one big procedural main function. But that will obviously be terrible. So we can’t quite do that.

Instead we’re going to enlist the compiler for help. We’re going to use a compiler optimization called inlining, that tries to get around this problem. So the compiler optimization … Inlining, it kind of sounds a little deceitfully simple because the optimization is just, if you have some function here, G, that does some stuff well to this example, I call it choice once were true. And once was false. The way you apply inlining is you kind of take the body of G and just paste it into the call sites.

So instead of having this function of in G you would just in this immediate G true and substitute the body with flag equal to true, and then you do it again down here. Yeah. So this is kind of a simple textual substitution of taking a code from one place and moving into the other. It’s not quite that simple because maybe there’s some mutability and state around, but in general, the ideas kind of disappear. And now if you look at the code down here on the right, well for one G is gone and G again, I cleverly wrote as a closure and because of this arc over here, so there’s no more closure location. Great. If you then look at this piece of code over here well, there’s some pretty obvious optimizations that even the OCaml compiler could very easily and very quickly do. Hooray.

All right. So, all right now we’ve gotten around this problem. We can use function calls everywhere in our code. And then the compiler is going to come in and inline everything away and it’s going to be great and awesome and super, super fast. It turns out it’s sadly also not that simple because the sort of trickiest bit about getting inlining right. And this is what’s sort of the tools and compilers team at Jane Street I think spends most of his time on, is trying to come up with good heuristics about when to inline and when not to inline, because it turns out if you just inline all the time, your program is going to be a lot slower than if you hadn’t inlined at all.

So really the trickiness here comes from when is it a good idea to inline? When is it beneficial to inline versus when is it not. Cool. All right. So now we’ve kind of seen these two coding paradigms. We’ve seen function calls, and we’ve seen inlining as these kind of two things to think about. So now I want to talk about two sort of slightly bigger scoped examples and see how things like this come up. The first one is the beloved option type. I used to give a separate talk about why option type itself is such a useful feature in how any language that doesn’t have something like this it’s just flawed inherently and half the time maybe it’s very similar and sort of the very short argument of that talk is the idea that this gives you the type safety of avoiding non-point exceptions entirely, right?

So in lots of languages, you can have a thing and some of some type, but oh, it can also be no. And you better remember to check it now, and it’s annoying and like sort of countless bugs, while in this, language, you kind of with the option tab, you explicitly state when a thing can be absent and when the thing is guaranteed to be present, right? So if you have something of type T then you know what’s up type T it’s not just T on all it’s T. So this type is really, really useful and a really useful feature in languages. And I think it’s becoming a lot more popular in languages that don’t have it and try to crawl it back somewhat. So if you look at a bit of code that does some stuff, it takes some arguments and it tries to take an order and validate it against the market and receive if it’s okay.

Well, it does some stuff and it tries to find the most aggressive price currently in the book. And if there isn’t an object to the order, otherwise it kind of does a check in sense. So this is pretty straightforward. But of course, there’s fine most aggressive price function. We turned that option because there might not be any pricing in look, right? There might not be anyone willing to buy or sell in the market currently. So you kind of have to specify what to do in that case.

And now if you kind of squint a little and think about how you’d write that in Python, it’s pretty obvious and pretty straightforward and will look very similar except the fact that you probably have to remember that, well, the price you would get returned here might be null. And you probably have to have some like, “If null, then there’s this otherwise not.”

And the problem here is well now you do develop hadt to think about that, and you have to be clever to remember to do so. And there’s nothing in sort of inherently that reminded you to do this well. With the option type and the type system there isn’t a way to get the price out of the function return rather than disrupting and matching on it, which kind of explicitly forces you to specify both things. If you omit this non case, the compiler’s going to at least warn you most of them are out. So this gives you the nice feature of kind of eradicating the possibility of a null point intersection. Cool, great, awesome.

Except, well, if you benchmark the thing and if you kind of assume all these functions that I called here, do nothing. Well, it turns out if you do nothing, it’s fast, it’s only takes four nanoseconds, but then allocates four bites. That sucks. If you do like a bit of back of the envelope math, if you take like four bites per message. And if you’re setting your minor heap size in OCaml to swing like 256 kilobytes, which I think is pretty normal, then you kind of have to run a GC every 64,000 messages, which if you’re trying to process million messages in a second is not that much. [inaudible 00:34:00] is a problem. And by the way, I should say the price itself, you can represent as an integer. So like the price itself is fast immediate, but still the option makes it kind of makes you have to allocate. And that’s frustrating.

So what do you do to avoid this? Well, if you think a bit about it, you could probably call it the same solution that we did. You build something what’s called immediate option, where you take the univere of things, universe of integers, you take one value out of the universe, and sort of say, “Oh, that value is now just none.” You can’t use it anymore. And then the rest of the values are just sort of immediate integers, and then you put a little tactful interface around it and use something like this. So that’s pretty straightforward and avoid having to allocate. And who’s going to have a price of min value anyway, no one. So fine.

So then if you read the code, well, it looks a little bit like this. You still have this find most aggressive price, you reject if it’s not there, otherwise you’d lose some checks, right? So the coat’s pretty similar. And now if you benchmark it, hooray,, no allocation and you kind of succeeded in eliminate this allocation. Great. Except that it’s kind of obviously a lie or obviously a deceit. I just kind of raved to you about how awesome this option type is and how we’ve eradicated the now point exception forever and how we laugh at other languages, where you have to remember to call, to check if the thing is null or not. Well, now we have to do it ourselves because kind of here, we have to kind of make sure we check is none. And, if it’s not, well, we can use this unchecked value function it gives us the value.

But now as a sort of reveal, you have to scratch your hat and think, well, if this unchecked value function is here, is there always guaranteed to be safe? Is there somewhere to where our check for is none in this case? Right. We kind of back into the world of non exceptions and, kind of having to think about yourself as a developer, having to remember to check for none is none. That’s frustrating. But it turns out you can play a trick to avoid this.

Is this the allocation in the previous slide?

Sorry. I think it’s not obvious, but it’s just the way that the OCaml option type is implemented is I’m using a box value, so you have to allocate for it. By default the OCaml option, like an integer, you can just allocate represent on a stack, on a register, but as soon as anything is an option, you can’t. That’s it. So turns out you can play a little trick with syntax to avoid this . And, in the end, your code’s now going to look something like this, which is like, it looked in the first slide, except there’s little percent optional.

And this percent optional is well it’s PBX, but it’s sort of more generally, it’s a P processor that is applied before the compiler compiles your code and the P processor kind of takes this code you’ve written and expand it into something else to compiler than that, kind of looks that. And the thing it expands to is kind of cool. It has… So then if faults branch, where you take your efficient price immediate option, and turn that back into a normal option, and then you match on it in the normal way with none in some, and then if it faults… So in the true branch and the out branch, you kind of have the code that we had here, which is a fast and efficient way of doing it.

And the sort of neat trick in this is that the tack checker in OCaml runs before any debt code emulation. So you get your nice normal compilation errors. And if you omit the non case, it’s going to say, “Oh, you’ve omitted your non-case. You have to put something in there.” But the code that runs at runtime is the fast and non allocating version. So now we’ve kind of had this thing of we’ve wanted this type safety of optionality. But we also didn’t want the allocation behavior. So the way we kind of got there and combined these two things is by playing a [inaudible 00:37:50] on top of it.

Right. Now I want to talk about one other example, which is slightly bigger for which I want to look at it. Can you read this or is it too small? Is it too small? It’s good, or is it small. Well, mixed answers so it’s too small. No, I guess it’s not too small. Sorry. I think that the idea is pretty simple and all the code details probably don’t matter that much. Let’s say we want to do something straightforward. We get some message from an exchange and the message of something like this. There’s a bunch of identifiers, some symbols, and it also contains sort of a list of other things that has like some prices and the side to it. Fine. All right.

These are like some arbitrary messages and exchange kind of sense to us as part of my data. And now let’s say the thing we want to do as well, we want to iterate over all these messages we get and then throw it over all the book entries we haven’t done. And whenever we see and book entry that has the side set to, let’s say, sell, we want to count up the size we have. So at the end, we want to kind of want to know how much size there is in this total market industry for selling.

Say you wanted that. I’m sure if you kind of looked at us and spent 10 minutes writing some OCaml, you could kind of solve this problem. It’s pretty straight forward. But also, if you solve it in this normal naive way, if performed terribly and it will kind of perform terribly for two reasons for one, is every time you get this new nice message from the exchange, you have to allocate this record type, which is similar to sort of as a struct and see which means you’re going to allocate the whole ton of stuff. And you’re not going to use very much of these things, but you still going to allocate everything. That’s terrible.

And similarly you kind of, not only do you allocate you also pause way too easily, right? So the message is going to come down on some binary format to you, and you’re going to convert it into something like this, which means you’re going to take a look at all the timestamps and pause all the timestamps out of this binary format, but you’re never going to look at a timestamp. So that’s a shame. So you pausing too easily and you’re allocating way too much for this to be efficient, right?

So what do you want to do instead? I think that the actual thing you want to do is sort of take a buffer of IO, where you can just write stuff in, have your network card, write the packet into that buffer, seek to it in few right places adjust pointers and you’re done, right? That’s you got a much more straightforward and much simpler in the flat representation of how to do so. So let’s see how we might want to do that. So we could do something like this. Where we do a bunch of stuff.

And let’s look at this. So we have like some IO buffer, and now we can vet some interfaces there. Well, if you want to get to a exchange identifier, well, we take the buffer and we seek to position five and we get [inaudible 00:40:52] fine. If you want a sequence number, you go, of course, to position 13. So you can do this and it’s now fast and efficient, but it’s also terrible to read or write or exist. I don’t know if the protocol changes, I don’t know how to adjust this number 13. I’ve never thought about it. I don’t know what it means. So clearly this is not really maintainable or usable. But again, you can play a syntax trick if you have something that’s a fast and efficient, but kind of unmaintainable or sort of it doesn’t have the properties about safety that you want.

Well, maybe you can play this index trick. In this case, you can, instead of writing this code or reading this code, you can just write something that writes this code for you. Great. And this is what we did with a library called photogene which is just a code generator where you specify in a little DSL kind of how your message looks like, and you specify kind of what parts go in which order in this message.

And then there’s library goes away and writes this code for you. And in fact, I never go this far unless I showed you, that’s why I don’t know where number 13 comes from and I’ve had sort of photogene write this for me. And now you have kind of benefit the benefit of the thing is fast and you don’t have to actually maintain some horrible code, but you still get to maintain pretty nice and easy code. Now to kind of prove to you this is not all wrong. I wrote a small benchmark where the code, I think is a little bit too much. I’m here to fully read and understand, but you can hopefully trust me that it kind of does the thing I said, hopefully.

And if you look at… I’ve written, two limitations, one using the simple types that I had defined and one using the efficient types of had defined. And if you look at them now, then don’t look at these implementations at the implementation here, it doesn’t look a lot worse. But then if you look at the benchmarks, you can kind of very clearly see an effect here. You can see that the efficient posing of message takes 700 nanoseconds immediately on this laptop here. So it’s not the world’s best technology. But the efficient world is quite fast while the implementation is for five miles, so the five microseconds is quite slow, right? And again, you can see here, these are minor with allocations per run, so you can see the thing does not allocate but this thing does. Right.

We can see that this has sort of efficiently, sped things up for us. But another, I think interesting point in sort of a more meta point here is that, if I had shown you this slide at the very beginning of this talk, you might’ve said, “Well, sure, like you’ve spread things out by like a factor of eight,” but you also had to write like some library generate code for you and a DSL and all this nonsense, is that really worth it? And the answer to that in general is I found very hard, very tricky to reason about, but in this case, it’s kind of clear because of the analysis that we did earlier. We kind of know that well, five microseconds is just not acceptable. It’s just way too slow.

I think as a metaphor and this kind of doing this kind of precise analysis early to figure out what numbers you actually need for your system to be acceptably fast helps you guide, make these performance optimization choices later on, because it was always a trade off between writing faster code, but also spending more developer time on maintaining or coming up with it.

So to wrap up, one last thing to say is, we have written your application and it’s kind of now super fast and you’ve written it in all the right ways. And it sort of doesn’t allocate in the right places. And then the compiler has in all the right things for units. It’s awesome. And now you go and try and run it for the first time, somewhere in production in CE.

And it’s obviously super fast. And then it turns out it’s not fast at all. It’s super slow. And now you kind of enter a fun third world, but we’re not going to talk about, but I find very fun to think about, which is kind of the world of deployment. And these sort of tricks you can play here to be fast. For example, you might figure out that, well, if all your application does is like read these packets, do there’s a bit of analysis on them and then write them out again maybe the majority of time it spends is in the operating system stack and not your program. So maybe you want to do something to improve on that. Or you run into this sort of fun fact, they called the cold hardware tax, where you have the sum of them of, or if you make a system do very little it’s very slow, but then if you make the same system, do a lot, it’s a lot faster at doing so.

And only sometimes you control that. Sometimes you don’t control exactly how much your system does or how much a system that you talk to does as we kind of had to reason about how to avoid the cold hardware, where you can, the cold hardware tax. And there’s a myriad of other things you can treat, you can think about network topology and definition to cause and all kinds of other fun stuff that are really, really important and getting your system to run fast. But we don’t kind of talk today. So instead I kind of want to wrap up and talk about a few takeaways that we had from kind of trying to do this and exploring this the first one I think I already said, but it’s very important to understand your requirements.

And it’s very important to upfront, be very clear about what is fast and what is slow and where do I draw the line, right? It’s kind of almost impossible to make these decisions afterwards, if you don’t exactly know what numbers you’re aiming for. You also kind of have to understand your language again. You don’t have to use OCaml, of course. I encourage you to, but you don’t have to. But you still, whatever language you choose, you kind of have to understand carefully. How does, this language translate to machine code and how, and how its run time operates and such things. And you also have to understand the machine itself that it runs on. Another thing I didn’t talk about at all is you have to understand the tools to poke at the system that you vote, or measure the system.

So it is surprisingly easy to benchmark the wrong thing and be confident about the conclusions you have of it. And then only later to realize the benchmarks were nonsense. So actually getting a good and precise benchmarks or performance analysis out of your system isn’t trivial either. I think separately, if you want to build a system like this, it is kind of impossible to take a big complex system and say, “Oh, let’s make it fast.” And it’s just about possible to do the opposite of take a system that does nothing and make sure it’s fast at doing nothing. And then kind of intuitively add features to it. And every time you do, you kind of make sure it still is as fast as you want it to be.

And this kind of goes to say, you have to consider the performance in your system from the very beginning and not sort of as a later add on feature. To be able to do this. You also need to measure constantly. Most of these things that I talked about are kind of fragile things that might change the next time we changed the OCaml compiler or the runtime version. So you have to have good regression checks and benchmarks in place to make sure you catch these kinds of things.

And finally, I think just sort of maybe to us slightly surprising takeaway is you can do this and you can kind of hit these performance goals of being below a microsecond with using kind of standard and safe programming languages. And we don’t have to always resort to writing SMB code as we originally had feared. In fact, you can do this with a sort of standard language or at least a high level language like OCaml. Okay, cool. Oh, I think this is all the questions I had. Sorry all the things I had. Does anyone have any questions? Yep.

How synchronous is the system?

Usually very synchronous. I think the systems, at least that I’ve seen or touched most operate in this way that you have this packet, you read it in, you operate on it and until you’re kind of done with it, you do nothing else. And doing that kind of puts you in a peculiar spot because now you have to operate at line rate. Like if you, at any point slow, everything’s going to be slow, right? But at the same time, it gives you a bunch of high-level qualities or bigger sort of abilities because you can reuse existing things.

So traditionally, in this example, I showed you don’t allocate new buffers every time a message comes in, you have one buffer, you read into it and you processed your message, you’re done. And then you use the same buffer again. And being synchronous kind of means that that is safe to do. But if you’re asynchronous, you have to think about, “Oh, but maybe someone else would reuse my mutable state later while I’m not yet done with it.” Kind of thing.

You’re doing this for trading, and for messages? Some of the messages aren’t [inaudible 00:48:58] anything they actually have synchronized process, they have their own buffers.

I agree, but I think for one is sometimes the exchange has gives you one big fire hose and you don’t have a choice. You have to have a system like this at the beginning, who can process everything in one go. And I agree later on, you want to fan out, but I still think there’s something architecturally important of being able to at least able to build a system that can sort of process the full rate without any parallelism.

It just feels like you have multiple views on site at the same time, the ones that are-

I agree you can do this. I’m not convinced that that solves the problem in a better way, but it might. But I think we should probably talk offline afterwards. Yeah.

Do you use the same system of the coach in DSL to do IPC stuff and memory map files?

Let me unpack your question. You’re saying, do we use the same system as cogeneration in interprocess communication in memory map files?

Yeah. So similar paradigm, as far as example, you were showing this for just in process allocating offer, I was wondering if it was having a similar idea for how you’re communicate IPC or device trackers and things like that.

So I think the answer is yes, at least to some extent, I think we’ve definitely experimented with sort of having multiple processes use of shared memory as a mechanism of communicating in this way. And I think we also experiment with all the sort of cleverness you can do with device drivers. So yeah, I think the answer is yes, though I’m not entirely familiar with all the details.

And then the second question I had that was, I think a couple of times ago the subject of JDTs came up as a better way of representing and touching memory.

Oh yeah. Excellent. Oh yeah. So I kind of sidestep this in this talk because, I didn’t want to introduce the whole notion of a JDT because it’s a complex thing to talk about. So what I did when we looked at this protogen, this example I had. I kind of sidestepped an issue because as I said, “Oh, here’s a message type.” And it was just a product type would you can like very simply expressing in a DSL. But if you want to do some types of like you don’t like… Depending on the first bite of the message, maybe the rest of the message, they have changes. Now you have to play some tricks with the type system to do so. And yeah, that’s that. And I agree you can use JDT to do that. Yep.

Is this like still traversing the OS memory stack, or is it being injected earlier in the process and actually correctly reading from tbe buffers?

So kind of the thing I presented is I think you can do both things. So we at Jane Street, definitely do both too. We definitely experiment with ways of bypassing the operating stack. I think the code I just showed the examples I showed kind of just work through in a normal opening system. But yeah, I think if you want to be faster these worlds, you can also accelerate by kind of ignoring your operating system stack. Yup?

How do you guys do FPGAs?

Oh, so the question is why we have not done FPGAs or?

Are you guys doing it?

How are we doing FPGAs? Yes.

Why were you not talking so much on the software.

Right. This tall kind of focuses on performance on a standard single core thing. And I think that gives you a lot of value, though, and I think the talks meant to show that, well, you can get below a microsecond doing so, and you can get below a microsecond doing so without sort of giving up too many of the features you like. But I think there’s also kind of clearly a sort of lower bound where if you want to be even faster than this yes, you have to use FPGAs and Jane Street have some FPGAs engineers who experiment with that. But it kind of forces you in a different framework too, I don’t think at least we had at least we at Jane Street don’t know how to do all the things that you can do in this framework and a FPGAs you can, you certainly use FPGAs in highly specialized applications things.

But I don’t think they solved the general problems of this. I agree in some cases they would, So, yeah.

So this doesn’t in way [inaudible 00:53:06]?

This does not run on the FPGA. But some other things at Jane Street run on FPGAs.

Sorry, again. So this is a [inaudible 00:53:18], all right? So in the science, if you’re processing this in the software, as you’re doing something else on FPGAs why are you crossing, [crosstalk 00:53:28] FPGA?

I don’t think we’re crossing. I think that these two, two different systems with two different characteristics in different places.

Oh, the orthogonal? Okay, okay. That is fine.

What is your age in [inaudible 00:53:43] library, what are you using for picking all these metrics?

Good question. I don’t have a good answer. Like in this case, we kind of did an analysis in the study, which I think the study itself was pretty easy because you can kind of subscribe to the NASDAQ stream and kind of like, just in this case, you can build this, the original example I showed of the queue distribution, you can kind of just build a simulator, subscribe it to the real market of the feed with a different queue sizes and see how big the queue gets. I’m afraid to say that there isn’t a good general answer I have here is more of a… Well, if you want to do this particular experiment, I can show you how, but I don’t have a great answer to general tools, so yeah.

Given how much compiler work you guys do, is there some structural reason why you couldn’t rework the variance system to instead of kind of hack you did with PPX for options just to fix it on a pilot level?

Yeah. It’s a good question. I think I’ve asked the same question. It kind of feels a little frustrating to sort of work around with a preprocessor at a sort of around a fundamental limitation of the runtime when you really like to just be able to change the runtime. I think the answer I got was like, it’s not trivial to change the runtime, like backwards in complexity reasons. And, also sort of memory representation and the way it interacts with the garbage collector tagging is not trivial. But I agree, it kind of feels a little intellectually frustrating to have to build something on top of it to make the underlying thing disappear when you could just fix the underlying thing maybe. Yeah?

Although you mentioned that you have to [inaudible 00:55:22] OCaml. And you don’t want to do the language [inaudible 00:55:27] but for GC main characteristic. Why don’t you consider something else that don’t have a [inaudible 00:55:39] predictable?

Right. So the question is why do we use OCaml in these cases, especially on GCs when we could use other languages? I think that’s a good question and a fair question. But I think there’s like a high value in maintaining the same language everywhere in your sort of workflow. So Jane Street we kind of use OCaml for everything, which means all the developers understand OCaml all the developers use OCaml on a daily basis. All the tools are optimized for OCaml. So I think there’s a pretty non-trivial cost introducing another language to do a small, even if you use do it in a small spectrum, which I think is the primary reason not to do it. And I think originally when we started out, we were thinking, “Oh, hopefully we can make this work in OCaml if we can’t, maybe we’d have to do something like this.” But I think with at least this target’s kind of trying to show that at least to some extent it works in OCaml. So we have it doing that. Yup?

I think you mentioned about the inlining that there is someone that basically measures when it is worth and not worth, right? But do you know anything about… Do you guys build like models of computer architecture, is it just like benchmarking like you just read this [inaudible 00:56:51].

I think it’s more the latter, it’s more of an empirical study, I think there’s… I think it’s mostly a benchmark in the empirical study of what does your performance look like if you inline this thing… If you don’t and then build a heuristic based on like the size of the function, or maybe you tag something in the new code as being on the hot path and the cold path, and then you line spec based on that. I think that is the primary way of arriving at the heuristics.

Does the OCaml compiler, give you [inaudible 00:57:23] the inline something or the inline or?

Yes, it does. You can specify at a particular function point and that you would like this inlined and if it can’t inlined it and for a reason, it will give you an exception. That’s actually a compiler yeah one, [crosstalk 00:57:37] yes. You can say kind of, “Oh, this things should be inlined,” and then it can kind of guarantee that or fail. Yeah, we do that. And at least in these kinds of code paths. Yes. Not on other code pass, but yeah. Any other questions? Cool. Thanks.

Safe at Any Speed: Building a Performant, Safe, Maintainable Packet Processor

Transcript

The next great idea will come from you