Production Engineering When Trading Billions of Dollars a Day

What’s it like to monitor and operate software that trades billions of dollars a day on stock markets around the world?

When your software has near-unlimited access to your bank account, every single message counts. Speedy alerting and incident response have a direct and measurable impact on the P&L.

In this talk, Mark Doss, a production engineer at Jane Street, explores the day-to-day technical operations of a trading firm, with a heavy focus on what happens when things go wrong. He covers the unique features of the trading environment that make production engineering especially high-stakes, how Jane Street approaches monitoring and alerting (and why traditional SLO-based approaches often fall short), and the role of defense in depth and cross-team communication, then walks through a realistic sample incident to make it all concrete.

Transcript

Mark (00:00:05)

I’m Mark. I’ve been at Jane Street for a little under 10 years now. I’ve moved around the firm a bit, but I’ve spent the bulk of my time doing something close to what we would now call production engineering here at Jane Street. And I’m going to talk a little bit about some of the interesting challenges that come from the intersection of that discipline with the trading environment. (00:00:27) I’m going to start with an intro. That’s this. I won’t infinitely recurse. I’ll then go into some relevant features of the production environment. What makes trading such an interesting space in which to operate software in production? And in what ways is it similar to other domains? And for those of you who work at other trading firms, some of this will be familiar. Some of it may be different. Maybe the ways in which we place emphasis on things might be different.

(00:00:55)

Spend about 15 minutes on that. I’ll actually try to spend a little less time, but we’ll see if I succeed. And then we’ll talk about what it means for us at Jane Street. How should we think about building technology, processes, culture, in light of the features of the production environment? I kind of hope to spend a little more than 15 minutes, but again, we’ll see how I do. And then I’ll talk about a sample incident that will hopefully make everything feel a bit more concrete. And then I’ll leave as much time for Q&A as possible. Hopefully that sounds good. So a few more words of introduction. First off, I’ve said production engineering at a trading firm. Production engineering, similar words might be DevOps or SRE. It’s kind of similar concepts. This is one plausible definition. I don’t claim it’s a perfect one, but maybe it’s good enough.

(00:01:46)

Maybe a more human definition is roughly, we’ve got a bunch of software running. It runs on computers that have hardware failures, in data centers that have cooling outages, over network connections that get their cables cut, connected to exchanges that have their own issues. And of course, we occasionally deploy our own bugs. And so how do we technically, operationally, manage all this stuff in production? And again, especially when things go wrong. So that’s basically the problem statement. And then I said, at a trading firm, so what is trading? What does a trading firm do?

(00:02:25)

Very basic basics of trading, sort of a system diagram that you need to know. You start off with some trading system. Could be a human trader. Could be an automated system. Someone coming up with some strategy. The primary input into a trading strategy in our model is going to be market data. So an exchange can tell you recent orders, buys, sells, et cetera, in some stock. And a trader who's going to decide what to do with that is going to be interested in that tick-by-tick order information. And then they're going to come up with some strategy, some trades that they want to execute. And when they do that, they're going to need to execute it somehow. So we're going to connect to some sort of order entry port. I might call it an order engine throughout the course of this talk. That's going to be directly connected to an exchange.

(00:03:16)

We might have a lot of different order entry ports connected to a lot of different exchanges. And that order entry port is going to write to the exchange and also read from the exchange. The exchange will acknowledge the orders or reject them, communicate that an order has been filled, things like that.

(00:03:35)

That may be a thing that is already in a lot of your heads. Maybe it's like your model of trading already. There's one more piece that people forget about a lot. I actually won't talk about it a ton in this talk, but would love to later because it's the area in which I first started at Jane Street, so it's near and dear to my heart. But once you've done a trade, you also need to tell the world and yourself that you've done it, so you need to keep track of your positions. That is then going to feed back into your trading strategy, which needs to know what its positions are to correctly perform its logic, and you also need to report those trades to regulators, clearing firms, and other external entities.

(00:04:14)

That is the basic system diagram that you kind of need to have in the back of your head. And I want to go into one more thing, which is I have said the word order a bunch, like you send an order to the exchange. Let's look at what an order is. When we talk about an order, we're talking about an instruction to an exchange to execute a certain trade. And this is, I would say, like the simplest order that you can kind of come up with. There are five main components to this order that I want to talk about. There's the side. You can buy, or in this case, sell. There's the quantity. One share, maybe 100 shares, maybe 1,000 shares. There's the instrument, its symbol or ticker. What is the thing I'm buying or selling? In this case, J-Com. Could be Apple. And then there's a price, which is going to consist of a number and a currency.
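
To make those components concrete, here is a minimal sketch of how such an order might be represented. The type and field names are illustrative assumptions, not Jane Street's actual order representation.

```ocaml
(* A minimal sketch of an order with the components described above: side,
   quantity, instrument symbol, and a price made of a number and a currency.
   Names are hypothetical, not Jane Street's actual types. *)
type side = Buy | Sell

type price = {
  amount : float;     (* e.g. 600_000. *)
  currency : string;  (* e.g. "JPY" *)
}

type order = {
  side : side;
  quantity : int;      (* number of shares *)
  symbol : string;     (* e.g. "J-COM", "AAPL" *)
  limit_price : price;
}

(* The order from the example: sell 1 share of J-Com at 600,000 yen. *)
let example : order =
  { side = Sell
  ; quantity = 1
  ; symbol = "J-COM"
  ; limit_price = { amount = 600_000.; currency = "JPY" }
  }
```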

(00:05:11)

That’s the basics of an order. They can get sort of arbitrarily complicated with more foot guns, multipliers, quoting conventions, all sorts of funky stuff. But this example is going to be good enough for the purposes of this talk.

(00:05:28)

That’s the intro. Now let’s get into some more complicated features of the trading environment. Feature number one is every order is important. This is notably different from, say, a typical web service, where maybe it’s not the end of the world if one client’s request to your web server fails or is a little slow. It’s not so bad. They can wait, try again. Maybe worst case, you end up with an unhappy customer. It’s not really the end of the world. Returning to our previous example of an order. Simple, but if you were to get any of those details wrong, it could be catastrophic. What if you make a mistake in any one of those economic details? To give one hypothetical example, what if you were to make this mistake and sell 600,000 shares at a price of one yen instead of selling one share at a price of 600,000 yen?

(00:06:27)

Quick mental math: if you were to submit this order, and if it were to get filled, you would lose roughly 600,000 squared yen. Now, you may think, okay, this is a pretty implausible bug to write. Like, that’s a huge mistake. Just don’t make a mistake that big. But, you know, you’ve got 1,000 engineers writing code, orders coming from both automated systems and manual trading, so you have to avoid both bugs and fat-finger mistakes. There are different order types. They do get more complicated. There are different exchange protocols. You kind of need to be 100% sure that you’re not going to somehow make this mistake. And you might think, okay, doesn’t the system kind of save you? Like, okay, sure, someone submitted this by mistake, but you can sue or something, right? No one’s going to actually let this go through. Unfortunately, this is not some random hypothetical example that I came up with.
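
One plausible mitigation, sketched here purely for illustration: a pre-submission fat-finger check that rejects orders whose price is wildly far from a reference price, or whose notional (price times quantity) exceeds a limit. The thresholds and function names are assumptions, not a description of Jane Street's actual checks.

```ocaml
(* Hypothetical pre-submission fat-finger check. Thresholds are invented for
   illustration; real limits would be tuned per instrument and per desk. *)
let max_notional = 100_000_000.         (* assumed per-order notional limit, in yen *)
let max_deviation_from_reference = 0.10 (* assumed 10% band around the reference price *)

let check_order ~price ~quantity ~reference_price =
  let notional = price *. float_of_int quantity in
  if notional > max_notional then
    Error (Printf.sprintf "notional %.0f exceeds limit" notional)
  else if abs_float (price -. reference_price) /. reference_price
          > max_deviation_from_reference
  then Error "price too far from reference price"
  else Ok ()

(* The swapped order (sell 600,000 shares at 1 yen instead of 1 share at
   600,000 yen) passes the notional check but trips the price-deviation
   check, since 1 yen is nowhere near the 600,000 yen reference. *)
let () =
  match check_order ~price:1. ~quantity:600_000 ~reference_price:600_000. with
  | Ok () -> print_endline "order passes checks"
  | Error msg -> print_endline ("rejected: " ^ msg)
```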

(00:07:23)

This was a real thing that happened December 8, 2005. A trader at Mizuho made this exact mistake. And he immediately tried to cancel the order. The exchange did not accept his cancel. The order went through. It was filled. The firm lost 27 billion yen. Japan’s primary stock index fell about 2%. And I think that trader was pretty unhappy, as was the firm. They sued. They won some money, maybe like a quarter of the total losses they were requesting. So I would not count on the financial system saving you. You just cannot count on that. So that’s one reason why every order is important. Separately, there are compliance concerns. Like, the SEC is just not going to be happy with a 99.99% success rate. They actually want you to report every order. You will get fined if you have a few thousand, you know, missing reports here and there. So yeah, in general, feature number one of the production environment, every order is important.

(00:08:33)

Maybe more generally, I can generalize that into: trading is just scary, especially automated trading, where we’re connecting computer programs to our bank accounts and telling them to operate in a tight loop. Like, anyone who has written software should know that that is a scary sentence. It would be scary enough if the logic were all trivial. But the logic actually gets pretty complicated. The price might be more complicated than that simple one-number price I showed you before. There could be a price with a multiplier. The multiplier for a given instrument could be different depending on the context in which you’re talking about that instrument, whether it’s like pre-trade or post-trade. And remember, getting any single one of these details wrong can be catastrophic. Or it could be just a little bad, but if you do it enough times in a tight loop, that can be catastrophic, which I’ll get to in a second.

(00:09:31)

And then the last bullet point there about adverse selection, I just want to say sometimes these examples can seem contrived, like, you know, okay, sure, you roll this bug, but nothing bad is actually going to happen in the real world, right? And the problem with the trading environment is, no, something bad will happen. In this environment, as soon as you decide you want to do a bad trade, every other market participant very eagerly wants to do that trade with you as fast as they possibly can. So once you make a mistake, it is sort of immediately pounced upon, which is, I think, pretty unique to the trading environment.

(00:10:08)

And again, if you think this is theoretical paranoia, this is a real thing. Quick show of hands, who has heard of Knight Capital or Knight Day? Okay, less than half. Okay, about half. I’m not going to spend too much time on this, but I just want to recap this story. So the basic story of Knight Day is they were a trading firm like Jane Street. This story happens in 2012, I believe. They were among the leading market makers, automated trading firms based in the United States at the time. And they had some dead code, which if triggered, would lose them money. There was a bug. It would lose them money on every order that went through that system. It would only be triggered by another system sending a certain config flag. And that other system never sent that flag. So this never triggered in production.

(00:11:04)

Then they did two things simultaneously. One is they repurposed the config flag to do something else instead, implement some new feature. This was to save themselves a version bump and protocol upgrade.

(00:11:20)

I have done this before, repurposed a config flag to save myself a version bump. The new feature that they were implementing did not have the bug. So this change is a good thing. They’re avoiding the bug now. Then they started sending the flag from that other system. So you can kind of see like, uh-oh, better hope step one goes well, or there’s going to be a problem. You can probably predict that step one did not go well. It mostly did. They had, I think, eight instances of this system. Seven of them successfully upgraded, and one of them failed to upgrade. It actually did send notification emails that it failed to upgrade, but this was ignored. So the open happens and they immediately start losing money. Every trade that goes through the system that didn’t upgrade loses money. The engineers immediately identify the system as the culprit.

(00:12:19)

And what do you do when you just rolled to a new version and it is hemorrhaging money? You roll back. They did not identify that it was the system that failed to roll that was losing money. They thought it was the system that rolled. I think a pretty intuitive, reasonable thing to think, quite frankly. So they pretty quickly rolled back and that 8X’d the speed at which they were losing money. 45 minutes later, they had lost about half a billion dollars. And they basically ceased to exist at that point. They had accumulated billions of dollars of positions that went against them. And their own stock price collapsed. They were bought by a competitor a few months later. And now we talk about it in talks like this as a cautionary tale. There’s a ton of nuance to this story that I’m glossing over. I could have a whole talk on lessons from Knight Capital Day, and that is not this talk.

(00:13:17)

The main thing I’m trying to convince you of is just that, you know, this is not a theoretical concern. This is a thing that happens and that we think a lot about.

(00:13:28)

And there are some… The SEC has a great report on this if you enjoy, like, reading. It’s actually an amazing postmortem and you should read all the details. So that’s feature number two of the production environment. Feature number one, every order is important. Feature two, trading is generally pretty scary. Feature three, a little different. Timing matters a lot. Our systems follow the trading day. So systems shut down in the evening, and they start up at the beginning of the day. And nothing really happens in these times at the beginning and end of the day. And then there are these huge spikes at the market open and the market close. People not in the industry may be surprised at just how large a percent of our overall volume happens in those small windows. And in between, there’s a lower constant level of trading.

(00:14:28)

This is a slightly simplified model. Stuff happens during the day, too. So sometimes there’s special trading events. FOMC minutes, this means like, every few months, the Fed will announce various economic data or, you know, their policies. This is a big trading event. But that’s one example. So we have these planned trading events that we know about. Sometimes the events are unplanned. Trump tweets something, or, you know, a country invades another country or something, and volatility goes haywire. That’s a thing that happens as well. And then to say one more thing about timing, external systems, like we have a lot of external dependencies that are really anchored to time. So there’s a pretty interesting flow throughout the day where, you know, at 7:30 AM every day, you download some metadata from some external service that then you load into your systems, you load those into another set of systems,

(00:15:33)

then from some other service you download some more metadata, and you’ve got this kind of ticking clock of about two hours every day where a bunch of things need to happen in concert, and if anything slows down, you need to scramble to fix it before the open.

(00:15:47)

There are a few examples listed here. There are more examples that I could give, but this is the kind of shape of what the trading day might look like. So that’s another, I would say, interesting feature of the production environment. Number three is just that timing matters a lot. The last feature I want to talk about is we have internal users. This is actually, in my opinion, the most interesting of all of them. It’s pretty cool to work in an environment where all of the users of our technology are mostly within the same building, or at least within the same company. And I’ll talk more about the implications of that later, but it is just important to know that if something is wrong with our system, we can talk directly to the users in a very high bandwidth, high fidelity, sort of no-BS way. I don’t have to couch my language or worry about some reputational impact of admitting the service is down or something.

(00:16:43)

I can just say exactly what’s wrong, and our user can talk to us about exactly the impact that’s happening on them. And that’s going to really affect the way that we respond to things in the production environment.

(00:16:59)

Those are the main features of the production environment. I think I am ahead of schedule if I timed it right, so I’m happy about that. So now let’s talk about the implications. What does Jane Street do as a result of those interesting features? Let’s start by talking about monitoring. So you might be familiar with more traditional, SLO-based monitoring. That might look something like this. So the error rate should be below 0.01%, i.e., you know, 99.99% of our requests should succeed. This is a pretty standard shape of alert that you might monitor. We do very little of this. There’s a problem, which maybe you’ve identified, which is I just said 100% of orders are important. So if the 0.01% of orders that fail includes that one order, say the cancel that’s really, really important, the one that’s going to bankrupt you, then this alert, or an alert based on this SLO, is not going to help you very much.

(00:18:06)

That’s not just true in the order space. You know, you might say 99.99% of our instrument symbology should be accurate. But you really can’t be trading the wrong symbol on any order. Because every order is important, every aspect of the order is sort of 100% important as well. So we actually do very little SLO-based monitoring in our live trading systems. What do we do instead? We do a lot of event-based monitoring and alerting.

(00:18:41)

What does event-based monitoring mean? It means we have a section of code where it says, if this condition happens, then raise this error for a human to look at. And the main feature of this, I say feature, it’s kind of a blessing and a curse, is it forces you to think about every single edge case. So in traditional SLO-based monitoring, one of the awesome things about it is you can ignore unimportant edge cases. Who cares if there’s some weird condition on some weird browser version with some weird plugin configuration that causes 12 users to have the web page load slowly? Like, it’s just not important. You don’t need to waste an engineer’s time dealing with this condition. That’s a great feature for people who can set up that kind of SLO-based monitoring. With event-based alerting, what we’re saying is, actually, we need to look into that edge case. And we are going to decide consciously in code, or maybe in some config that’s attached to the code, that we need to at least identify that issue and verify that we’ve decided it’s not worth an alert.
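
As a rough illustration of that "raise this specific error at this call site" style, here is a sketch. The alert-raising function, severity levels, and owner names are invented stand-ins, not the firm's actual alerting library.

```ocaml
(* Sketch of event-based alerting: at a specific call site, an explicitly
   considered condition raises a specific alert for a human. [raise_alert]
   is a stand-in for whatever real alerting library would be used. *)
type severity = Page | Warn | Info

let raise_alert ~severity ~owner msg =
  let level =
    match severity with Page -> "PAGE" | Warn -> "WARN" | Info -> "INFO"
  in
  (* In a real system this would route to the alerting infrastructure. *)
  Printf.printf "[%s] (%s) %s\n" level owner msg

let handle_exchange_ack ~order_id ~sent_price ~acked_price =
  if acked_price <> sent_price then
    (* Edge case considered and deemed page-worthy. *)
    raise_alert ~severity:Page ~owner:"order-engines"
      (Printf.sprintf "order %s acked at %f but we sent %f"
         order_id acked_price sent_price)
  else
    (* Edge case considered and deliberately deemed not alert-worthy. *)
    ()
```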

(00:19:52)

So we really, at least in these trading-based systems, we really catalog every edge case that comes up and decide whether or not to alert on them. We don’t necessarily fix every single edge case. Sometimes we have decided that we can fail safely in some way. But we really do enumerate all of them and modify our alerting accordingly. And so some benefits and downsides of this. So the benefit, obviously, you get the safety that we’re after. There’s some other nice things, like you get a really high fidelity of alert detail. In SLO-based monitoring, you might have this annoying problem where you’ve decided, okay, the rate of failure has gone up two times over the last month for like, a wide variety of reasons. Maybe you need to do some data analysis to figure out which ones are most problematic, or if you should just change your SLO, or what the deal is.

(00:20:45)

It can be a lot of annoying debugging. When you have a specific call site with a specific alert, you know exactly what happened and what you need to fix. And that direct link to the call site is also nice. You can just look at the code that generated the specific error condition, rather than having to look at a broad range of symptoms. The downsides, as I’ve alluded to, are you have to enumerate every edge case, which is just a lot of work. It can be noisy if you’ve failed to enumerate every edge case. There can be tribal knowledge, where you decide, oh, yeah, this thing is happening again, but we know that’s not an issue. And it can be hard to change alerts. Instead of just changing a config, you know, change 99.99 to 99.95, you have to change the logic that is generating the alert, which in turn means you need to understand the code base pretty deeply.

(00:21:39)

It makes it harder, I would say, to have some far-off distant operations team that’s monitoring your system without being deeply familiar with it.

(00:21:51)

All that being said, we do use plenty of metrics-based alerts as well, both for many cases where there’s not a real risk concern, and also to measure like long-term health of our systems and improvements to like our performance as a whole. But a lot of what we do is this event-based alerting. And I’d like to talk a little bit more about what kind of event-based alerts we really like. There is a classic mantra in, like, SRE land: alert on symptoms rather than causes. Again, a quick show of hands, who is familiar with this phrase or concept? Ah, sick. Okay, very few people. Excellent. I will not rush past it then. So a classic example here is you’ve got some CRUD app, some web server that’s connected to a database. And as you’re writing this system, you might think to yourself, I am going to raise an alert whenever the database goes down. And your logic that you’re intuiting is probably something like, the database going down is obviously bad and unintended. And it is obviously going to cause user impact.

(00:23:05)

If they try to connect to my web server when my database is down, they’re going to get 500 errors, and they won’t be able to connect. And furthermore, when I get this alert that the database is down, I know exactly what the root cause is. The root cause of the 500 errors is that the database is down, so I need to fix the database. If I were to not alert on it, then I might get a user complaint that they’re getting 500 errors. And I then need to debug logs, whatever, to eventually find out what I could have known all along, which is that the database is down. So that might be what you intuitively think. And I think in many places at Jane Street, and I mostly believe this, we mostly think that’s wrong. And you should not alert on your database going down.

(00:23:51)

And the basic logic for this is you’re going to need to alert on the 500 errors anyway, because the database going down is not the only reason why your users might get 500 errors. There’s a bunch of other reasons, and you probably can’t list all of them. So you need to alert on the 500 errors. And so if you alert on both, now you’re getting duplicate alerts. That’s noisy. That’s a problem. That’s maybe problem number one, and that’s kind of like the best case. And the worst case, which is actually, I’m going to claim very likely, is that a lot of those assumptions I just said about the database going down is obviously bad, obviously unintended, obviously going to cause user impact, those assumptions are actually not true. It may be true at the moment that you’re writing your app, but over time, you’re going to realize that you occasionally need to do database maintenance, and you’re going to come up with some scheme where you have a backup database, and you can actually do safe database maintenance with no user impact.

(00:24:49)

And so you’re going to say, “Okay, well, I’ll silence the alerts during my database maintenance window.” And then at some point, you’re going to do an ad hoc maintenance. And you’re going to raise an alert. And then after that, you’re going to say, “Ah, okay, we’re going to create some RPC for the person doing maintenance so that they can proactively silence the alerts for the next three hours as I do my ad hoc maintenance.” And it’s just going to be duct tape on top of duct tape, all covering over the fact that actually, none of your users care about your database. They care about the 500 errors.

(00:25:23)

The part I said about it being convenient to have the alert pointing to the database being down, that is true. That is a convenient aspect of debugging, but there are less noisy, less costly ways of getting the same benefit. You can attach the metadata that the database is down to the alert that’s generated. Or you can have a monitoring dashboard that displays information about the database so that when you see the alert, you know to go to this dashboard and it will be there. You can quickly draw the connection without having to look at another alert, which might wake your on-call staff up just because of routine database maintenance.
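
A sketch, with invented names, of what "alert on the symptom, attach the likely cause as metadata" could look like for the 500-errors example.

```ocaml
(* Sketch of symptom-based alerting: alert on the user-visible symptom (HTTP
   500s) and attach the database's status as context, instead of paging
   separately whenever the database is down. All names are illustrative. *)
type db_status = Up | Down | In_maintenance

let string_of_db_status = function
  | Up -> "up"
  | Down -> "down"
  | In_maintenance -> "in maintenance"

let check_500s ~errors_in_last_minute ~db_status =
  if errors_in_last_minute > 0 then
    Printf.printf
      "ALERT: %d HTTP 500s in the last minute (db is %s)\n"
      errors_in_last_minute
      (string_of_db_status db_status)
  (* Note what is absent: no standalone "database is down" alert. Planned
     maintenance behind a healthy failover produces no 500s and no page. *)

let () = check_500s ~errors_in_last_minute:42 ~db_status:Down
```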

(00:26:06)

So we do a lot of symptom-based alerting in general. And one particular variety I want to call out is what I’m going to call orthogonal alerts, maybe based on epistemic health rather than service health. And I kind of think of these as an extreme version of a symptom-based alert. This is a case where we really have no idea what the root cause even might be. In fact, we don’t even detect a problem with any particular service. We just know that something is weird about the world. Something that some service is outputting seems wrong in some way. And this may be a little abstract, so I’ll give two examples. We have an alert called fill-too-good or trade-too-good. This means we made too much money. And this is my favorite alert at Jane Street. I don’t know if it is, like, kosher to say that you have a favorite alert, it’s not exactly a fun party trick or something, but it is my favorite alert.

(00:27:11)

It can catch so many issues in so many different places in our tech stack. If you have problems with market data, if you have problems with instrument generation, if you have problems with the order engine, if you have problems with your trading strategy, if you have a rogue trader, if a trader has fat-fingered something, anywhere in your trading stack, something can be caught by this one alert. That’s like sort of symptom-based alerting on crack or something. Another example might be high percent of market volume. If you’re normally 3% of market volume, and all of a sudden, you’re 60% of market volume. I don’t know what happened, but that’s pretty weird and indicates that maybe something has gone catastrophically wrong.
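
Here is a rough sketch of what those two orthogonal, "epistemic health" checks might look like. The thresholds (a 5% edge, 30% of market volume) are invented for illustration; real ones would be tuned carefully per desk and venue.

```ocaml
(* Illustrative versions of two orthogonal "epistemic health" checks.
   Thresholds are invented; they are not Jane Street's actual numbers. *)

(* Fill-too-good: the fill is implausibly profitable relative to a reference
   price. A suspiciously great fill often means our picture of the world
   (market data, instrument metadata, strategy logic...) is wrong somewhere. *)
let fill_too_good ~side ~fill_price ~reference_price =
  let edge =
    match side with
    | `Buy -> (reference_price -. fill_price) /. reference_price
    | `Sell -> (fill_price -. reference_price) /. reference_price
  in
  edge > 0.05 (* assumed: more than 5% better than reference is suspicious *)

(* High percent of market volume: if we are normally ~3% of the market and
   we are suddenly a huge fraction of it, something may be badly wrong. *)
let too_much_of_the_market ~our_volume ~market_volume =
  market_volume > 0
  && float_of_int our_volume /. float_of_int market_volume > 0.3 (* assumed *)
```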

(00:27:54)

By the way, that one looks a little more SLO-y to me, which is cool and fine. I think the main thing is it’s a bit orthogonal to any specific indicator of service health. And yeah, I would say what those two examples have in common is that they’re not related to any particular technical failure in any particular service. They’re more indicating that our understanding of what our systems, as a cohesive unit, are doing is wrong at some high, abstract level. So these are some of the alerts that we like. Whenever you raise an alert, you have to think hard about who’s going to receive the alert. And I think these examples are a really good use case of that. I just said they’re not tied to any particular service. So which service owner should receive the alert? And I’d kind of like you guys to spend 20 seconds just thinking about this question.

(00:28:54)

Like, if you have this alert on fill-too-good, who should look at that, and how should they respond to it? Yeah, literally spend 20 seconds thinking about it.

(00:29:19)

I’m not gonna call on people ‘cause I think the mics won’t work and stuff. Also, there’s no right answer. Yeah, it’s a totally underspecified question. But I do want you to kind of think about that question in the back of your mind. I’ll harken back to this in a little bit. I’m curious what you guys thought. Just something to think about for now. The last thing I wanted to mention in this section is we do obsess over signal-to-noise ratio. So I have basically no secret sauce here, except to say that this is just a thing that the firm has to take really seriously as part of its culture. I think it’s a pretty neat thing about Jane Street that, you know, traders, devs, production engineers, everyone deeply understands that noisy alerts are worse than useless. And we care so much about human attention that we do end up spending a lot of time, you know, tweaking alerts, alert thresholds, alert logic, whatever, going over every edge case so that we can have a good signal-to-noise ratio.

(00:30:14)

And I think you probably can’t really do this event-based alerting unless you just have a really strong cultural foundation of caring about signal-to-noise ratio because it is so prone to noisy alerts that you need it embedded within the culture, not just like the tech nerds or whatever, but like everyone, traders, et cetera.

(00:30:39)

Yeah, part of the culture. Cool. Another thing I want to talk about is defense in depth. So I had drawn a diagram similar to this earlier, but I’ve now added this external risk enforcer. And I’ve been talking a lot about alerts, but I had previously said that any problem can be catastrophic. So what if we fail to raise an alert? You know, what if we have a bug in the alert generation logic? And I think basically the only way that I know of to solve that problem is to just have a lot of defense in depth, so that a lot of things need to go wrong, hopefully in an uncorrelated way, before you have a big problem, and you can, you know, be confident that you won’t. So what we do is we run similar alerts in different systems written by different teams and try not to share too much underlying metadata logic.

(00:31:31)

So, you know, the trading system might have its own risk checks monitored by a trader who’s applying some judgment. The order entry port has its own risk checks. And then you’re going to have some external system that, you know, reads in after the fact what was sent through the order entry port, and that has its own risk checks. And again, importantly, these are all written by different teams. So, you know, if one alert has failed to generate, the other alert hopefully will. And the main weakness in this approach is probably that they have some common dependencies. And so you’re going to have to look really carefully at those common dependencies, and you might need duplicate versions of those dependencies as well.
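
As a toy illustration of that kind of independent, after-the-fact check, here is a sketch of an external risk check that rebuilds positions from reported fills and flags a per-symbol limit breach. The limit and all names are assumptions made for the example.

```ocaml
(* Toy sketch of an after-the-fact risk check: rebuild positions from
   reported fills, independently of the trading system's own bookkeeping,
   and flag a breach of a per-symbol position limit. Limit and names are
   hypothetical. *)
module String_map = Map.Make (String)

let position_limit = 1_000_000 (* assumed per-symbol share limit *)

(* Apply one fill: positive quantity for buys, negative for sells. *)
let apply_fill positions ~symbol ~signed_quantity =
  let current =
    Option.value (String_map.find_opt symbol positions) ~default:0
  in
  String_map.add symbol (current + signed_quantity) positions

(* True if any symbol's absolute position exceeds the limit. *)
let breaches_limit positions =
  String_map.exists (fun _symbol pos -> abs pos > position_limit) positions
```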

(00:32:14)

Some more about defense in depth. This is going to be similar to the fill-too-good thing I talked about, which is we monitor technical health, monitored by technical staff. And we monitor trading health, trading impact, monitored by trading staff. And again, those are two sort of orthogonal sets of alerts that hopefully cover similar ground. So tech system monitoring is going to look like, is our software or hardware experiencing technical issues? Are our systems able to execute trading decisions as expected? Do we have external connectivity issues? Whereas trading impact is going to be, are we having the market impact that we expect? What P&L are we accruing? Are we sending the same orders that we intended to send? And like the fill-too-good case, sometimes traders noticing weird trading impact can inform technical staff of problems with our technical health.

(00:33:14)

And the very last thing I want to say about monitoring is, it is not optional at a trading firm. I sort of hear of other firms where it’s kind of like an ancillary thing, maybe like the project you give to the new joiner. If it crashes, you know, who cares? It’s not serving any customers. It’s not a huge deal if it goes down. And that is not the way we treat monitoring at Jane Street. We basically don’t trade unless it’s being monitored. And so monitoring systems are among our most robust, most well-tested, most redundant, we have failovers, et cetera. So that’s just another sort of philosophy around monitoring at Jane Street.

(00:33:56)

Okay. There’s one more thing I wanted to talk about, sort of how we react in our production environment to the nuances of the trading environment. And that is business context is key. So production engineers and all technical staff who respond to things in production, which is basically everyone, need deep business context. And actually, a corollary, or not a corollary, the inverse of this is traders need technical context. So if you remember that bit I said earlier about how we have internal users and the ability to communicate effectively, I sort of implied it was good and useful and nice. But you can only really take full advantage of it if you can actually speak really effectively together in a high-fidelity way. Another way of putting it is it raises the stakes and importance of being able to communicate effectively with people who speak kind of different languages from you.

(00:35:00)

So, you know, communication bandwidth is much higher if a trader can say, you know, spoo market data looks stale. And I, as technical staff, know immediately that when he says spoos, he means the index future on the S&P 500, and I know that this trades on the CME, which is located in Aurora, Illinois, which therefore means that probably if I search for AUR in our list of order engines, I’m going to find the order engine that he’s talking about. And so in an incident, if someone were to say spoos market data looks stale, it would probably take me, you know, two or three seconds to find the relevant order engine or market data feed that that trader is referring to. If you don’t have the business context to know what that means, you might need to, you know, ask him what spoos means. Okay, it means this symbol name.

(00:35:48)

I’m going to do some lookup table to see the exchanges that this symbol trades on. It’s going to give me some identifier. I’m going to look up that identifier. It’s going to give me some other thing. We can bypass all of that and know immediately what the person is talking about. This is a bit of a contrived sort of simple example, but I think deeper examples exist as well, where traders can say pretty nuanced things about the trading and technical staff who know some nuances of the trading can pinpoint the likely culprit or system that might be at fault.

(00:36:22)

And lastly, I had also said that timing matters. Those special market events are really important for, you know, technical staff to be aware of. You can know that different market events will impact different desks differently. Understanding those nuances can be really valuable. The reason our market data is going crazy right now is because, yes, Trump just tweeted about steel tariffs. We should be prepared to respond accordingly. I understand why we’re getting these delays. This is what we should do. This sort of thing happens all the time, and it is really useful to be able to have that context and avoid doing things like rolling a system before some economic news that’s going to affect the desk that that system is serving.

(00:37:14)

That is what it means for production engineering. I think I’m still doing good on time, right? And now I want to get through a sample incident, which will hopefully make this all a little bit more concrete. This sample incident is loosely based off an amalgamation of a few separate incidents. So it’s not 100% real, but it’s pretty realistic, I would say. And yeah, hopefully interesting. In this sample incident, at 7:15, a routine roll of an order engine completes. New version, kind of standard, has some bug fixes, small refactor, tests pass. Cool.

(00:38:00)

At 9:31, fill-too-goods are raised to trading desks for a bunch of symbols. At this point, there are a lot of individual traders who have each received one fill-too-good. And so, no one quite knows that there is some widespread problem. But maybe a bunch of people might start looking into their own subset of trading. 9:32, the order engines team receives an alert called too many fill-too-goods, and the order engine automatically halts. So now I’ll do a quick sidebar. I had talked about who should receive the fill-too-good alert. I think there are a lot of plausible answers. I think this is one possible answer, which is like maybe you have individual fill-too-good alerts go to the trading strategy that is related to that specific symbol. And maybe you have some meta fill-too-good alert that goes to some tech team. In our case, I had asked the question, if this could be an alert that covers anywhere in the tech stack, like which tech team should receive the alert? And the thing that we do at Jane Street is we send them to the order engines team. And the basic idea is the order engine is the last point before you sort of get to the exchange.

(00:39:17)

It’s like the last point of safety before your risk is going out the door. And so theoretically, the order engines team should be the team to most reliably guarantee that we’re not doing bad trades. They’re our last line of defense. That’s our answer. I think there’s a lot of plausible ones. That’s what we do. So 9:33, order engines is going to start an incident. They use a system we have called what changed. What changed is you type in a system, and it tells you all the things that changed in that system. You know, you can imagine binary changes, but also any relevant config changes, any relevant metadata changes, anything like that. And they very quickly see that there was some code changes that were released this morning. And you look at a list of feature names, one of them says refactor of price serialization. Okay, price serialization seems like potentially a pretty scary thing to get wrong.

(00:40:09)

Hope I serialized the price right. Maybe the refactor caused a bug. So I start investigating that.

(00:40:18)

Within a minute, as you start debugging the price serialization change, you realize that that change was actually rolled to many more order engines than just the one that you’re currently investigating. You now have a bunch of interesting things to think about. So maybe you can kind of divide the possibilities into a few categories. Let’s assume for a second that we have 100% confidence in our alerting. That could indicate two things. One is it might indicate that the price serialization feature is a red herring. If our alerting is 100% correct and the price serialization is the problem, then shouldn’t all the other order engines also be raising fill-too-goods? That’s one thing you might think. The other thing you might think is, okay, well, maybe the price serialization change is to blame, but it’s only affecting this order engine for some reason, you know, it’s related to the specific protocol that this order engine is speaking. That’s kind of like one category of thing,

(00:41:22)

if you believe 100% in your alerting. The other possibility is you don’t 100% believe that your alerting is perfect. And so you think this price serialization change was rolled out to a bunch of other order engines. And maybe those other order engines are actually also losing money, but we haven’t yet discovered it. Maybe our fill-too-good thresholds are too, you know, risky or something. So at this point, we might start talking like, should we be preemptively halting these other order engines that are rolled with this suspicious code change? And I don’t think we’re going to do this immediately. We’re going to spend a few minutes talking about it. And that’s what’s happening right now. A few minutes later, as this discussion is taking place, traders join the incident call. And they report that market data looks stale. And so when I say stale here, what I mean is they would expect to receive second-by-second or faster ticks for each individual symbol they’re trading.

(00:42:24)

They’re saying, like, I would have expected to receive a bunch of ticks in Apple by now, but actually, I haven’t received any. And I’m trading based on the market data that we got at 9:30 AM this morning at the open. That seems like a big problem.

(00:42:39)

As soon as they report this, the entire incident bridge immediately thinks to themselves, ah, maybe this price serialization thing is all a red herring. Maybe the problem is market data. So let’s have market data join the incident call. And the market data team will see some symbols that look stale. They pull up their monitoring, and they say, yeah, I agree with the traders. These symbols, the last time they were updated was yesterday’s close. But they aren’t marked as stale. So when I say marked as stale, what I mean is we have a system for reporting to consumers of the market data that, actually, you should not use this market data, you should use something else instead, this one is not reliable. And we’re not doing that here. So I think there might be a bug. I think people might be getting stale market data and thinking it’s live.

(00:43:33)

They start debugging. You search for one of the culprit symbols in our meta search tool. And you see that there was an exchange-driven change. What I mean is an email from the exchange, sent two weeks ago, saying, yeah, as of today, we’re going to split the market data feed. So again, split the market data feed, meaning maybe they’re sending this market data over four partitions right now, a quarter of the symbol universe in each partition. They’re going to create a fifth partition, and 20% of each of the old partitions will go to the new partition. You should start connecting to this new partition on this day. And the market data dev says, we should be robust to this, but I can imagine that we have a bug. And the bug is we mark entire feeds as stale or not stale. And maybe we missed this EDC, Exchange-Driven Change. And so for all of the symbols that are on the new partition, we’re just not reading that new partition.

(00:44:40)

We still have them internally in a config attached to the old partition. And we’ve said that old partition is current. So you guys are trading based on yesterday’s market data, like the closing price yesterday. The market data team says, this shouldn’t be possible. Like we have fixes for this. That’s not how the logic works. But I can imagine that we have this bug. And we definitely did miss this EDC.
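
To make that failure mode concrete, here is a toy model of the bug as described: staleness is tracked per partition, and symbols are mapped to partitions by our own config, so a symbol that silently moved to an unread new partition still looks current. This is a simplified illustration, not the real system.

```ocaml
(* Toy model of the described bug. Staleness is tracked per partition; our
   config maps symbols to partitions. If the exchange moves a symbol to a new
   partition we never subscribed to, but our config still points the symbol
   at an old, "current" partition, the symbol is never reported as stale.
   Purely illustrative; names and partitions are made up. *)
let partitions_we_read = [ "P1"; "P2"; "P3"; "P4" ] (* missed the new "P5" *)

(* What our config says after missing the EDC: JCOM still mapped to P1. *)
let stale_config = [ ("AAPL", "P1"); ("JCOM", "P1") ]

(* What the config should say after the split: JCOM actually lives on P5. *)
let corrected_config = [ ("AAPL", "P1"); ("JCOM", "P5") ]

let symbol_looks_current config symbol =
  match List.assoc_opt symbol config with
  | Some partition -> List.mem partition partitions_we_read
  | None -> false

let () =
  (* With the stale config, JCOM wrongly looks current even though its real
     partition is not being read at all. With the corrected config, it at
     least gets flagged as not current. *)
  Printf.printf "JCOM current per stale config: %b\n"
    (symbol_looks_current stale_config "JCOM");
  Printf.printf "JCOM current per corrected config: %b\n"
    (symbol_looks_current corrected_config "JCOM")
```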

(00:45:05)

I think we should keep trading halted right now on this order engine. I don’t think we need to halt any of the other order engines because this really does seem like the culprit. We’re going to roll a config change to account for the change. They do that a few minutes later. They roll a config change. We turn on the order engine, see if we continue to get fill-too-goods. We don’t. And we sort of conclude, okay, the situation is resolved. We found the culprit.

(00:45:38)

And then people start doing some thinking, like, okay, but how confident are we that we’ve actually solved the issue? Can we be convinced that the market data bug is the cause of the problem? So, two weeks ago, we got this exchange notification. We do some investigation and determine that there is a bug. The exchange did also send us, you know, a notice in the market data feed itself telling us that these symbols have moved. And we are supposed to have logic that reads this and knows to mark the symbols as stale. But we’ve actually never exercised this logic because we’ve never missed an EDC before. So there are sort of two things that have gone wrong. One is we missed the EDC. And two is we didn’t handle missing the EDC correctly. We had a bug in how we handled partitions, you know, that code path had never been executed.

(00:46:39)

So some interesting takeaways here. Again, my favorite alert, the fill-too-good, is the thing that caught it. You know, an order engine alert caught an issue in the market data. That’s pretty cool. Good job, fill-too-goods. I think in real life other alerts probably would have caught it too. Maybe like our trades not appearing in the market data. You know, I think you can imagine a trader saying, we did this trade in Apple, but I don’t see the trade that I just did in the market data that I’m receiving. So I think market data is stale. I think that alert probably would have raised. Some other takeaways, going back to timing matters. So, a pretty interesting thing in this case is nothing we did really mattered that much. You know, we reacted really fast to everything, like one minute in between each thing. I think that’s a pretty impressive timeline, but actually, all the damage had been done as of 9:31. You know, we missed the open. That is the problem. So I think a lot of the postmortem here would have involved us asking whether we could have somehow caught this earlier, like 20 minutes earlier.

(00:47:53)

If we could have caught this 20 minutes earlier with pre-open trading or some other alerting, then this quick resolution would have actually allowed us to make the open.

(00:48:05)

You can also imagine cases where, you know, maybe at some point in this incident, the trader would have said, “By the way, Fed minutes are in 20 minutes.” And again, that would have really affected the way that we handled the incident. We would have had to do things with a certain level of urgency, and we would have made different risk decisions along the way. The other thing I want to highlight is the communication between traders and tech. And again, I think this instance where the traders join the incident call and report that market data looks weird, that was the turning point in this incident. That gave us the key insight we needed to know to call in market data. And that is really common in Jane Street incidents, where from a technical perspective, sort of everything looks fine. Looks like our systems are behaving correctly. But when a trader can report that we’re doing something wrong from their perspective, that can really speed up resolution.

(00:48:58)

And again, this sanitized version of an incident maybe doesn’t make it quite so clear how useful it is to be able to talk, you know, understand each other’s languages so that the incident room isn’t just like total chaos of people not understanding what the other person is saying.

(00:49:19)

Those are really the main highlights, the takeaways that I wanted to get out of this sample incident. Just to wrap things up, the things we sort of talked about here: we rely a lot on event-based alerting. Again, we have SLO-based alerting. We have metrics-based alerting. But those tend to be more for long-term health, whereas we rely a lot on event-based alerting for the actual live trading systems, sort of the core of our alerting infrastructure. And we rely a lot on these orthogonal epistemic alerts to protect against a really broad range of failures. And then our monitoring systems are among our most critical. We make sure that they are more reliable than the trading systems they are monitoring. We would rather a trading system go down than the monitoring system go down. And we have a lot of defense in depth. For the most important systems, you just have to use multiple, again, maybe orthogonal systems to get at the same thing.

(00:50:25)

You simply can’t create a system so reliable that you can be 100% confident that it will be reliable enough. And support staff need business knowledge and internal communication to be able to make high-quality decisions and move fast in an incident. And then one last thing I want to say, just to keep myself honest. This is all a little bit of an exaggeration. There are plenty of ways in which lots of production engineering at Jane Street looks a lot like it does elsewhere. But in this talk, I did want to highlight some of the differences. There are plenty of areas in the firm where, you know, it’s not quite such a tight, hot loop of trading, and you can afford to take a more normal approach to SRE, production engineering, whatever the case may be. We’ve got a pretty wide variety of that. That is the end of my talk.

(00:51:17)

Any questions? Yeah.

Audience (00:51:19)

Hi, Mark. Thank you so much for the talk. I just had a quick question on the sample incident that you shared. You mentioned the importance of monitoring systems and how critical they are. And as we look at the sample incident, the root cause was an email from the exchange two weeks ago, which is super important. Mark: Yep. Audience: Right? So I was just curious: obviously we have fancy monitoring systems, but emails come first. So wasn’t there any monitoring or acknowledgment, someone acknowledging that email, or monitoring that this email from the exchange was ack’d or that a change had been rolled out in response to it?

Mark (00:52:07)

Yeah, great question. So this exact mistake is absolutely a mistake that we’ve made multiple times, and we try really hard to avoid making. And just the basic problem with these exchange-driven changes is we don’t need to do anything for like 99.99% of them. And unfortunately, we can’t convince the exchanges to only tell us about the ones that we care about. So they tell us about everything, and we need to spend some manual effort trying to parse them. So I know nowadays we’ve got various LLMs and stuff trying to get at some of the signal-to-noise. And we’ve got a whole process for people staring at them and trying to make the right decision. But there have been multiple cases where people have just made the wrong decision, read the email, and determined it didn’t affect us, and we were wrong about that fact. And that’s what happened in this incident.

Audience (00:52:55)

To conclude, it was a signal-to-noise issue. Mark: Yeah, yeah. Audience: Thank you.

Audience (00:53:08)

A follow-up to the incident question: how is it even possible that you timestamp the market data and start trading, and the price is from yesterday’s close? That sounds like a very important difference. And why is it the order engine that catches it? It seems like the last piece of software that would care about the staleness of market data. So that’s the first interesting question I have. And the second question I have is regarding the event-based alerts. In our firm we rely heavily on metrics-based alerts, because once you start doing event-based alerting, it requires in-depth integration, every piece of the software being able to generate all these events. How does this work at Jane Street? Is it the production engineering team’s responsibility to make sure these things are deeply integrated into the software, or is it a practice that every team has?

Mark (00:54:14)

Okay, I’m going to, I think I caught all that, but I may not have, so I’ll go first question first. Sort of like, isn’t it a giant mistake to accidentally be reading yesterday’s market data prices as today’s live market data? One thing that maybe I sort of glossed over in the beginning is we trade on a really wide range of exchanges, from super well-behaved exchanges that are extremely liquid, where this would be wildly implausible to ever happen, to pretty weird, sketchy exchanges that are really inactive most of the time, where it’s actually not that crazy that you might not have had market data in a symbol for a given day. So I think in the incident that this was based off of, it was an exchange that normally has very low volume, but due to some weird news, it became really, really high volume. Again, I think this might have been geopolitical, I forget the exact details, but I think it was some geopolitical thing where some country became very important.

(00:55:16)

And so that country’s exchange was going crazy. And we… Yeah, so basically, a thing that was normally implausible became very plausible. That is, I think, the first part of your first question. The second part is, isn’t it implausible that the order engine catches the market data bug? And again, I guess the thing I would highlight is that the error that the order engine is raising is not anything to do specifically with market data. It’s to do with some epistemic status about the world. And so the order engine didn’t notice that market data was wrong. The order engine noticed that we sent a trade that seemed too good. And that helped us identify market data. Does that answer your first question? I’m sorry, I forget your second question.

Audience (00:56:04)

Yeah, it’s like, when you start doing all these event-based things, it requires really in-depth integration into the software. So how does this work at Jane Street? Does every software team need to do these integrations, or does production engineering do some extra monitoring, for example for third-party stuff?

Mark (00:56:22)

Yeah, so I think in general, like, most of our production engineering teams or really all of our software teams tend to monitor their own systems. So we don’t have one team that is writing the core logic and another separate team that is writing the alerting around that system. The people who are developing the software are also deeply integrated into writing the alerting. Does that answer your question?

Audience (00:56:51)

Yeah, yeah. So does that mean every single team has to engineer, specifically for every kind of event, something built into the software? For us, we rely heavily on metrics alerts because it’s easy; it’s kind of hard to make every single piece of software emit event-based alerts without generating tons of noise.

Mark (00:57:22)

Yeah, I think my short answer to your question, if I’m understanding it correctly, is basically the answer is yes. Like software engineer teams, all engineering teams at Jane Street need to think really hard about all the ways in which their software can go wrong and raise alerts accordingly, and think about how the humans on the receiving end of the alert are going to respond to that alert. And yeah, I think it’s a lot of work in a lot of ways. But it does mean that we have really reliable, robust alerting that we then feel really good about and can allow us to sort of paradoxically move really fast in other ways because we are so confident in our alerting that we know that we’ll catch something if something goes wrong. I think I saw a hand. Yeah.

Audience (00:58:10)

Thanks for taking the question. I’ll stay anonymous for now. I have two questions. I think the first one is perhaps relatively straightforward for you to answer. And the second one, if you’re not comfortable answering, I can understand. Mark: Oh, okay, cool.

Audience (00:58:26)

So the first one is: do you guys have some sort of common logging standard across the different teams responsible for different components? The reason I ask this is, imagine that in your system you have some sort of risk alerting, risk monitoring, right? And there’s a separate team that’s responsible for, let’s say, generating the actual trade actions. And if something goes wrong, say the alert comes from your alerting system, and you try to look at the logs... For example, personally, I’ve experienced this: it goes into the close auction and the monitoring system says, oh, my participation rate is too high, so an alert gets spun off. And then you get the devs to try to investigate, and they look at the order system, which has one set of logs, and they try to check, say, oh, what is the auction volume at the time? So they would search for the keywords, for example, the indicative, you know, auction volume and crossing volume. Whereas when you try to cross-reference, say, oh, why did the monitoring system fire that alert, and you go to its logs, rather than saying, you know, the indicative and crossing volume, they say IEV.

(00:59:44)

So sometimes, to be able to diagnose, you need almost subject-expert knowledge of two different systems, purely because the devs decided to use different conventions and there isn’t a place to enforce those sorts of conventions. So does Jane Street have some sort of common standards?

Mark (01:00:03)

Yeah, I think the answer to that is a little nuanced, unfortunately. We have some common alerting libraries that most teams use. And actually, I find this question hard to answer; maybe this is the harder one, I know you said it would be the other way around. So yeah, I think we have common alerting libraries and some common standards, but it depends a lot on the set of alerts you're talking about. One thing that's true about Jane Street is we don't have a lot of top-down edicts. In fact, we have approximately zero top-down edicts. So really, every generator of an alert is free to do it in their own way. And that does mean that if I'm on one team and then I move to another team, I need to understand that team's conventions for alerting.

(01:00:58)

All those teams are hopefully using a relatively common set of tooling provided by our core services team, but they're free to use that tooling in their own way. So we'll have the same log levels throughout the firm, using the same logging library, but a specific convention, like what terminology you use for a particular condition, can vary team by team. That being said, a lot of the teams I'm talking about here, like the market data and order engine teams, are building market data systems and order engines for all of our desks. So if you're trying to debug something market-data-related for the equities desk, that's the same team that builds the bonds market data system; a lot of our infrastructure is shared across different trading strategies. So I think a lot of these risk-style alerts are going to come from a common framework that applies to a bunch of different desks. Does that answer that question, kind of?
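As a rough illustration of what "same log levels, team-specific terminology" can look like in practice, here is a minimal sketch using Python's standard logging module; the team names, field names, and the IEV abbreviation from the question are purely illustrative, not Jane Street's actual systems.

```python
# Hypothetical sketch of a shared logging layer: the levels and record shape
# are firm-wide, but each team chooses its own field names, which is exactly
# what makes cross-team log reading require local knowledge.

import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s %(message)s")

# Team A (order system) logs the auction state with spelled-out field names.
order_log = logging.getLogger("order_system")
order_log.info("auction_update indicative_volume=120000 crossing_volume=95000")

# Team B (risk monitor) logs the same concept under its own abbreviation.
risk_log = logging.getLogger("risk_monitor")
risk_log.info("auction_update IEV=120000")
```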

Audience (01:01:58)

Yeah.

Mark (01:01:59)

Okay. That was the easy question. Great.

Audience (01:02:02)

No, I said it's hopefully more straightforward. Audience: So the second one, again, ties back to the monitoring system you mentioned. I can imagine that Jane Street, as a firm, has an overall risk appetite or risk capacity at any point in time. Inside that, you have your individual desks or functions taking risk at any point. And then you have trade events originating from your trading systems, your different trading components or trading desks. I'd imagine that at some point you want to keep a counter of your overall aggregate risk position. So when a trade comes through, how does Jane Street handle it? Does the overall monitoring act as the final gatekeeper, a blocker where you essentially pay a computation-time cost to run the check first, at a single aggregation point, before you decide to let the order out the door? Or do you do it some other way, asynchronously?

Mark (01:03:18)

Yeah, again, I think the answer to that is a little complicated. Maybe I'll go back to a slide around here somewhere, the one about the two different kinds of alerting that we have. In this talk I have mostly been thinking about our technical health, not our deliberate trading risk. The way we tend to do it is, when you're thinking about the overall firm risk that we're putting on, that tends to live in our trading monitoring systems, which traders are looking at to make human-timescale decisions about how much firm risk we're willing to put on as a whole.

(01:04:03)

So I guess my answer is that we do a mix of the two. For some kinds of risk, there are automated systems that will turn trading off if you exceed a limit, and for other, more firm-wide risk questions, we have humans looking at them and making decisions based on what they think is safe over the course of days, weeks, months, whatever the case may be. Does that answer your question?
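To make that "mix of the two" concrete, here is a minimal sketch, with entirely hypothetical names and limits, of the difference between a hard, synchronous per-order check that can turn trading off and a running aggregate that a human reviews on a slower loop. It is not a description of Jane Street's actual risk systems.

```python
# Hypothetical sketch: a synchronous per-order limit check (automated halt)
# alongside an aggregate exposure counter that a human reviews asynchronously.

class RiskGate:
    def __init__(self, per_order_limit: float, halt_limit: float):
        self.per_order_limit = per_order_limit
        self.halt_limit = halt_limit
        self.aggregate_exposure = 0.0
        self.halted = False

    def check_order(self, notional: float) -> bool:
        """Synchronous, pre-trade: reject the order if limits are breached."""
        if self.halted or notional > self.per_order_limit:
            return False
        if self.aggregate_exposure + notional > self.halt_limit:
            self.halted = True  # automated turn-off once the hard limit is hit
            return False
        self.aggregate_exposure += notional  # humans review this aggregate
        return True                          # on a slower, human timescale

gate = RiskGate(per_order_limit=1_000_000, halt_limit=10_000_000)
print(gate.check_order(500_000))    # True: within both limits
print(gate.check_order(2_000_000))  # False: exceeds the per-order limit
```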

(01:04:31)

Dan's telling me I don't have much time. How many more questions? One more question, okay. Arbitrarily: you in front, thanks for sitting in front.

Audience (01:04:46)

Great, thank you so much for the talk so far. I come from a startup hedge fund, and it's pretty interesting to see all the sophisticated systems you have here, and the timeline where the incident started at 9:31 and you had it resolved by 9:45. It takes us about that long just to figure out what went wrong, so that's impressive. But there are two main aspects I think about here. The first is the event-based alerting. We do that sometimes, but sometimes what happens is when one thing blows up, a hundred other things blow up as well, and then you have thousands of alerts. It just multiplies, right? So how do you deal with that?

(01:05:27)

And the second question is about the order engine halts you mentioned. Making that decision, is it as easy as just saying, halt it? What if the next set of orders is way more important than the orders that have already been executed? And when you resume, what happens to the orders that came in in between? Suppose the traders still want to trade; do you just put those orders into the market, even though the market may have changed by then? So how do you deal with these two things?

Mark (01:05:56)

I'll answer the second question, or the second part, first, because my memory is bad and it's the part I remember. As a concrete technical matter, when the order engine is halted, it would reject the trading system's order, and the trading system would have internal logic knowing that it needs to resend the order if it still wants it to get filled. That's the concrete answer there. How am I so bad at… sorry, can you just repeat the questions? Very quickly.
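As a rough sketch of that reject-and-resend behavior, assuming an entirely hypothetical interface between a trading system and an order engine (none of these names are Jane Street's), the logic might look something like this. The key design choice is that the halted engine never queues orders; the trading logic re-decides whether each order is still worth sending.

```python
# Hypothetical sketch: a halted order engine rejects orders, and the trading
# system decides for itself whether the order is still worth resending once
# the engine resumes.

import time

class OrderEngine:
    def __init__(self):
        self.halted = False

    def submit(self, order: dict) -> str:
        return "REJECTED_HALTED" if self.halted else "ACCEPTED"

def send_with_resend(engine: OrderEngine, order: dict, still_wanted) -> str:
    """Resend on reject, but only while the trading logic still wants the order."""
    while True:
        status = engine.submit(order)
        if status == "ACCEPTED":
            return status
        if not still_wanted(order):
            return "DROPPED_STALE"  # market moved; the order is no longer desirable
        time.sleep(0.1)             # wait before retrying against a halted engine

engine = OrderEngine()
engine.halted = True
# still_wanted is a stand-in for whatever logic rechecks the market.
print(send_with_resend(engine, {"symbol": "XYZ", "qty": 100}, lambda o: False))
```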

Audience (01:06:25)

So the first one was when the alerts multiply… Oh, yeah, yeah, yeah. One thing breaks and it's just a domino effect sometimes.

Mark (01:06:31)

Yeah, I think we don't do great at that; that's my honest answer. I feel like we have absolutely that same problem. If anyone's solved it, talk to me afterwards, maybe we'll copy some stuff from you. I'm trying to come up with a better-sounding answer, but I don't think we do great there. I think that happens here, too.

Audience (01:06:53)

Yeah, because if that happens, I think it becomes difficult to solve an issue within like, 14 minutes, right?

Mark (01:07:00)

Let's see. It doesn't feel like it often becomes a huge problem, and I'm trying to think of why that's the case. It might be useful to think of a concrete example. In the incident I gave, no system noticed anything wrong with any system; the only alert that fired was fill-too-good. As far as we were concerned, all the systems were healthy, so we wouldn't raise a cascade of alerts, because as far as we're concerned, everything's going fine. In cases where we know there's a system problem, maybe the biggest issue is that many different systems will each raise alerts saying they're down. But it tends not to be super problematic, because, again, we rely on symptoms-based alerting.

(01:07:44)

So this system that's supposed to serve this user for this purpose is now down. That's a really valuable alert, because it lets us know to contact those users and tell them they need to change their behavior somehow, since the system isn't going to behave as expected, and so on for all the downstream systems. So I guess we don't tend to have that problem, because of the symptoms-based alerting I'm talking about. And then, I'm so bad with memory, sorry, what was your second-
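To illustrate the symptoms-based idea in the simplest possible terms, here is a hypothetical sketch: instead of alerting on every internal component that reports itself down, the alert fires on the user-visible symptom, i.e. whether a given desk can actually do the thing the system exists to do. The names and the probe are invented for illustration.

```python
# Hypothetical sketch of symptoms-based alerting: one alert per user-visible
# capability, rather than one alert per failed internal component.

def can_send_orders_for(desk: str) -> bool:
    """Stand-in for an end-to-end probe of the order path for one desk."""
    return False  # pretend the probe failed for this example

def check_symptoms(desks: list[str]) -> list[str]:
    alerts = []
    for desk in desks:
        if not can_send_orders_for(desk):
            # One actionable alert naming the affected users and the symptom,
            # instead of a cascade of "component X is down" alerts.
            alerts.append(f"PAGE: {desk} cannot send orders; notify the desk")
    return alerts

print(check_symptoms(["equities", "bonds"]))
```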

Audience (01:08:08)

No, I think that’s all, you answered.

Mark (01:08:10)

Okay. Cool. Thank you all. And apparently, that’s all the time for questions. But I’m happy to stick around and talk to anyone afterwards. Thanks.
