Performance engineering at Jane Street
We’re constantly chasing better performance at multiple scales, whether we’re shaving nanoseconds off the critical path in a trading system or optimizing the training loop for an ML model.
See below for examples of some performance problems we face and the tools we’ve invented to solve them.
Keeping up with the market
As markets grow, our trading infrastructure needs to process ever-growing amounts of data in ever-shorter time windows. That’s why we build highly optimized packet-processing systems capable of handling millions of multicast messages per second on a single core. Building this kind of system requires a disciplined approach to measurement, a focus on determinism and tail events, and a good dose of mechanical sympathy.
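To make that concrete, here is a minimal sketch (not our production code) of the kind of allocation-free receive loop such a system is built around: one preallocated buffer and a tight loop over a socket that we assume has already been bound and joined to the multicast group elsewhere; handle_message is a hypothetical callback.

```ocaml
(* Sketch of an allocation-free UDP receive loop. The socket is assumed to be
   set up (bound, multicast group joined) elsewhere; [handle_message] is a
   hypothetical callback. *)
let rec receive_loop socket buffer ~handle_message =
  (* [Unix.recv] fills [buffer] in place, so the hot path allocates nothing. *)
  let len = Unix.recv socket buffer 0 (Bytes.length buffer) [] in
  handle_message buffer ~len;
  receive_loop socket buffer ~handle_message

let run socket ~handle_message =
  (* One buffer, allocated once, reused for every message. *)
  receive_loop socket (Bytes.create 2048) ~handle_message
```

Keeping allocation off the hot path is only one ingredient; the measurement discipline mentioned above matters at least as much.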
Performance isn’t just important for the most latency-sensitive trading. We’ve built a distributed systems framework based on state machine replication (and inspired by the architecture of financial exchanges) which provides high throughput, low latency, and strong reliability guarantees to a wide variety of internal applications. The architecture of this system depends on a very high-performance backbone for sequencing, distributing, and filtering the transactions that drive these applications.
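The core idea is simple enough to sketch: every replica applies the same sequenced stream of transactions to a deterministic transition function, so all replicas converge on the same state. The names below (Sequenced, Replica, apply) are illustrative, not our actual API.

```ocaml
(* Illustrative state-machine-replication skeleton; not Jane Street's API. *)
module Sequenced = struct
  type 'a t = { seqnum : int; payload : 'a }
end

module Replica (M : sig
  type state
  type transaction

  (* Must be deterministic: the same state and transaction must produce the
     same next state on every replica. *)
  val apply : state -> transaction -> state
end) = struct
  type t = { mutable state : M.state; mutable next_seqnum : int }

  let create initial_state = { state = initial_state; next_seqnum = 0 }

  (* Apply transactions strictly in sequence order; a gap means a message was
     missed and must be recovered before processing can continue. *)
  let handle t (txn : M.transaction Sequenced.t) =
    if txn.Sequenced.seqnum <> t.next_seqnum
    then failwith "sequence gap: request retransmission"
    else (
      t.state <- M.apply t.state txn.Sequenced.payload;
      t.next_seqnum <- t.next_seqnum + 1)
end
```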
At the lowest level, we use FPGA accelerators as a way of achieving performance that can’t be gotten from CPUs alone. We lead development of Hardcaml (“Hardcaml: An OCaml Hardware Domain-Specific Language for Efficient and Robust Design”, arxiv.org), an open-source hardware design library (github.com/janestreet/hardcaml). By building our own tools, we’ve been able to create a highly productive hardware design workflow, with fast feedback for engineers and integrated simulation and testing. If you’re interested in seeing Hardcaml at work, some of our engineers used it to win the 2022 ZPrize competition for accelerating zero-knowledge cryptography, and we’ve posted the detailed results at zprize.hardcaml.com.
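For a flavor of what designing hardware in OCaml looks like, here is a tiny circuit along the lines of Hardcaml’s introductory examples; exact module paths and entry points may differ between versions.

```ocaml
(* An 8-bit adder described as ordinary OCaml values, then emitted as Verilog.
   Adapted from the style of Hardcaml's introductory documentation. *)
open Hardcaml
open Signal

let sum =
  let a = input "a" 8 in
  let b = input "b" 8 in
  output "sum" (a +: b)

let () =
  let circuit = Circuit.create_exn ~name:"adder" [ sum ] in
  Rtl.print Verilog circuit
```

Because circuits are ordinary OCaml values, the same description can be simulated and tested in the integrated workflow mentioned above before it ever touches an FPGA.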
Accelerating machine learning
We do a lot of machine learning, and performance engineering is a critical part of that work. Making good use of our GPU clusters requires careful profiling and optimization of our training runs across the whole stack, from storage to network to host.
In most of the ML world, inference is largely a throughput problem, with responses aimed at human timescales. Because our models drive microsecond-scale trading, we need to architect for latencies far below those that are typical for ML workflows, while handling high-throughput market data. This leads us towards a variety of techniques, from writing heavily optimized CUDA code that stretches the bounds of what GPUs were designed for, to leveraging custom hardware, to writing our own compilers.
Industry-leading tools for performance debugging
Magic-trace
We developed magic-trace, a powerful open-source tool for collecting and displaying high-resolution traces of what a process is doing. It’s useful not only for detailed performance debugging, but also simply for understanding your program. Magic-trace uses Intel Processor Trace (Intel PT) to snapshot a ring buffer of all control flow leading up to a chosen point in time, which it then presents to users in an interactive timeline.
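One common pattern is to trigger a snapshot from inside the program when something interesting happens. The sketch below assumes the optional in-program hook (Magic_trace.take_snapshot) that magic-trace’s documentation describes; treat the exact name and threshold-based trigger as assumptions, not a prescribed workflow.

```ocaml
(* Sketch: snapshot the recent control-flow ring buffer whenever a call runs
   slower than expected. [Magic_trace.take_snapshot] is the in-program hook we
   assume magic-trace provides; see the lead-in above. *)
let with_snapshot_if_slow ~threshold_s f =
  let start = Unix.gettimeofday () in
  let result = f () in
  if Unix.gettimeofday () -. start > threshold_s
  then Magic_trace.take_snapshot ();
  result
```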
Memtrace
We also built memtrace, a tool for understanding memory usage and finding leaks. Memtrace builds on OCaml’s statistical memory profiler to get callbacks on GC events for a sample of a program’s allocations. The Memtrace viewer then analyzes these events and presents graphical views of them, as well as filters for interactively narrowing the view until the source of the memory problem becomes clear.
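Getting a trace is deliberately low-friction: per memtrace’s README, a program opts in with a single call, and tracing only activates when the MEMTRACE environment variable names an output file.

```ocaml
(* Typical memtrace setup: call this once at startup. Running the program as
   MEMTRACE=trace.ctf ./my_program streams sampled allocation events to
   trace.ctf, which the memtrace viewer can then load. *)
let () =
  Memtrace.trace_if_requested ();
  (* ... the rest of the application runs unchanged ... *)
  print_endline "application running"
```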
Pushing programming language design for high performance
We write our lowest-latency software systems in OCaml, which combines a powerful type system with good, predictable performance and a low-overhead runtime. Over the last couple of years, Jane Street has developed major extensions to OCaml, in particular:
- The addition of modal types (icfp24.sigplan.org) opens up a variety of ambitious features, like memory-safe stack allocation, type-level tracking of effects, and data-race freedom guarantees for multicore code.
- Unboxed types (github.com) give more control over memory representation, in particular allowing structured data to be laid out in a cache- and prefetch-friendly tabular form.
Together, these extensions bring some of Rust’s best tools for writing high-performance code into a simpler, more ergonomic type system, preserving the relative ease of programming in OCaml.
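As a taste of what stack allocation looks like in practice, here is a sketch in the style of our public posts on the locality extension; the concrete syntax is still evolving, so treat it as illustrative rather than definitive.

```ocaml
(* Sketch only: the locality extension's syntax may differ from what ships.
   [local_] asserts that the intermediate pair never escapes this function,
   so the compiler can place it on the stack rather than the GC heap. *)
let manhattan_distance (x1, y1) (x2, y2) =
  let local_ delta = (abs (x1 - x2), abs (y1 - y2)) in
  fst delta + snd delta
```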
We’ve also made a wide variety of other improvements to make OCaml more efficient, from adding prefetching to the GC during marking (github.com) to improving code generation and register allocation in the compiler’s middle end and back end.