What I've learned building low latency systems
Published in Automated Trader Magazine Issue 41 Q4 2016
Software development for low latency trading tends to be shrouded in mystery. Development practices are often wrapped in layers of computational alchemy that tend to be impenetrable to outsiders. The industry rarely gives insights, even though it borrows heavily from other sectors to drive its own progress.
The high frequency trading world is secretive and it's not really a surprise that nobody wants to give anything away. Apart from the obvious commercial implications, there are also legal constraints: everywhere you work makes you sign non-disclosure agreements, which scares people away from talking about even the most rudimentary aspects of the technology.
And once you have worked on a few different projects or in a few different companies, you start building up a veritable collection of NDAs; that has certainly been my experience. But talking about many of the technical aspects of high frequency trading is not giving away any secrets. Sure, there may not be many people talking about it, but it's not a secret. It's simply best practice as it would be known in any other industry. After spending many years in this problem domain, I have compiled a few points that I wanted to share.
Many of the software developers who work in hedge funds and HFT firms have a gaming or communications background. This allows financial firms to draw on up-to-date knowledge that actually originates in other industries.
High frequency trading is about low latency processing and ultra-fast communication: everything is geared towards speed. From the design and prototyping stage all the way to the final implementation, speed is always the primary concern. If you manage to gain just a couple of microseconds somewhere along the way, it is considered a big deal.
To give you an idea of what kind of speeds we are talking about, let's go through various types of computer operations and how much time they typically take. This is summarized in Table 01 below.
I've discovered that in order to operate with the kind of speeds shown in the upper half of Table 01, I really do need to forget everything I have learned about modern software engineering. It requires a different mindset and the last 30 years of 'best practice development' basically go out the window: latency reigns supreme, no matter how ugly the code gets.
I still remember how I used to lay out my system designs in layers. There would be a Data Access Layer and a Business Logic Layer; there could also be a User Interface Layer. If you are designing an e-commerce platform, it doesn't matter if your code takes an additional 50 microseconds. Nobody even notices. In the end you might be happy if the checkout operation completes in a few seconds.
Choose the right language
Forget scripting languages (i.e. interpreted languages). They are simply of no use in this field. I actually get a lot of people asking me: "Hey, is Python a good language to use for HFT?". No! Python is slow. Terribly slow, in fact! Your application will be the software equivalent of a turtle. Or more accurately perhaps: it will be some kind of slow-moving snake. Pythons are not quick. Languages like Python are good for exploring and testing new ideas and doing some research, but no more than that.
When you are looking to shave those last few microseconds off your overall processing time you simply cannot have the overhead of an interpreted language. Additionally, you really want a strong memory model to enable lock-free programming so you should be looking at languages like:
- C#
- Java
- Scala
- C/C++11 (my preferred one, but that's just me)
- OCaml
- Haskell
Task | ns | µs | ms | Notes
---|---|---|---|---
L1 cache reference | 1 | | |
L2 cache reference | 3 | | | 3x L1 cache
Branch mispredict | 4 | | |
L3 cache reference | 11 | | | 11x L1 cache
Mutex lock/unlock | 25 | | |
Main memory reference | 60 | | | 20x L2 cache, 60x L1 cache
Compress 1K bytes with Zippy | 3,000 | 3 | |
Send 1K bytes over 1 Gbps network | 10,000 | 10 | |
Read 4K randomly from SSD | 150,000 | 150 | | ~1 GB/sec SSD
Read 1 MB sequentially from memory | 250,000 | 250 | |
Round trip within same datacenter | 500,000 | 500 | |
Read 1 MB sequentially from SSD | 1,000,000 | 1,000 | 1 | ~1 GB/sec SSD, 4x memory
Disk seek | 10,000,000 | 10,000 | 10 | 20x datacenter round trip
Read 1 MB sequentially from disk | 20,000,000 | 20,000 | 20 | 80x memory, 20x SSD
Send packet CA → Netherlands → CA | 150,000,000 | 150,000 | 150 |

1 ns = 10⁻⁹ seconds | 1 µs = 10⁻⁶ seconds = 1,000 ns | 1 ms = 10⁻³ seconds = 1,000 µs = 1,000,000 ns

Table 01: Latency Numbers in Modern Computers (2016)
Abstraction
Forget encapsulation and making your code nice, clean and reusable. Always remember that your goal is to make your code 'as fast as possible', not nice or easy to use.
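As one illustration, here is a minimal sketch (the names are hypothetical, not from any particular system) of trading a clean abstract interface for a template parameter, so the hot-path call can be inlined instead of going through a virtual dispatch:

```cpp
// A sketch of static vs. dynamic dispatch; illustrative names only.
#include <cstdint>

struct OrderEvent { std::int64_t price; std::int64_t qty; };

// The "clean" design: an abstract handler, paid for with a vtable indirection.
struct IHandler {
    virtual void on_order(const OrderEvent&) = 0;
    virtual ~IHandler() = default;
};

// The low-latency variant: the concrete handler type is known at compile
// time, so the call below can be inlined and there is no vtable lookup.
template <typename Handler>
void process(Handler& h, const OrderEvent& ev) {
    h.on_order(ev);  // direct, inlinable call
}

struct FastHandler {
    std::int64_t acc = 0;
    void on_order(const OrderEvent& ev) { acc += ev.price * ev.qty; }
};

int main() {
    FastHandler h;
    process(h, OrderEvent{100, 5});  // resolved statically
}
```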
Keep it all in memory
Read everything at the beginning of your program. Forget about databases or active persistence; they will add huge latencies, giving you enough time to whack a golf ball or two between two orders.
Anything related to I/O will simply destroy your overall latency, so make sure all of your data is pre-loaded (a warm-up cycle) and kept entirely in memory.
This generally means managing your own in-memory data structures and maintaining a persistent log, so that you can rebuild the state after a machine or process restart.
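A minimal sketch of that pattern, with a hypothetical event type and no error handling or fsync/rotation policy: the live state lives in ordinary in-memory structures, and every mutation is appended to a log that can be replayed on restart.

```cpp
// Sketch only: in-memory state plus an append-only log for state rebuild.
#include <cstdint>
#include <fstream>
#include <unordered_map>

struct Fill { std::int32_t instrument; std::int64_t qty; };  // hypothetical event

class PositionStore {
    std::unordered_map<std::int32_t, std::int64_t> positions_;  // hot, in-memory state
    std::ofstream log_;
public:
    explicit PositionStore(const char* path)
        : log_(path, std::ios::binary | std::ios::app) {}

    void apply(const Fill& f) {
        positions_[f.instrument] += f.qty;                        // update in memory
        log_.write(reinterpret_cast<const char*>(&f), sizeof f);  // append to log
    }

    // On restart: replay the log to rebuild the in-memory state.
    void replay(const char* path) {
        std::ifstream in(path, std::ios::binary);
        Fill f;
        while (in.read(reinterpret_cast<char*>(&f), sizeof f))
            positions_[f.instrument] += f.qty;
    }

    std::int64_t position(std::int32_t instrument) const {
        auto it = positions_.find(instrument);
        return it == positions_.end() ? 0 : it->second;
    }
};
```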
Alternatively, you might (just might) be able to get away with running a local, persisted in-memory database like Redis or MongoDB (with available memory exceeding the data requirements). Note that you can still lose data should something unexpected happen while it is being persisted to disk in the background, but that is the price you have to pay (also see the section on asynchronous operations at the end of the article).
For best performance I would recommend not having any database/disk access on your critical path.
Keep your hardware underutilized
Low latency requires always having resources freely available to process the request: free CPU cores, free memory and so on. When setting up your server, make sure you will have enough capacity for your intended use. If you know that your system will use 24 CPU cores and 20 GB of memory, provision at least 30% additional resources to have enough room.
Do not try to run at the limit of what your hardware can handle. Always have plenty of headroom to allow for bursts. Your system will eventually grow and will otherwise slow down at those all-important peak times. Make sure you keep plenty of spare resources available.
Keep context switching to a minimum
In concurrent applications you will end up using multiple threads to handle workloads simultaneously. Context switches are a sign that you are doing more compute work than you have resources for. Usually the operating system will handle running those threads and will decide for you which CPU core takes which thread. Now, if you are running multiple threads at the same time, as in most HFT systems, the operating system will schedule alternate threads on the CPU in order to give them all a chance to execute. Every time the operating system instructs a CPU to switch from one thread to another, the CPU needs to save the current thread's state (in order to resume it later) and load the state of the next scheduled thread. This can kill your processing performance.
My best approach has been to pin each critical thread to a specific core and keep it busy spinning. That way I avoid triggering context switches and incurring excessive cache misses (as explained in the next point).
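Here is a minimal Linux-specific sketch of that idea (it assumes glibc and pthreads, so compile with -pthread -D_GNU_SOURCE; the core number is an arbitrary choice for illustration):

```cpp
// Sketch only: pin the calling thread to one core, then busy-spin on a flag.
#include <atomic>
#include <pthread.h>
#include <sched.h>

std::atomic<bool> work_ready{false};

// Pin the calling thread to a single core so the scheduler never migrates it.
void pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);  // errors ignored here
}

void critical_thread() {
    pin_to_core(2);  // dedicate core 2 to this thread (hypothetical choice)
    while (!work_ready.load(std::memory_order_acquire)) {
        // Busy spin: burns a core, but no context switch is ever triggered
        // and the thread's working set stays warm in that core's caches.
    }
    // ... process the work ...
}
```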
Keep your reads sequential
Make sure you are using the CPU caches (L1/L2). All forms of storage perform significantly better when used sequentially. When issuing sequential reads to memory you also trigger a pre-fetch from main memory. If done properly, the next piece of data you need will already be in the L1 cache right before you need it. The easiest way to help the CPU out with this process is to make heavy use of arrays of either primitive data types or structs.
Stepping through a sequence of reference pointers, either through the use of linked lists or through arrays of objects, should be avoided at all costs.
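A minimal sketch contrasting the two layouts (illustrative types): iterating a contiguous array of plain structs lets the hardware prefetcher stream data into the cache ahead of use, while chasing pointers through a linked list defeats it.

```cpp
// Sketch only: cache-friendly array of structs vs. cache-hostile linked list.
#include <cstdint>
#include <vector>

struct Quote { std::int64_t bid; std::int64_t ask; };

// Cache-friendly: one contiguous block, predictable stride.
std::int64_t sum_spreads(const std::vector<Quote>& quotes) {
    std::int64_t total = 0;
    for (const Quote& q : quotes)   // sequential access, prefetcher-friendly
        total += q.ask - q.bid;
    return total;
}

// Cache-hostile: every step is a dependent load from an arbitrary address.
struct QuoteNode { Quote q; QuoteNode* next; };
std::int64_t sum_spreads(const QuoteNode* head) {
    std::int64_t total = 0;
    for (; head != nullptr; head = head->next)  // pointer chasing
        total += head->q.ask - head->q.bid;
    return total;
}
```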
Use non-blocking calls where possible
Design your system to use non-blocking and wait-free data structures and algorithms wherever possible. Every time you use a lock you have to reach down to the OS to arbitrate the lock, which again comes with a huge overhead. If you know what you are doing then you can often get around locks by understanding the memory model of the CLR, the JVM or the C++11 runtime.
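As a sketch of what this looks like in C++11 (illustrative, not a production implementation): a single-producer/single-consumer ring buffer that hands data between two threads using acquire/release atomics, with no lock and no call into the OS. The capacity must be a power of two.

```cpp
// Sketch only: lock-free SPSC ring buffer using C++11 atomics.
#include <atomic>
#include <cstddef>

template <typename T, std::size_t N>  // N must be a power of two
class SpscQueue {
    T buffer_[N];
    std::atomic<std::size_t> head_{0};  // advanced by the consumer
    std::atomic<std::size_t> tail_{0};  // advanced by the producer
public:
    bool try_push(const T& item) {        // producer thread only
        const std::size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N)
            return false;                 // full
        buffer_[t & (N - 1)] = item;
        tail_.store(t + 1, std::memory_order_release);  // publish the slot
        return true;
    }
    bool try_pop(T& item) {               // consumer thread only
        const std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;                 // empty
        item = buffer_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);  // release the slot
        return true;
    }
};
```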
Avoid exception handling
Yes, avoid it! It's expensive. Exception handling adds at least 10-20% overhead. Compilers need to emit additional code and implement additional stack management to handle exceptions, and that costs time. And before somebody tells me how GCC uses the 'zero-cost' exception model: I suggest you profile your application and measure it. Remember: each microsecond counts.
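A minimal sketch of the alternative (the names are illustrative): signal failures with explicit status codes on the hot path, and compile those translation units with exceptions disabled, e.g. g++ -fno-exceptions, so no unwinding tables or handler code are generated for them.

```cpp
// Sketch only: status codes instead of exceptions on the hot path.
#include <cstdint>

enum class SendResult : std::uint8_t { Ok, WouldBlock, Rejected };

SendResult send_order(std::int64_t price, std::int64_t qty) noexcept {
    if (qty <= 0)
        return SendResult::Rejected;   // failure reported in the return value
    // ... write the order to the wire ...
    return SendResult::Ok;
}

// The caller branches on the code; no try/catch, no stack unwinding machinery.
bool trade() noexcept {
    return send_order(10050, 100) == SendResult::Ok;
}
```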
Async as much as possible
Any processing and particularly any I/O that is not absolutely necessary for building the response message should be done outside of the critical path. For instance, if your system includes logging transactions to disk or sending transactions to a secondary server, then those actions can happen asynchronously. You don't have to wait until this has completed to continue on to the next instruction, as this will just block your execution path.
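For instance, a minimal sketch of asynchronous logging (hypothetical record type and file name, reusing the SPSC queue sketched earlier): the critical path only enqueues the record; a background thread drains the queue and performs the disk I/O. A real system would also need a shutdown protocol and a policy for a full queue.

```cpp
// Sketch only: move logging I/O off the critical path via a queue.
#include <chrono>
#include <cstdint>
#include <fstream>
#include <thread>

struct LogRecord { std::int64_t order_id; std::int64_t latency_ns; };

SpscQueue<LogRecord, 1024> log_queue;   // SpscQueue from the earlier sketch

void hot_path(std::int64_t id, std::int64_t ns) {
    log_queue.try_push(LogRecord{id, ns});  // nanoseconds; never blocks
}

void background_writer() {
    std::ofstream out("trades.log", std::ios::app);
    LogRecord rec;
    for (;;) {
        while (log_queue.try_pop(rec))      // drain whatever has accumulated
            out << rec.order_id << ' ' << rec.latency_ns << '\n';
        std::this_thread::sleep_for(std::chrono::microseconds(50));  // off the hot path
    }
}
```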
As you can see, there is more to coding in this problem domain than the code itself. While writing every line, you have to keep in mind what is really happening behind the scenes.
One last piece of advice: Don't just take anything for granted. Perform benchmarks on different approaches and look at how they differ and why. Not only will you learn something, but you will eventually find the best solution for your problem.
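In that spirit, here is a minimal timing sketch using std::chrono. Warm-up runs, core pinning and looking at the full latency distribution (not just the mean) are left out; the volatile sink merely stops the compiler from deleting the loop.

```cpp
// Sketch only: a bare-bones micro-benchmark harness.
#include <chrono>
#include <cstdint>
#include <cstdio>

volatile std::int64_t sink;  // prevents the compiler from removing the work

int main() {
    const int iterations = 1000000;
    std::int64_t acc = 0;

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        acc += i * 3;                       // the operation under test
    const auto stop = std::chrono::steady_clock::now();
    sink = acc;

    const auto ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    std::printf("%.2f ns per iteration\n", static_cast<double>(ns) / iterations);
}
```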