The Gateway to Algorithmic and Automated Trading

The kerf time series database system

Published in Automated Trader Magazine Issue 39 Q2 2016

Today's automated traders have many difficult data and infrastructure problems that revolve around large and complex time series. Increasingly, they turn to specialized tools to address these challenges.

Capturing live data can be a challenge, though most sophisticated traders have some solution to this problem. Research on the captured data is often the real issue: standard database technology does not generally scale well, nor does it provide the types of operations that interest quantitative researchers. Integrating custom data stores with programming languages such as Python, which are widely used in quantitative research, presents its own challenges.

There are a few systems on the market which assist traders with these kinds of problems. The two best known are Onetick and Kx Systems. A new entry in the market is Kerf. Kerf is an old English word meaning the width of a saw cut; the creators clearly hope it is a name which represents relentless and precise sawing through data.

Kerf is a time series database with an integrated vector programming language. Kerf Software was founded by Kevin Lawler and Scott Locklin in Berkeley, California in 2015. Kevin Lawler is a veteran programming language designer, author of the open-source Kona language as well as the iOS spreadsheet software Permanent. Scott Locklin is active in the array programming community and has been working as a data scientist and quant.

Kerf, like Kx Systems' kdb database, is related to the APL family of array-oriented languages. APL languages have powerful primitives for data analysis; the core operators of APL can be simply combined to create database queries, and vector-wise operations can provide very high performance.

While most APLs emphasize terseness at the expense of readability, Kerf emphasizes readability and provides more familiar landmarks to programmers who are used to conventional programming languages. One of the surprises that Kerf bestows on its users is its JSON-style notation for specifying data. Maps (hashes in Python) have the same notation as JSON. Assignment is done with the ':' character rather than '=', which is instead mostly used for comparison.

Setup

Setting up Kerf is straightforward. There is a stand-alone binary, a PDF manual and an emacs mode (the manual and emacs mode are also available on the company's website). The binary can be run from the command line or in server mode, and can be placed on the path just like any other executable. There are no shared object dependencies. Binaries are presently available for various flavors of Unix: Linux and OS X are the primary platforms, and Solaris and the BSD family can also be supported. Windows binaries are not yet available, but they are on the product roadmap. We conducted the following tests on OS X.

Data types and manipulation

We can put all the usual datatypes into a map. Strings, integers, dates, floating point numbers, etc.:

mymap: {names: ["Koenig", "Bowman", "nexus6"], amount: [4, 5, 3], dates: [1999.09.13, 2001.01.01, 2016.01.07], tag: "movies"}

mymap['amount']

[4, 5, 3]
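For readers coming from Python, the analogy is direct: the same structure is essentially a dictionary of lists. A rough sketch (dates are plain strings here, whereas Kerf parses them as native timestamps):

mymap = {"names": ["Koenig", "Bowman", "nexus6"],
         "amount": [4, 5, 3],
         "dates": ["1999.09.13", "2001.01.01", "2016.01.07"],
         "tag": "movies"}

mymap["amount"]    # [4, 5, 3]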

Tables in Kerf are interned maps where all of the elements have the same length:

mytable: {{names: ["Koenig", "Bowman", "Nexus6", "Hal9k"], amount: [4, 5, 3, 7], dates: [1999.09.13, 2001.01.01, 2016.01.07, 1999.01.12], kind: ["h", "h", "c", "c"]}}

names amount dates kind
Koenig 4 1999.09.13 h
Bowman 5 2001.01.01 h
Nexus6 3 2016.01.07 c
Hal9k 7 1999.01.12 c

Tables can be held in memory, or stored as mmapped persistent objects. The 'striped table' is the storage mechanism of choice for larger data sets; it stores individual columns as separate mmapped 'stripes' on the file system. As long as available memory is larger than a column, performance should be high. This is the basic technology of all column-based databases.
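The general idea is easy to sketch in a few lines of Python with numpy's memory-mapped arrays: one file per column, paged in by the operating system only as it is touched. This is purely illustrative of the column-store approach, not Kerf's actual on-disk layout; the file name and dtype are our own assumptions.

import numpy as np

# write one "stripe" per column (illustrative only, not Kerf's format)
amount = np.memmap("amount.stripe", dtype="int64", mode="w+", shape=(4,))
amount[:] = [4, 5, 3, 7]
amount.flush()

# reopen read-only; the operating system pages the data in on demand
amount_ro = np.memmap("amount.stripe", dtype="int64", mode="r", shape=(4,))
print(amount_ro.mean())    # 4.75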

The tables can be queried with fairly standard SQL syntax:

select * from mytable where amount = 5

names amount dates kind
Bowman 5 2001.01.01 h

select avg(amount) as average from mytable group by kind

kind average
h 4.5
c 5.0
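For those who want to sanity-check results against a tool they already know, the same aggregation in pandas looks like this (the frame and column names are ours):

import pandas as pd

mytable = pd.DataFrame({"names": ["Koenig", "Bowman", "Nexus6", "Hal9k"],
                        "amount": [4, 5, 3, 7],
                        "kind": ["h", "h", "c", "c"]})
mytable.groupby("kind")["amount"].mean()
# kind
# c    5.0
# h    4.5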

select dates,names from mytable where dates < 2000.01.01

dates names
1999.09.13 Koenig
1999.01.12 Hal9k

Growing a table also uses standard SQL syntax:

insert into mytable values {names:'Bond',amount:6,dates:2011.01.01,kind:'a'}
mytable

names amount dates kind
Koenig 4 1999.09.13 h
Bowman 5 2001.01.01 h
Nexus6 3 2016.01.07 c
Hal9k 7 1999.01.12 c
Bond 6 2011.01.01 a

For standard SQL joins, left_join is provided as a function. This is a little different from something like PostgreSQL, but is still fairly straightforward.

genre:{{kind:['a','h','c'],descr:['agent','human', 'artificial']}}

kind descr
a agent
h human
c artificial

left_join(mytable,genre,'kind')

names amount dates kind descr
Koenig 4 1999.09.13 h human
Bowman 5 2001.01.01 h human
Nexus6 3 2016.01.07 c artificial
Hal9k 7 1999.01.12 c artificial
Bond 6 2011.01.01 a agent
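Again, a pandas equivalent is a one-liner, which makes results easy to cross-check (the frames here are cut down to the relevant columns):

import pandas as pd

mytable = pd.DataFrame({"names": ["Koenig", "Bowman", "Nexus6", "Hal9k", "Bond"],
                        "kind": ["h", "h", "c", "c", "a"]})
genre = pd.DataFrame({"kind": ["a", "h", "c"],
                      "descr": ["agent", "human", "artificial"]})
# keep every row of mytable, pulling in descr where kind matches
mytable.merge(genre, on="kind", how="left")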

Timestamps, as can be seen above, are included as a native type, with convenient methods for aggregating and accessing the numeric parts of the timestamp. The Kerf timestamp type has nanosecond resolution (which is going to be important to users subject to MiFID II, as discussed on page 34):

a: 2016.01.16 + 3m1d5h12i2s

2016.04.17T05:12:02.000

a['minute']

12

a['month']

4
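The same calendar arithmetic can be reproduced with Python's dateutil to confirm the result, adding 3 months, 1 day, 5 hours, 12 minutes and 2 seconds to 2016-01-16:

from datetime import datetime
from dateutil.relativedelta import relativedelta

a = datetime(2016, 1, 16) + relativedelta(months=3, days=1, hours=5, minutes=12, seconds=2)
print(a)           # 2016-04-17 05:12:02
print(a.minute)    # 12
print(a.month)     # 4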

Time series databases can achieve very high performance using sorted columns. One of the most useful tools for a true TSDB is the 'as of' join, which can be used to calculate bid/ask spreads in order books, apply stock splits or dividend adjustments, or calculate the midpoint or percent spread at the time of a trade, as shown in the example below.

KeRF> trades

symbol timedate exprice exsize
IBM 2014.01.01T00:00:00.000 20.57 7100
IBM 2014.01.01T00:00:10.000 20.44 6300
IBM 2014.01.01T00:00:20.000 20.47 9900
IBM 2014.01.01T00:00:30.000 20.55 8100
IBM 2014.01.01T00:00:40.000 20.74 2700
IBM 2014.01.01T00:00:50.000 20.71 4400
IBM 2014.01.01T00:01:00.000 20.74 7900
IBM 2014.01.01T00:01:10.000 20.36 700

KeRF> quotes

symbol bid ask timedate
IBM 20.06 20.52 2014.01.01
IBM 20.28 20.63 2014.01.01T00:00:01.000
IBM 20.11 20.69 2014.01.01T00:00:02.000
IBM 20.33 20.59 2014.01.01T00:00:03.000
IBM 20.30 20.99 2014.01.01T00:00:04.000
IBM 20.11 20.52 2014.01.01T00:00:05.000
IBM 20.20 20.84 2014.01.01T00:00:06.000
IBM 20.29 20.85 2014.01.01T00:00:07.000

KeRF> select (bid + ask)/2 as midpoint,timedate from asof_join(trades,quotes,"symbol","timedate")

midpoint timedate
20.290 2014.01.01
20.440 2014.01.01T00:00:10.000
20.435 2014.01.01T00:00:20.000
20.495 2014.01.01T00:00:30.000
20.435 2014.01.01T00:00:40.000
20.460 2014.01.01T00:00:50.000
20.515 2014.01.01T00:01:00.000
20.420 2014.01.01T00:01:10.000

KeRF> select (ask-bid)/ask as pctspread,timedate from asof_join(trades,quotes,"symbol","timedate")

pctspread timedate
0.0224172 2014.01.01
0.0184198 2014.01.01T00:00:10.000
0.0284612 2014.01.01T00:00:20.000
0.0159729 2014.01.01T00:00:30.000
0.0160194 2014.01.01T00:00:40.000
0.0355257 2014.01.01T00:00:50.000
0.0121124 2014.01.01T00:01:00.000
0.0136187 2014.01.01T00:01:10.000
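pandas offers merge_asof for the same pattern, which is a handy way to cross-check asof_join output on a small sample. The sketch below uses just enough of the tables above to reproduce the first midpoint; reproducing the later rows would need the full quotes history:

import pandas as pd

trades = pd.DataFrame({"symbol": ["IBM"],
                       "timedate": pd.to_datetime(["2014-01-01 00:00:00"]),
                       "exprice": [20.57], "exsize": [7100]})
quotes = pd.DataFrame({"symbol": ["IBM", "IBM"],
                       "bid": [20.06, 20.28], "ask": [20.52, 20.63],
                       "timedate": pd.to_datetime(["2014-01-01 00:00:00",
                                                   "2014-01-01 00:00:01"])})
# most recent quote at or before each trade time, per symbol
joined = pd.merge_asof(trades, quotes, on="timedate", by="symbol")
(joined["bid"] + joined["ask"]) / 2    # 20.29, the first midpoint above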

Functions

One of the features that sets Kerf apart from some other time series and columnar database systems is that it lets users define new mathematical functions in a high-level programming language.

Functions are defined in a familiar way, with the Python-esque 'def' keyword available for function declarations:

def covariance(x,y) {
  avg(x*y) - (avg(x) * avg(y))
}

covariance(rand(100,1.0),rand(100,1.0))

-0.0128138

Another way to define functions is with 'function':

function correlation(x,y){
  cov(x,y) / (std(x) * std(y))
}
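Both definitions are the standard population formulas, and are easy to mirror in numpy as a sanity check (here we assume Kerf's cov and std builtins are the population versions):

import numpy as np

def covariance(x, y):
    # same formula as the Kerf definition above: E[xy] - E[x]E[y]
    return np.mean(x * y) - np.mean(x) * np.mean(y)

def correlation(x, y):
    # numpy's std defaults to the population form, matching the formula above
    return covariance(x, y) / (np.std(x) * np.std(y))

x, y = np.random.rand(100), np.random.rand(100)
covariance(x, y), correlation(x, y)    # both small for independent uniforms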

This ability to define new functions allows for database operations that are not built into the core language. Kerf's roots in the APL family make function application vector-wise by default, where most programming languages rely on looping constructs. For cases where something other than simple vector application is desired, Kerf provides combinator constructs that modify how functions are applied to data.

For example, the mapback combinator applies a function to each element of a list together with its predecessor, so log returns can be calculated using this combinator:

prices: 15 + rand(1000,0.01) + 0.0005*range(1000) // generate some artificial "prices"

[15.45, 15.44, 15.07, 15.06, 15.36, 15.48, 15.38, 15.05, 15.08, 15.46, 15.03, 15.27, 15.17, 15.4, 15.12, 15.44, 15.16, 15.07, 15.32, 15.23, 15.08, 15.27, 15.07, 15.18, 15.49, 15.11, 15.24, 15.37, 15.47, 15.26, 15.3, 15.04, 15.15, 15.36, 15.36, 15.07, 15.08, 15.27, 15.44, 15.45, 15.27, ...]

rets: log / mapback prices

rets[0]: 0 // this sets the first element to zero

[0, -0.000281188, -0.010534, -0.00028828, 0.00856624, 0.00337974, -0.00281462, -0.00941984, 0.000864842, 0.0108081, -0.0122505, 0.00688006, -0.00285346, 0.00653514, -0.00796893, 0.0090955, -0.00794809, -0.00258595, 0.00714551, -0.00255886, -0.00429856, 0.0054377, -0.00572578, 0.00315852, 0.00877965, -0.010787, 0.0037205, 0.0036889, 0.00281645, -0.00593578, 0.0011369, -0.00744359, 0.0031648, 0.00597858, 0, -0.00827796, 0.000288089, 0.0054377, 0.00480826, 0.000281188, -0.00508945, ...]
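The same calculation in numpy uses a shifted ratio in place of the mapback combinator (variable names are ours):

import numpy as np

prices = 15 + np.random.rand(1000) * 0.01 + 0.0005 * np.arange(1000)  # artificial "prices"
rets = np.log(prices[1:] / prices[:-1])    # each price over its predecessor, then log
rets = np.insert(rets, 0, 0.0)             # first element set to zero, as above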

To calculate cumulative returns, the fold combinator can be used. Fold inserts the verb to its left in between all of the elements in the list:

+ fold rets

-0.00423705

log(last(prices)) - log(first(prices))

-0.00423705
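In Python the equivalent of fold is functools.reduce (or simply sum when the verb is addition), and the telescoping identity above can be checked directly:

import math
from functools import reduce
from operator import add

prices = [15.45, 15.44, 15.07, 15.06]                        # any short price path will do
rets = [0.0] + [math.log(b / a) for a, b in zip(prices, prices[1:])]
reduce(add, rets)                                            # the Python fold; same as sum(rets)
math.log(prices[-1]) - math.log(prices[0])                   # identical up to rounding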

Elementwise application of a user-defined function can also be done with mapright:

def posneg(x){ if(x < 0) {-1} else {1} }

posneg mapright rets

[1, -1, -1, -1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1, -1, 1, -1, -1, 1, -1, -1, 1, -1, 1, 1, -1, 1, 1, 1, -1, 1, -1, 1, 1, 1, -1, 1, 1, 1, 1, -1, ...]
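The numpy analog of applying posneg elementwise is a one-liner (assuming, as the first element of the output above suggests, that zero maps to 1):

import numpy as np

rets = np.array([0.0, -0.0002, 0.0086, -0.0094])    # a few returns for illustration
np.where(rets >= 0, 1, -1)                          # array([ 1, -1,  1, -1])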

User interface

The native user interface for Kerf is the command line. 'Arrow up' and 'arrow down' work as history keys for rotating through previously executed commands. Data which doesn't fit onto the screen can be paged: hitting 'enter' at the prompt shows further lines of data, while any other key stops the paged display of query results. This is unusual, as most pagers use the spacebar to display more text, but the behaviour ends up being very handy for displaying long tables and is fairly intuitive. The help functionality seems to use the same paradigm; since Kerf is a database-oriented programming language, it makes sense that the help tool is itself a queryable database.

help("ipc")

usage description
open_socket(host, port) Host and port must be strings.
close_socket(handle) Close an IPC socket handle.
send_async(handle, expr) Doesn't block, returns 1.
send_sync(handle, expr) Returns eval(y) on remote host.

There is also a usable emacs mode with syntax highlighting for those who develop code in that editor. The user interface is fairly typical for an interactive programming language; the experience is similar to what you would encounter with R or Python without a custom IDE.

Connectors

Kerf has connectors for Bloomberg B-PIPE and Quandl, and a native import facility for NYSE TAQ formatted data files, with more data feed connectors in progress. There are built-in facilities for interprocess communication, so data feeds can be attached to Kerf processes acting as ticker plants, and Kerf processes can be spread across a network for large data processing jobs. Kerf also has connectors for C/C++, Java and Python.

The use of JSON as a native type means it should be straightforward to write new connectors to tools that support JSON.

Assessment

One of the difficulties in adopting new technologies is training employees who are unfamiliar with them. Kerf has some strong points here. The database query syntax is fairly standard SQL. The manual is well written, and in most cases we were able to find the primitives we needed. There is also a usable online help tool for finding information about Kerf primitives. Many common operations of interest to traders (moving averages, basic statistics) are available as primitive functions.

The JSON-like notation for data and the SQL query syntax are relatively intuitive to use. Writing simple functions is also straightforward for anyone who has used a high-level programming language, though it is a little peculiar that there are two equivalent ways of defining a function. Combinators seem powerful, but they may be unintuitive to users who are not accustomed to vector computing concepts.

Kerf Software is a new company, but its tool shows a lot of promise. Many firms may not want to hire specialized consultants to make competing tools work within their stack, and the fact that consulting firms are now acquiring specialized database vendors implies that the total cost of ownership (TCO) is indeed far more than the list price of the license. This is where smaller vendors can really make a difference. Other tools and new offerings in the marketplace seem to be moving towards cloud-based solutions; companies which require in-house infrastructure and analysis may want to consider Kerf moving forward.