Production data pipeline

Introduction

Problem: financial markets data features specific challenges that one must carefully deal with.

Solution: the quant production data pipeline leverages the most advanced techniques but also best practices in the financial industry.

Concrete production data pipeline makes use of PostgreSQL and TimescaleDB.

Timescale has been especially engineered for time series and events. As a result, it is one of the fastest and more efficient time series data base.

Architecture

The CoreDB contains information about the security master and can be found here.

A security master is an operational data store for the reference data about financial securities (instruments and products).

The Firehose will contain both the historical and real time (event) data bases, as well as a staging table (used as a cache).

The resolution will be 1 minute bars.

SQL code can be found here.

Cloud architecture

The timescaleDB quant prod database in deployed in the alexdb2 server.

The quant production data pipeline engine is deployed in the quants-2 server.

The data pull from data proviers runs every 15 minutes for the cryptocurrencies data and every hour for the tradFi futures contracts.

Schemas design

Security master architecture

Historical, event and staging tables architecture

Integration with the quant framework

Easy, modular and fast data pull implementation. Data is delivered to the quant framework into time bins, ranging from 1 minute to 3 days.

A runnable example and more details can be found here and the example section below.

The production data pipeline is the first building block in the quant framework, see architecture here.

Example

Python data pull runnable can be found here.

Pulling minutes bars data for ETH from 1st of January 2020 takes 1 seconds and 220 ms (see screenshot).

Note that the query has been executed at 4.15pm UTC; one can see in the screenshot further below that the data has been updated automatically by the prod data engine up to 4.15pm UTC.

Code snipped:

SELECT timestamp,
       product_id,
       product_ticker,
       instrument_id,
       venue_id,
       open,
       high,
       low,
       close,
       volume,
       timestamp_close,
       ts_created
FROM evnt.ohlcv_1m
WHERE product_id = 1001 and timestamp >= '2019-01-01'
order by timestamp desc

Last updated