Factors construction

a. Derived data preparation

Financial time series are chaotic and non-stationary. The main challenge with non-stationary data is that a model fitted to one data distribution cannot be reliably applied to a dataset whose distribution has shifted.

The derived data preparation steps are as follows:

i. Data cleaning: we visualize and handle missing data, weekend gaps (for the datasets that come from traditional markets), etc.

ii. Outliers cut: we detect and deal with outliers using a proprietary method

iii. Normalization and standardization: some of our models (neural networks in particular) require data to be on the same scale, otherwise training may fail to converge

iv. Differencing: time series are differenced to make them stationary, which improves predictive power and robustness of fit

v. Liquidity filter: we have developed a multi-level, real-time liquidity filter for historical and live data. It takes into account higher-than-expected volatility and volumes, missing data and market capitalization. For example, if volatility is higher than the given threshold for a (coin, time bin) pair, that pair is excluded from the sample; the same applies to pairs with missing prices or volumes. This data cleaning is rigorous and automated
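The exclusion rule described in step v can be illustrated with a minimal sketch. It is not the production filter: the record layout (`Bin`), the threshold values and the function names are hypothetical, and the check on abnormal volumes is left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical thresholds, chosen only for illustration.
MAX_VOLATILITY = 2.0       # annualized volatility above this is rejected
MIN_MARKET_CAP = 50e6      # market capitalization below this is rejected


@dataclass
class Bin:
    """One (coin, time bin) observation; field names are assumptions."""
    coin: str
    time_bin: str
    price: Optional[float]
    volume: Optional[float]
    volatility: Optional[float]
    market_cap: Optional[float]


def is_liquid(b: Bin) -> bool:
    """Return True if the (coin, time bin) pair passes every check."""
    # Missing price, volume or volatility excludes the pair.
    if b.price is None or b.volume is None or b.volatility is None:
        return False
    # Volatility above the threshold excludes the pair.
    if b.volatility > MAX_VOLATILITY:
        return False
    # Missing or too small market capitalization excludes the pair.
    if b.market_cap is None or b.market_cap < MIN_MARKET_CAP:
        return False
    return True


def liquidity_filter(sample: List[Bin]) -> List[Bin]:
    """Keep only the pairs that pass the liquidity checks."""
    return [b for b in sample if is_liquid(b)]


sample = [
    Bin("AAA", "2022-01-01", 10.0, 1e6, 0.8, 1e9),
    Bin("BBB", "2022-01-01", 5.0, 2e5, 3.5, 2e8),   # volatility too high
    Bin("CCC", "2022-01-01", None, 1e5, 0.5, 3e8),  # missing price
]
kept = liquidity_filter(sample)
```

Because each pair is judged independently, the same function could be applied to historical data in bulk or to live data one bin at a time.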

Steps ii-iv are applied at the feature engineering level: they are part of feature construction, they influence each feature's predictive power, and they must therefore be tailored to each feature.

Any time series transformation must be backward-looking, i.e. computed at time t using only data available up to t. A forward-looking transformation leaks future data into the features, which is a mistake: using data from the future to predict that same future is cheating.
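As an illustration of the backward-looking requirement, here is a minimal sketch of a rolling standardization and a first difference that only use past and current observations. The function names, the window length and the choice of the population standard deviation are assumptions made for the example, not the transformations actually used.

```python
from statistics import mean, pstdev


def rolling_zscore(values, window):
    """Z-score of each point against the `window` observations ending at it.

    Only values[t - window + 1 : t + 1] are used at time t, so appending
    future observations never changes an already computed value.
    """
    out = []
    for t in range(len(values)):
        if t + 1 < window:
            out.append(None)  # not enough history yet
            continue
        hist = values[t + 1 - window : t + 1]
        sd = pstdev(hist)
        out.append((values[t] - mean(hist)) / sd if sd > 0 else 0.0)
    return out


def first_difference(values):
    """x_t - x_{t-1}; the first point has no predecessor."""
    return [None] + [values[t] - values[t - 1] for t in range(1, len(values))]


series = [1.0, 2.0, 3.0, 4.0, 100.0]
z_full = rolling_zscore(series, window=3)
z_truncated = rolling_zscore(series[:4], window=3)
# The first four z-scores are identical with or without the last observation.
```

By contrast, standardizing with the mean and standard deviation of the whole series would make every past value depend on the final observation, which is exactly the forward-looking leak described above.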

b. Features engineering

Features, sometimes also called factors, are grouped by theme, as follows:

i. Prices: price-based factors, from log returns to the slope factor, which compares the n-day return to its long-term average

ii. Statistical arbitrage: cross-coin price differences and the reversion of these differences to their long-term mean

iii. Technicals: used by momentum and mean-reversion strategies, they are continuous and more advanced versions of breakout trading signals

iv. Realized volatility: realized volatilities of traded prices

v. Volumes: similar to volatility but based on traded volumes

vi. Global macro: differences in price returns across global equities, global fixed income, currencies and commodities
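To make the price-based theme more concrete, the sketch below computes n-day log returns and one possible reading of the slope factor: the current n-day return minus its average over a longer trailing window. The exact definition of the slope factor is not given here, so the formula, the function names and the parameters `n` and `long_window` are assumptions.

```python
import math


def log_returns(prices, n=1):
    """n-period log return ln(P_t / P_{t-n}); None where history is missing."""
    return [
        None if t < n else math.log(prices[t] / prices[t - n])
        for t in range(len(prices))
    ]


def slope_factor(prices, n, long_window):
    """Current n-period return minus its trailing `long_window` average.

    Assumed definition, for illustration only. Backward-looking: the value
    at time t uses prices up to and including t.
    """
    r = log_returns(prices, n)
    out = []
    for t in range(len(r)):
        start = t + 1 - long_window
        if start < n:  # the first valid return sits at index n
            out.append(None)
            continue
        hist = r[start : t + 1]
        out.append(r[t] - sum(hist) / long_window)
    return out


prices = [100.0, 101.0, 103.0, 102.0, 105.0, 110.0]
r1 = log_returns(prices, n=1)
slope = slope_factor(prices, n=1, long_window=3)
```

With this definition, a price growing at a constant rate has a slope factor of zero, and the factor turns positive when recent returns exceed their own longer-term average.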

1. Factors selection

The sample is split into an in-sample set and an out-of-sample set. The in-sample set starts in October 2020 and the out-of-sample set begins in July 2022, giving roughly a 2/3 to 1/3 split, in line with industry best practice.
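A chronological split of this kind can be sketched as follows. The months come from the text above; the exact boundary days (the first of each month), the row layout and the function name are assumptions.

```python
from datetime import date

# Boundary months are from the text; the day of month is an assumption.
IN_SAMPLE_START = date(2020, 10, 1)
OUT_OF_SAMPLE_START = date(2022, 7, 1)


def split_sample(rows):
    """Split (date, value) rows chronologically, with no shuffling."""
    in_sample = [r for r in rows if IN_SAMPLE_START <= r[0] < OUT_OF_SAMPLE_START]
    out_of_sample = [r for r in rows if r[0] >= OUT_OF_SAMPLE_START]
    return in_sample, out_of_sample


rows = [
    (date(2020, 9, 30), 0.1),   # before the sample starts: dropped
    (date(2020, 10, 1), 0.2),
    (date(2022, 6, 30), 0.3),
    (date(2022, 7, 1), 0.4),
]
in_sample, out_of_sample = split_sample(rows)
```

The split is strictly by date, so no out-of-sample observation precedes an in-sample one.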

The in-sample set is used to select factors, models and hyperparameters. Note that the in-sample set also contains a validation set, used in the modelling (more on that in the modelling section).

By definition, the out-of-sample set is unseen and is used only to evaluate forecasting performance. Going back and forth between the in-sample and out-of-sample sets must be minimized (ideally avoided altogether) in order to limit overfitting and selection bias.

The factor selection methodology combines experience, data visualization and machine learning.

The single evaluation metric used is the rolling correlation between a factor and the subsequent returns, also known as the information coefficient (IC). Practitioners commonly consider a trader who achieves an IC above 10% (0.10) to be a good trader.

We do not use the win ratio, as it does not take into account the non-normality of financial time series distributions.

For a 10-day horizon model, the chosen correlation lookback window is 90 days.
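The rolling IC can be sketched as a Pearson correlation between the factor value at each date and the return over the following 10 days, computed over the last 90 daily observations. This is a minimal illustration: the function names are hypothetical, log returns and Pearson (rather than rank) correlation are assumptions, and the synthetic price path exists only to exercise the code.

```python
import math


def pearson(x, y):
    """Pearson correlation; None if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return None
    return cov / math.sqrt(vx * vy)


def forward_log_returns(prices, horizon):
    """ln(P_{t+h} / P_t); None for the last `horizon` points."""
    n = len(prices)
    return [
        math.log(prices[t + horizon] / prices[t]) if t + horizon < n else None
        for t in range(n)
    ]


def rolling_ic(factor, prices, horizon=10, lookback=90):
    """Correlation of factor[t] with the horizon-ahead return, per window.

    The value stored at index t covers the `lookback` dates ending at t and
    is only defined once the forward return of date t has been realized.
    """
    fwd = forward_log_returns(prices, horizon)
    out = []
    for t in range(len(factor)):
        start = t + 1 - lookback
        if start < 0 or fwd[t] is None:
            out.append(None)
            continue
        out.append(pearson(factor[start : t + 1], fwd[start : t + 1]))
    return out


# Synthetic check: a factor equal to the future return has an IC of 1.
prices = [100 + 10 * math.sin(t / 5) + 0.1 * t for t in range(150)]
fwd = forward_log_returns(prices, 10)
perfect_factor = [r if r is not None else 0.0 for r in fwd]
ic = rolling_ic(perfect_factor, prices)
```

Note that the IC for the window ending at date t can only be observed 10 days after t, once the last forward return in the window is known.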

When selecting a factor, the criteria are:

i. Good predictive power across different market regimes

ii. Low correlation with existing factors (ideally close to orthogonal)

iii. Rationale and intuition: does the relationship make sense from a causality and economics perspective?

iv. Machine learning: the Random Forest model is used as an additional tool to validate the factor selection

Note: it is sometimes useful to add a factor that is only relevant during specific market regimes, as it provides diversification.
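Criterion ii can be checked mechanically. The sketch below measures the largest absolute correlation between a candidate factor and the factors already selected, and accepts the candidate only if it stays under a cap. The cap of 0.5, the factor names and the function names are hypothetical.

```python
import math

MAX_ABS_CORR = 0.5  # hypothetical cap, for illustration only


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def screen_candidate(candidate, existing, max_abs_corr=MAX_ABS_CORR):
    """Return (accepted, worst) for a candidate factor.

    `existing` maps factor names to series aligned with `candidate`;
    `worst` is the largest absolute correlation found.
    """
    worst = max(
        (abs(pearson(candidate, series)) for series in existing.values()),
        default=0.0,
    )
    return worst <= max_abs_corr, worst


existing = {"momentum": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}
redundant = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]     # perfectly correlated
distinct = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]     # weakly correlated

redundant_ok, redundant_corr = screen_candidate(redundant, existing)
distinct_ok, distinct_corr = screen_candidate(distinct, existing)
```

Such a screen covers only the correlation criterion; predictive power, economic rationale and the Random Forest check still have to be assessed separately.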

Figure 1 shows the correlation of the triple_ma (triple moving average) technical factor with the 10-day-ahead returns. The correlation is positive until Q3 2021 (momentum) before switching to a negative (mean-reversion) relationship. This factor will be selected.

Figure 2 shows the feature importances from the random forest model, which indicate which features are the most relevant.
