GCP Big Query
Google Cloud Platform / Big Query
Introduction
Big Query is the Big Data Analytics platform proposed by GCP, similar to Snowflake and other competitors.
We selected it for 2 main advantages:
No upfront licence fee, pay-per-use pricing model convenient for going from 0 to 1.
Comes with interesting crypto datasets already uploaded and maintained by the open source community; for instance a full record of all ethereum transactions updated every 5 minutes or so.
We use it to store and analyze big data when there is no live production constraint. It is very slow if you need to access one row of data for production, maybe up to a few seconds; but very fast if you want to run a query against 500 million ethereum wallets, which will generally return under a minute, sometimes in a few seconds.
Risk Engine Datasets
So far, we created 2 datasets in Big Query:
ds_intotheblock: We have an ETL that reads data from IntoTheBlock API and pushes it into tables here for modelling purposes.
ds_ethereum: Some models requiring features about ethereum wallets start by aggregating all transactions into tables here before creating their features

Public Datasets
Interestingly, Big Query comes with some crypto datasets for free; but no classic finance one.

Tips and tricks
You can perform queries very easily by using the console UX, simply click new Query, enter the query and hit run. But before hitting run, if you wait a bit, you will see a small icon indicating if the query is valid and how much data it will consume. This is very convenient to understand Big Query performance and billing mechanics, and optimize the query accordingly. For example, selecting only the columns you really need in the output can significantly reduce query time and cost.
To query data from Python, it goes like this:
import pandas
from google.cloud import bigquery
client = bigquery.Client()
query = """
SELECT twm.*,
tm.num_wallets as token_num_wallets,
tm.vol_transfers as token_vol_transfers,
tm.deciles as token_deciles
FROM `bpf-datascience`.ds_ethereum.token_wallet_month as twm
INNER JOIN `bpf-datascience`.ds_ethereum.token_month tm on tm.token=twm.token and tm.block_month=twm.block_month
"""
df = client.query(query_wallets).to_dataframe()
Useful Links
Last updated