What is Numerai ($NMR)?

3 May 2024

Numerai is a data science competition where you build machine learning models to predict the stock market. You are provided with free, high-quality data that you can use to train models and submit predictions daily. Numerai computes the performance of these predictions over the following month, and you can stake NMR on your model to earn (or burn) NMR based on its performance.



Data

Numerai's free dataset is made of clean and regularized financial data. The dataset is obfuscated so that it can be given out for free and modeled without any financial domain knowledge. This also means that models you build on this data cannot be used outside of the Numerai tournament.
Here is an example of the general structure of our dataset:
(Figure: Numerai's obfuscated dataset)
Each row in the dataset corresponds to a specific stock at a specific point in time. The point in time is noted by the era - each represents a week. The IDs are unique within each era such that you cannot match stocks across eras - this is necessary for the obfuscation. The features are quantitative attributes known about the stock at the time (e.g. P/E ratio, ADV, etc.). The target is a measure of stock market returns 20 days into the future, where low means bad performance and high means good performance.
Here is an example of how to get our dataset:

from numerapi import NumerAPI
import pandas as pd

napi = NumerAPI()
# Use int8 to save on storage and memory
napi.download_dataset("v4.3/train_int8.parquet")
training_data = pd.read_parquet("v4.3/train_int8.parquet")

See the Data section for more details.


Modeling

Your objective is to build machine learning models to predict the target given the features. You can use any language or framework that you like.
Here is an example model in Python using LightGBM:

import lightgbm as lgb

features = [f for f in training_data.columns if "feature" in f]

model = lgb.LGBMRegressor(
      n_estimators=2000,
      learning_rate=0.01,
      max_depth=5,
      num_leaves=2 ** 5,
      colsample_bytree=0.1
)
model.fit(
      training_data[features],
      training_data["target"]
)

See the Models section for more examples.


Submissions

Each day (Tuesday through Saturday), new live data is released. This represents the current state of the stock market. You must generate live predictions and submit them to Numerai. You are asked to submit a prediction value for each id in the live data.
Here is an example of how you generate and upload live predictions in Python:

# Use API keys to authenticate
napi = NumerAPI("[your api public id]", "[your api secret key]")

# Download latest live features
napi.download_dataset("v4.3/live_int8.parquet")
live_data = pd.read_parquet("v4.3/live_int8.parquet")
features = [f for f in live_data.columns if "feature" in f]
live_features = live_data[features]

# Generate live predictions
live_predictions = model.predict(live_features)

# Format and save submission
submission = pd.Series(
    live_predictions, index=live_features.index
).to_frame("prediction")
submission.to_csv("submission.csv")

# Upload submission
napi.upload_predictions("submission.csv")

Behind the scenes, Numerai combines the predictions of all models into the Stake-Weighted Meta Model, which in turn is fed into the Numerai Hedge Fund for trading.
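The actual Meta Model construction is not public, but the idea of stake-weighted averaging can be sketched with hypothetical models and stakes (all names and numbers below are made up for illustration):

```python
import pandas as pd

# Hypothetical predictions from three staked models on the same five stocks
preds = pd.DataFrame({
    "model_a": [0.1, 0.4, 0.5, 0.7, 0.9],
    "model_b": [0.2, 0.3, 0.6, 0.8, 0.7],
    "model_c": [0.5, 0.5, 0.5, 0.4, 0.6],
})
stakes = pd.Series({"model_a": 100.0, "model_b": 50.0, "model_c": 10.0})

# Stake-weighted average: each model's vote is proportional to its stake
weights = stakes / stakes.sum()
meta_model = preds.mul(weights, axis=1).sum(axis=1)
```

Models with larger stakes pull the Meta Model toward their own predictions, which is why staking is described as a signal of confidence.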
See the Submissions section for more details and examples.


Scoring

Submissions are scored against two main metrics: correlation with the target (CORR) and contribution to the Meta Model (MMC).

Since the target is a measure of 20 business days of stock market returns, it takes about 1 month for each submission to be fully scored.
See the Scoring section for more details.


Staking

When you are ready and confident in your model's performance, you may stake on it with NMR - Numerai's cryptocurrency. After the 20 days of scoring for each submission, models with positive scores are rewarded with more NMR, while those with negative scores have a portion of their staked NMR burned (destroyed such that no one, not even Numerai, can access it).
Staking serves two important functions:

  1. "Skin in the game" allows Numerai to trust the quality of staked predictions.
  2. Payouts and burns continuously improve the weights of the Meta Model.
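The exact payout formula (multipliers, clipping, and the payout factor) is defined in the Staking docs; the following is only an illustrative sketch of the sign logic, with a hypothetical `payout_factor` parameter:

```python
def payout_sketch(stake: float, score: float, payout_factor: float = 1.0) -> float:
    """Illustrative only: a positive score earns NMR, a negative score burns it.

    The real payout formula is defined in Numerai's Staking docs; this
    sketch just shows that payout is proportional to stake and score.
    """
    return stake * score * payout_factor

payout_sketch(100.0, 0.05)   # positive score: NMR is earned
payout_sketch(100.0, -0.03)  # negative score: NMR is burned
```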

See the Staking section for more details.


FAQ


Do I have to stake?

No. You can submit your prediction file and receive performance scores without staking.


Why wouldn't I just trade it myself?

You can't trade based on your Numerai Tournament predictions. Since our data is obfuscated, it's impossible to map predictions back to real stocks for your own trading.


Why not pay in USD?

USD cannot be burned. NMR was designed to be burned and thus can be sent to a null address, making it completely unusable by anyone. This is important because if NMR is burned due to bad performance, you can be sure that the NMR is disappearing, not simply being sent to another user.



Data Structure

The Numerai dataset is a tabular dataset that describes the global stock market over time.
At a high level, each row represents a stock at a specific point in time, where id is the stock id and the era is the date. The features describe the attributes of the stock known on that date. The target is a measure of future returns relative to the date.


IDs

The id is unique per stock per era, which means you cannot track the performance of any single stock over time. This is pivotal to the obfuscation of the data, so you should treat the id as a simple unique identifier for a specific stock at a specific point in time.


Eras

Eras represent different points in time. Specifically, each era corresponds to the Friday of a week, because Friday's market close is the latest data available when predictions are generated over the weekend.
Instead of treating each row as a single data point, you should strongly consider treating each era as a single data point. For the same reason, many of the metrics on Numerai are computed per era, for example mean correlation per era.
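The per-era idea is easy to sketch with pandas: score each era independently, then summarize across eras (the frame below is a toy stand-in for validation data with predictions already attached):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for scored data: two eras of predictions and targets
df = pd.DataFrame({
    "era": ["era1"] * 4 + ["era2"] * 4,
    "prediction": [0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1],
    "target":     [0.0, 0.25, 0.5, 1.0, 1.0, 0.5, 0.25, 0.0],
})

# Compute the correlation within each era, then average across eras
per_era_corr = df.groupby("era")[["prediction", "target"]].apply(
    lambda era: np.corrcoef(era["prediction"], era["target"])[0, 1]
)
mean_corr = per_era_corr.mean()
```

Averaging per-era scores, rather than pooling all rows, prevents one unusual week from dominating the metric.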
In historical data (train, validation), eras are 1 week apart, but the target values can be forward-looking by 20 or 60 days. This means that the target values "overlap", so special care must be taken when applying cross validation. See this forum post for more information.
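One common way to handle overlapping targets is era-aware cross validation with an embargo: leave a gap of eras between train and test folds so that no training target overlaps a test era. A minimal sketch, assuming eras are consecutive integers (the fold and embargo sizes below are arbitrary):

```python
def era_splits(eras, n_folds=3, embargo=4):
    """Yield (train_eras, test_eras) pairs with an embargo gap.

    Because a 20-day target spans roughly 4 weekly eras, dropping
    `embargo` eras on each side of the test fold keeps training
    targets from overlapping test eras.
    """
    unique = sorted(set(eras))
    fold_size = len(unique) // n_folds
    for i in range(n_folds):
        test = unique[i * fold_size : (i + 1) * fold_size]
        # Train only on eras at least `embargo` eras away from the test fold
        train = [e for e in unique
                 if e < test[0] - embargo or e > test[-1] + embargo]
        yield train, test

folds = list(era_splits(range(1, 13), n_folds=3, embargo=2))
```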
In the live tournament, each new round contains a new era of live features, but rounds are only 1 day apart.


Features

There are many features in the dataset, ranging from fundamentals like P/E ratio, to technical signals like RSI, to market data like short interest, to secondary data like analyst ratings, and much more. Feature values for a given era represent attributes as of that point in time.
Each feature has been meticulously designed and engineered by Numerai to be predictive of the target or additive to other features. We have taken extreme care to make sure all features are point-in-time to avoid leakage issues.
While many features can be predictive of the targets on their own, their predictive power is known to be inconsistent over time. Therefore, we strongly advise against building models that rely too heavily on, or are highly correlated with, a small number of features, as this will likely lead to inconsistent performance. See this forum post for more information.
Note: some feature values can be NaN. This is because some feature data is simply not available at that point in time; instead of making up a fake value, we let you choose how to deal with it yourself.
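Since the int8 files bin features into the integer values 0-4, one simple (but not the only) choice is to fill gaps with the middle bin. A toy sketch:

```python
import pandas as pd

# Toy frame: int8 features take values 0-4, with some missing entries
df = pd.DataFrame({
    "feature_a": [0, 2, None, 4],
    "feature_b": [1, None, 3, 3],
    "target":    [0.25, 0.5, 0.5, 0.75],
})
features = [c for c in df.columns if "feature" in c]

# Fill gaps with the middle bin (2); per-feature medians or a
# dedicated "missing" indicator are reasonable alternatives
df[features] = df[features].fillna(2)
```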


Targets

The main target of the dataset is specifically engineered to match the strategy of the hedge fund and represent the future performance of a given stock relative to the given era.
Our hedge fund is neutral to many forms of "beta" - we remove our exposure to markets, countries, sectors, and common factors. This is because we are in search of "alpha" - returns from a stock that are not explained by broader trends (e.g. if the whole market goes up or down, we want to make money on specific stocks that over- or under-perform the market). You can interpret the main target as representing these stock-specific returns.
Apart from the main target, we provide many auxiliary targets that are different types of stock-specific returns. Like the main target, these auxiliary targets are based on stock-specific returns, but they differ in what is "residualized" (e.g. neutral to market and country vs. neutral to sector and factors) and in time horizon (e.g. 20 vs. 60 market days into the future).
Even though our objective is to predict the main target, we have found it helpful to also model these auxiliary targets. Sometimes, a model trained on an auxiliary target can even outperform a model trained on the main target. In other scenarios, we have found that building an ensemble of models trained on different targets can also help with performance.
Note: some auxiliary target values can be NaN, but the main target will never be NaN. This is because some target data is simply not available at that point in time; instead of making up a fake value, we let you choose how to deal with it yourself.


Downloading the Data

The best way to access the Numerai dataset is via the data API. There are many files available for download, but it's best to start by listing them.
Here is an example of how to list our files and versions in Python:

from numerapi import NumerAPI
napi = NumerAPI()
datasets = napi.list_datasets()

You can also review the datasets available on our Data Page.


Versions

There are several versions in the data API because Numerai is constantly improving our dataset. In general, if you are building a new model you are encouraged to use the latest version. Minor versions (e.g. v4 vs v4.1 vs v4.2) generally maintain backwards compatibility with each other, making it easy to plug the latest data into your trained models. Major versions (e.g. v3 vs v4) may have large, breaking changes to the structure and/or contents of the datasets, so it's usually best to re-train your models when major versions are released.
You'll even see versions available for our other tournament, Signals. This is a "bring your own data" tournament that you can read more about here.


Files

The Numerai dataset is made up of many different files in a few different formats. Here is a list of the common files we will usually give out with every version:

  • features.json contains metadata about the features and feature sets; this is critical when you have limited resources (more on this below)
  • train_int8.parquet contains the historical data with features and targets
  • validation_int8.parquet contains more historical data with features and targets
  • live_int8.parquet contains the live features of the current round (no targets)
  • meta_model.parquet contains the meta model predictions of past rounds
  • live_example_preds.parquet contains the latest live predictions of the example model
  • validation_example_preds.parquet contains the validation predictions of the example model

Here is how to get a small set of training features from the data API in Python:

from numerapi import NumerAPI
import pandas as pd
import json

napi = NumerAPI()

# Download and read the features json file
napi.download_dataset("v4.3/features.json")
feature_metadata = json.load(open("v4.3/features.json"))

# Use the small feature set to reduce memory usage
small_feature_set = feature_metadata["feature_sets"]["small"]
columns = ["era"] + small_feature_set + ["target"]

# Download and read the training data
napi.download_dataset("v4.3/train_int8.parquet")
training_data = pd.read_parquet("v4.3/train_int8.parquet", columns=columns)


Formats

The main file format of our data API is Parquet, which works great for large columnar data - we highly encourage you to become familiar with Parquet if you are not already. You can also find CSV versions of all files if you prefer, but the training and validation data are quite large and take a while to load in CSV format.
You'll also notice that many files have an int8 suffix. This denotes that the data in that file is stored using integers (0, 1, 2, 3, 4) rather than floats (0.0, 0.25, 0.5, 0.75, 1.0). We offer these files because our datasets are large; training on int8 requires fewer resources than training on float32.
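The saving is easy to check: int8 stores each value in one byte, float32 in four. A quick sketch with NumPy:

```python
import numpy as np

# The same 1 million feature values stored two ways
as_int8 = np.zeros(1_000_000, dtype=np.int8)
as_float32 = np.zeros(1_000_000, dtype=np.float32)

as_int8.nbytes     # 1,000,000 bytes
as_float32.nbytes  # 4,000,000 bytes
```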

Summary


What is Numeraire (NMR)? How does it work?
Numerai is a system that combines cryptography, machine learning, and data science, and it operates as a hedge fund. Numeraire (NMR), the token of the platform, is built on the Ethereum blockchain and follows the ERC-20 standard; it is noted as the first cryptocurrency created by a hedge fund. NMR is used to reward users: data scientists who perform well in the Numerai Tournament, a competition in which participants predict the stock market, are rewarded with NMR.

NMR is not mined.

The maximum supply of NMR is limited to 11,000,000 units.

How to Store Numeraire (NMR)?
Numeraire (NMR) is an ERC-20 token that aims for decentralization in the field of data science. You can manage your NMR balance with desktop, mobile, and hardware wallets that support ERC-20 Ethereum assets.

Community:
https://twitter.com/numerai
https://reddit.com/r/numerai
https://t.me/NMR_Official
https://www.youtube.com/@Numerai

Resources:
https://docs.numer.ai/
