Selection of significant predictors from a large number of time series
Undervalued and overvalued stocks in our model are currently calculated by a proof-of-concept solution built on proprietary software with inherent limitations.
We are therefore preparing a new, robust production system. This also lets us take advantage of synergies with existing components from other areas, for example for postprocessing of results.
- One of the necessary steps in data preprocessing*, which we have been working on in recent weeks, was to prevent the model from overfitting, i.e., to teach the model to generalize**.
- A model can easily overfit, for example when it takes too many parameters into account. That was exactly our case: we originally considered approximately 1000 time series***.
- Until recently, our production model could contain several significant time series even when they were highly correlated with one another.
- Our goal is to keep only variables that are sufficiently different from each other for further calculations. The result is higher speed and better maintainability of the system (see the principle of parsimony, also known as Occam's razor).
- Therefore, we removed the mutually correlated series and left only those sufficiently different from each other.
With these steps, the feature selection stage is successfully completed.
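The removal of mutually correlated series described above can be sketched roughly as follows. This is a minimal illustration and not our production code: the greedy strategy, the Pearson correlation measure, the 0.9 threshold, the function names `pearson` and `drop_correlated`, and the toy data are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # treat a constant series as uncorrelated with anything
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(series, threshold=0.9):
    """Greedily keep a series only if its absolute correlation with every
    series kept so far is below `threshold` (an assumed cutoff).

    `series` maps a name to a list of values; insertion order sets priority.
    """
    kept = []
    for name, values in series.items():
        if all(abs(pearson(values, series[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy data: "close_scaled" is a linear copy of "close".
close = [10.0, 10.5, 10.2, 11.0, 11.4, 11.1]
series = {
    "close": close,
    "close_scaled": [2 * v + 1 for v in close],
    "volume": [300.0, 120.0, 450.0, 200.0, 380.0, 150.0],
}
selected = drop_correlated(series, threshold=0.9)
print(selected)
```

In this sketch the order of the input decides which member of a correlated group survives, so in practice one would probably sort the series first, for example by their significance to the target variable.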
*Data preparation (preprocessing) is the most computationally demanding phase of creating a model.
**An overfitted model can achieve great results on training data (i.e., data that it saw during the learning process) but poor results on test data (data that it did not see during the learning process).
***A time series here means the values of an indicator as they change over time, such as the close price, or indicators derived from it (and from other data), such as Currency Volume. For each of these indicators, we determine its significance to the target variable (for example, normalized daily yield) using various statistical tests (for example, the t-test). To prevent overfitting, we remove insignificant time series from the model.