Data Harvesting – Extract, Transform, Load
A rich and comprehensive data set is essential for any data-driven system, machine learning models included. Only with such data can you build a robust, powerful model with the desired performance; otherwise, the developer runs into the so-called GIGO (garbage in – garbage out) effect, the expression that poor-quality input data produces poor-quality outputs. Although several data platforms automate the whole ETL (extract, transform, load) process, these solutions are sometimes more than you need, or the pre-made options simply do not meet your specific requirements.
Before you start training or even running your model, you have to collect all the necessary data and keep the data warehouse dynamically updated. To achieve this, it is necessary to build an auxiliary application that manages all of the above.
Because we are dealing with financial data, our problem is not only extracting, transforming, and loading data. There is a large number of financial ratios and firm-specific data points that may be related to each other, and all of them have to be kept in the correct time order. To manage this, we have designed several special-purpose objects responsible for:
- querying accessible APIs with financial data
- processing and transforming the harvested data
- saving the processed data to the data warehouse
Querying the accessible APIs is a once-a-day scheduled process that harvests all the data we can find, including Yahoo Finance data. The process runs on a virtual private server, a reliable solution at a reasonable price. To process and transform the data, we have designed a dedicated transformer that joins the desired datasets from the various sources and concatenates them, giving us datasets with a long history and a minimum of missing values. Next, the processor calculates the desired ratios that are missing from the original data source.
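To make the idea concrete, here is a minimal sketch of such a daily harvesting step, assuming the freely available yfinance package and a local parquet staging area. The ticker list, directory layout, and the `harvest_daily_prices` helper are illustrative assumptions, not our actual pipeline.

```python
# Illustrative daily-harvest sketch: pull full price history for a list of
# tickers via yfinance and stage each one as a parquet file.
from pathlib import Path

import yfinance as yf

TICKERS = ["AAPL", "MSFT", "XOM"]      # illustrative subset of the US-listed universe
RAW_DIR = Path("warehouse/raw")        # hypothetical staging area
RAW_DIR.mkdir(parents=True, exist_ok=True)


def harvest_daily_prices(tickers: list[str]) -> None:
    """Download the full price history per ticker and store one parquet file each."""
    for symbol in tickers:
        history = yf.Ticker(symbol).history(period="max", auto_adjust=False)
        if history.empty:
            continue                   # skip tickers for which no data was returned
        history.to_parquet(RAW_DIR / f"{symbol}.parquet")


if __name__ == "__main__":
    # In a production setup this would run as a once-a-day cron job on the VPS.
    harvest_daily_prices(TICKERS)
```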
Let's have a look at an example: suppose we consider "price to cash flow" a potentially good feature carrying information about corporate solvency and liquidity. We can construct this feature from data already available in the original data source; namely, "operating cash flow per share" and the "current market value per share (price)" can be used to calculate the P/CF ratio.
Price to Cash Flow Ratio = Share Price/Operating Cash Flow per Share.
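A minimal sketch of how such a derived ratio could be computed inside the fundamental processor is shown below; the column names (`close`, `operating_cash_flow_per_share`) are assumptions about the joined dataset, not the real schema.

```python
# Derive the P/CF ratio as a new column on a joined fundamentals dataframe.
import pandas as pd


def add_price_to_cash_flow(df: pd.DataFrame) -> pd.DataFrame:
    """Add a 'price_to_cash_flow' column; left undefined where cash flow is zero or missing."""
    out = df.copy()
    ocf = out["operating_cash_flow_per_share"]
    out["price_to_cash_flow"] = out["close"].where(ocf != 0) / ocf
    return out


# Example with a tiny hand-made frame:
sample = pd.DataFrame(
    {"close": [190.0, 410.0], "operating_cash_flow_per_share": [6.5, 11.2]},
    index=["AAPL", "MSFT"],
)
print(add_price_to_cash_flow(sample))
```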
There are dozens of these financial indicators calculated in the fundamental processor for over 3,000 US-listed stocks. Having both the raw figures and the derived relative ratios side by side adds value to the dataset.
After all these tasks are completed, the transformer converts the data into files and uploads them back into the data warehouse, where the datasets are ready to be loaded by machine learning developers.
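As a final illustration, here is a sketch of this publishing step under the assumption of an S3-backed warehouse; the bucket name, key layout, and `publish_dataset` helper are hypothetical and only stand in for whatever storage the warehouse actually uses.

```python
# Persist the processed dataset to a file and push it to the (assumed S3-backed) warehouse.
from pathlib import Path

import boto3
import pandas as pd

BUCKET = "example-data-warehouse"          # hypothetical bucket name
PROCESSED_DIR = Path("warehouse/processed")
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)


def publish_dataset(df: pd.DataFrame, name: str) -> None:
    """Write the dataset to parquet locally, then upload it to the warehouse."""
    local_path = PROCESSED_DIR / f"{name}.parquet"
    df.to_parquet(local_path)
    s3 = boto3.client("s3")
    s3.upload_file(str(local_path), BUCKET, f"fundamentals/{name}.parquet")
```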
To sum up, a good machine learning model needs rich and wide datasets; otherwise, the final output cannot add value by extracting useful information, and the overall performance is substandard. To keep historical time series updated automatically, you need an application that collects all the required data and derives its own financial indicators to get the most out of your datasets. Only then are you able to design a machine learning model with reasonable performance.
The results of research and development are used, for example, in our StockPicking Lab application, which scores undervalued and overvalued stocks every day.