Feature Binning and Quantile Transformation
As part of the research behind, for example, the StockPicker application, we have recently implemented feature binning and quantile transformation to better prepare data for classification. Thanks to this improved data preparation, our machine learning models now achieve better results.
- The following lines describe the approach in detail.
- If you need to extract information from your data, do not hesitate to contact us.
Why (goal)
- many machine learning algorithms prefer, or perform better with, numerical input variables that follow a standard probability distribution, which motivates changing the data distribution (see the sketch after this list)
- some machine learning algorithms may prefer or require categorical or ordinal input variables and perform better with them
- discretization transforms are a technique for converting numerical input or output variables into discrete ordinal labels
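As a minimal sketch of the distribution change, the snippet below uses scikit-learn's QuantileTransformer to map a skewed feature onto an approximately normal distribution. The synthetic data and the parameter choices (n_quantiles, output_distribution) are illustrative assumptions, not our production setup.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Illustrative data only: a heavily right-skewed feature
rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(1000, 1))

# Map the empirical quantiles of X onto a standard normal distribution
qt = QuantileTransformer(n_quantiles=100, output_distribution="normal",
                         random_state=0)
X_normal = qt.fit_transform(X)

print("original    min/max:", X.min(), X.max())
print("transformed min/max:", X_normal.min(), X_normal.max())
```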
What (key points)
- binning or bucketing is used to reduce the effect of outliers and of minor observation errors
- the original data values falling into a given interval (bin) are replaced by a value representative of that interval, typically a central value such as the mean, as sketched after this list
- the transformation gives the numerical variable a discrete probability distribution: each numerical value is assigned a label, and the labels have an ordered (ordinal) relationship
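A hedged illustration of the replacement step: the sketch below splits a variable into equal-width bins, assigns each value an ordinal label, and substitutes each value with the mean of its bin. The data and the bin count k are made up for the example.

```python
import numpy as np

# Illustrative data only: 200 synthetic measurements
rng = np.random.default_rng(1)
x = rng.normal(loc=50.0, scale=10.0, size=200)

# Split the observed range into k equal-width bins
k = 4
edges = np.linspace(x.min(), x.max(), k + 1)

# Ordinal labels 0..k-1 (interior edges only, so the max lands in the last bin)
labels = np.digitize(x, edges[1:-1])

# Replace every value with the mean of its bin, the representative value
bin_means = np.array([x[labels == i].mean() for i in range(k)])
x_binned = bin_means[labels]

print(np.column_stack([x[:5], labels[:5], x_binned[:5]]))
```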
How (procedure)
Discretization
Binning, also known as categorization or discretization, is the process of translating a quantitative variable into a set of two or more qualitative buckets (i.e., categories)
— Page 129, Feature Engineering and Selection, 2019.
Different methods for grouping the values into k discrete bins can be used; common techniques include:
- Uniform: Each bin has the same width in the span of possible values for the variable.
- Quantile: Each bin has the same number of values, split based on percentiles.
- Clustered: Clusters are identified, and each example is assigned to the bin of its nearest cluster.
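A minimal sketch of the three techniques, assuming scikit-learn's KBinsDiscretizer, whose strategy values "uniform", "quantile", and "kmeans" correspond to the uniform, quantile, and clustered approaches above; the skewed sample data and n_bins=5 are illustrative choices.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Illustrative data only: a right-skewed feature
rng = np.random.default_rng(2)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))

# 'uniform' -> equal-width bins, 'quantile' -> equal-frequency bins,
# 'kmeans'  -> bins built around 1-D cluster centres
for strategy in ("uniform", "quantile", "kmeans"):
    disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy=strategy)
    X_binned = disc.fit_transform(X)
    counts = np.bincount(X_binned.ravel().astype(int), minlength=5)
    print(f"{strategy:>8}: bin counts {counts}")
```

On this skewed data, the quantile strategy prints roughly equal bin counts by construction, while the uniform strategy concentrates most examples in the lowest bins.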
Discussion
- a number of machine learning algorithms perform better thanks to the more standard distribution of the features
- on the other hand, some authors argue that such a transformation results in a loss of valuable information, and that binning is more laborious than other techniques that convey the same information
- despite that argument, binning can be useful for transforming continuous data into discrete data when a crude, arbitrary approximation is acceptable
My short answer to when binning is OK to use is this: when the points of discontinuity are already known before looking at the data (these are the bin endpoints), and when it is known that the relationship between x and y within each bin of non-zero length is flat.