While often neglected in the academic literature, the pre-processing step is one of the most important elements of the modelling process. Apart from the necessary cleansing of the data, it also includes feature engineering, with the proper encoding of categorical features and the definition of ratios derived from primary features. Being dimensionless quantities, ratios are more stable through time and often have better generalization properties.
In our case, the pre-processing step involves the train/test split of the dataset and the creation of different ratios. We also replaced missing values with the median and capped outliers at the 5th and 95th percentiles. Categorical variables were one-hot encoded, while variables admitting a natural ordering received a specific treatment and were ordinally encoded instead.
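The steps above can be sketched in pandas. This is a minimal illustration, not the authors' actual code: the column names, the 80/20 split fraction, and the rating ordering are all hypothetical.

```python
import numpy as np
import pandas as pd

# Toy dataset standing in for the raw data (column names are illustrative).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "revenue": rng.normal(100, 30, 200),
    "debt": rng.normal(50, 20, 200),
    "sector": rng.choice(["retail", "tech", "energy"], 200),
    "rating": rng.choice(["low", "medium", "high"], 200),
})
df.loc[::17, "debt"] = np.nan  # inject some missing values

# 1. Train/test split (assumed 80/20 here).
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

# 2. Ratio derived from primary features (dimensionless).
train["debt_to_revenue"] = train["debt"] / train["revenue"]

# 3. Median imputation, with medians computed on the training set only.
medians = train.select_dtypes("number").median()
train = train.fillna(medians)

# 4. Outlier capping at the 5th and 95th percentiles.
num_cols = train.select_dtypes("number").columns
lower = train[num_cols].quantile(0.05)
upper = train[num_cols].quantile(0.95)
train[num_cols] = train[num_cols].clip(lower, upper, axis=1)

# 5. One-hot encoding for nominal variables, ordinal encoding
#    for variables with a natural ordering.
train = pd.get_dummies(train, columns=["sector"])
train["rating"] = train["rating"].map({"low": 0, "medium": 1, "high": 2})
```

In production, the medians and percentile bounds fitted on the training set would also be applied unchanged to the test set, to avoid leakage.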
For the pre-processing step, Python and R proved extremely efficient. Many of the operations could be performed using built-in functions, and the scripting nature of these languages allowed us to interact directly with the dataset.
In Python, the data is read from CSV files using the pandas library, which provides a convenient structure for manipulating tabular data. The dataset can then be analysed directly with simple commands. Python also offers powerful graphical libraries, and plots of variables can be generated in a single line of code.
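As a sketch of this workflow, the snippet below reads a small in-memory CSV (standing in for the actual data file, whose name and columns are not given in the text) and inspects it with one-line pandas commands.

```python
import io
import pandas as pd

# A small CSV standing in for the actual data file (contents illustrative).
csv_data = io.StringIO(
    "id,revenue,sector\n"
    "1,120.5,retail\n"
    "2,98.3,tech\n"
    "3,150.0,energy\n"
)

# Read the tabular data into a DataFrame, exactly as with
# pd.read_csv("data.csv") on a file on disk.
df = pd.read_csv(csv_data)

# Simple one-line inspection commands.
print(df.shape)                    # dimensions of the table
print(df.describe())               # summary statistics of numeric columns
print(df["sector"].value_counts()) # frequency of each category

# A one-line plot of a variable (requires matplotlib):
# df["revenue"].plot.hist(bins=10)
```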