02 Aug, 2023

Written by Prashant Dimri, Consultant.

Before developing a credit risk model, it is important to prepare the data. This is one of the first and arguably the most challenging steps in the model development process. High-quality data is essential to building an accurate and reliable credit risk model. The goal is to transform raw data into clean, organized data in a format suitable for analysis and modeling. The process should also be iterative, with periodic updates and refinements to keep the model relevant and effective as new data becomes available.

There are two types of data: structured and unstructured. Structured data is presented in the form of rows and columns, while unstructured data is generally in the form of text. Structured data includes time-series data, which is any variable observed over a period of time (e.g., default rates, GDP). Variables themselves come in two types: numerical and categorical. Numerical variables can be discrete (countable) or continuous (uncountable), whereas categorical variables can be nominal (unordered) or ordinal (ordered). Data preparation involves several aspects:

- Exploratory data analysis
- Treating missing values
- Treating outliers
- One-hot encoding
- Scaling variables
- Transforming variables
- Data partitioning and model selection

After the above steps, the model is built on the training dataset using the statistically significant variables that explain the dependent variable, and its performance is checked on the validation and test datasets.

In exploratory data analysis (EDA), the data is explored and summarized. This can be done by visualizing variables (boxplots, histograms, etc.) and by computing descriptive statistics such as the mean, median, and mode. EDA also covers checking whether a variable is skewed (if so, a normalizing transformation can be applied), checking the percentage of missing values and outliers in each variable, and checking how many distinct labels a categorical variable has. Categorical variables with a very large number of labels should generally be kept out of the model, as they tend to reflect noise rather than explain the dependent variable. Further analysis can be added as the business requires.
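The EDA checks above can be sketched in a few lines of pandas; the dataset and column names here are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical loan dataset for illustration only.
df = pd.DataFrame({
    "income": [35000, 42000, np.nan, 58000, 61000, 250000],
    "default": [0, 0, 1, 0, 1, 0],
    "region": ["Dublin", "Cork", "Dublin", "Galway", "Cork", "Dublin"],
})

# Descriptive statistics: count, mean, quartiles, min, max.
print(df["income"].describe())

# Skewness of a numeric variable (a large positive value suggests right skew).
print("skew:", df["income"].skew())

# Percentage of missing values per variable.
print(df.isna().mean() * 100)

# Number of distinct labels in a categorical variable.
print("labels:", df["region"].nunique())
```

Each output feeds a data-preparation decision: high skew suggests a transformation, a high missing percentage suggests imputation, and a high label count argues against one-hot encoding that variable.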

If a variable has missing values, they are imputed using different methods. There are three types of missingness: MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random). MCAR means there is no pattern in the missing values; MAR means the missingness depends on the observed values of other variables; MNAR means the missingness depends on the unobserved value itself, or on variables not included in the analysis. Some important methods for dealing with missing values are:

- Mean, median, or mode imputation - suitable for MCAR. The average of the observed values of a variable is used to fill its missing cells. Use the mean if the distribution is roughly normal and the median otherwise, as the mean is distorted when the distribution is skewed.
- Regression imputation - suitable for MAR. A regression on the other variables is used to predict and impute the missing values.
- Deletion - suitable for MCAR. Any row containing a missing cell is deleted entirely.
- Missing indicator approach - suitable for MNAR. Also known as the dummy variable approach: an indicator variable is created that is 1 when the data point is missing and 0 when it is complete. This can reveal the pattern in the missingness and help in treating it. It is typically used when the missing variable is numeric, with the indicator added as a separate categorical variable.
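Three of the approaches above (missing indicator, median imputation, deletion) can be sketched with pandas; the values and column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical variable with missing values.
df = pd.DataFrame({"income": [30000.0, 45000.0, np.nan, 52000.0, np.nan, 300000.0]})

# Missing indicator (dummy variable) approach: 1 where missing, 0 where complete.
df["income_missing"] = df["income"].isna().astype(int)

# Median imputation: preferred over the mean here because the distribution
# is right-skewed (the 300000 value pulls the mean upward).
df["income_imputed"] = df["income"].fillna(df["income"].median())

# Deletion: drop every row whose income cell is missing (MCAR only).
df_complete = df.dropna(subset=["income"])
```

In practice the indicator column is often kept alongside the imputed column, so the model can still use the missingness pattern as a signal.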

Treating outliers is an important step in building a robust model: outliers can distort the model's estimates and give misleading results. There are different methods to treat outliers:

- Trimming - the outliers are simply removed.
- Winsorizing - extreme values are capped to the nearest non-outlier value.

Boxplots are useful for detecting and treating outliers:

- The median, first quartile (Q1), third quartile (Q3), and the upper and lower boundaries are shown.
- The interquartile range (IQR) is calculated as IQR = Q3 - Q1.
- The lower boundary is Q1 - 1.5*IQR, while the upper boundary is Q3 + 1.5*IQR.
- Any value outside these boundaries is considered an outlier. Such values can be capped or floored to the upper and lower boundaries respectively (Winsorizing), or simply removed (Trimming).

Q-Q plots - these plot the values of a variable against its percentiles. In this approach, values above the 99th percentile are capped to the 99th percentile and values below the 1st percentile are floored to the 1st percentile. Alternatively, these extreme values can be removed, as in trimming.

Note: If outliers are genuine, they should be kept in the model; if they are noise, they should be treated with the methods above. If the dataset is large, trimming can be used, but if it is small, winsorization is preferred so that observations are not lost.
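Both the IQR boundary rule and the percentile rule can be sketched with numpy; the data is synthetic, with two injected outliers.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(50, 5, 100), [150.0, -40.0])  # two injected outliers

# Boxplot IQR rule: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is an outlier.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Winsorizing: cap/floor extreme values to the boundaries.
x_winsorized = np.clip(x, lower, upper)

# Trimming: drop the outliers instead.
x_trimmed = x[(x >= lower) & (x <= upper)]

# Percentile rule from the Q-Q approach: cap above the 99th percentile,
# floor below the 1st percentile.
p1, p99 = np.percentile(x, [1, 99])
x_pct_capped = np.clip(x, p1, p99)
```

Note that `np.clip` implements winsorization directly, while the boolean mask implements trimming, so the choice between the two is a one-line change.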

Because machines only understand numbers, categorical variables need to be converted into numbers. There are a few methods to encode categorical variables:

1. One-hot encoding consists of creating a binary variable for each class in a categorical variable. E.g.:

City | Dublin | Cork |
---|---|---|
Dublin | 1 | 0 |
Cork | 0 | 1 |
Dublin | 1 | 0 |

It should be noted that one-hot encoding is useful when the variable has few classes. If the variable has many classes, one-hot encoding creates a large number of variables, making the model complex and less robust. When the number of classes is high, we can instead keep only the classes with high frequencies. E.g., in the table below, of the four regions, Dublin and Cork have high frequencies and are kept in the model, while the rest have low frequencies and are excluded. This frequency-based selection is useful only when a large number of labels is present in the variable.

City | Frequency |
---|---|
Dublin | 80% |
Cork | 15% |
Galway | 2.5% |
Limerick | 2.5% |
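One-hot encoding and the frequency-based reduction can be sketched with pandas; the city data and the 15% frequency threshold are illustrative choices, not fixed rules.

```python
import pandas as pd

# Hypothetical categorical variable with one dominant label.
df = pd.DataFrame({"City": ["Dublin"] * 8 + ["Cork"] * 2 + ["Galway", "Limerick"]})

# One-hot encoding: one binary column per class.
dummies = pd.get_dummies(df["City"], dtype=int)

# With many classes, keep only the high-frequency labels and group the rest.
freq = df["City"].value_counts(normalize=True)
top = freq[freq >= 0.15].index  # illustrative frequency threshold
df["City_reduced"] = df["City"].where(df["City"].isin(top), "Other")
```

Grouping rare labels into an "Other" bucket before one-hot encoding keeps the number of dummy columns small, which is the robustness point made above.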

2. Mean encoding consists of calculating the average of the target variable for each class of a categorical variable; the classes are then replaced with the corresponding means. This technique maintains the monotonicity of the data.

e.g., calculate the mean default rate for each rating grade:

Mean default rate for rating A: 0.02

Mean default rate for rating B: 0.04

Then replace each rating grade with its average, i.e.:

Rating A: 0.02

Rating B: 0.04
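The grade-to-default-rate replacement can be sketched with a pandas `groupby`; the ratings and default flags below are made up for illustration.

```python
import pandas as pd

# Hypothetical portfolio: rating grade and observed default flag (0/1).
df = pd.DataFrame({
    "rating": ["A", "A", "B", "B", "A", "B"],
    "default": [0, 0, 0, 1, 1, 1],
})

# Mean encoding: average default rate per rating grade.
means = df.groupby("rating")["default"].mean()

# Replace each grade with its grade-level mean default rate.
df["rating_encoded"] = df["rating"].map(means)
```

In practice the grade means should be computed on the training data only and then mapped onto the validation/test data, so the encoding does not leak target information.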

Scaling is the process of transforming a numerical variable onto a standardized scale or range. It is useful when the variables needed in the model are on different scales: they must be scaled before being put into the model (e.g., a regression model) so that the regression coefficients can be compared meaningfully. If the variables are not scaled, the coefficients cannot be compared, and the model might not give the desired results. There are different scaling methods:

- Standardization - transforms the variable to have a mean of 0 and a standard deviation of 1, using the formula (X - mean) / (standard deviation). The resulting values can lie anywhere between -infinity and +infinity.
- Min-max scaling - each data point has the minimum value of the variable subtracted and is then divided by the range (maximum minus minimum), i.e., (X - min) / (max - min). The resulting values lie between 0 and 1.
- Mean normalization - like min-max scaling, except the mean is subtracted in the numerator: (X - mean) / (max - min). The resulting values lie between -1 and 1.
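The three scaling formulas translate directly into numpy; the sample values are arbitrary.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Standardization: mean 0, standard deviation 1.
standardized = (x - x.mean()) / x.std()

# Min-max scaling: values in [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Mean normalization: values in [-1, 1].
mean_norm = (x - x.mean()) / (x.max() - x.min())
```

For a regression model, the scaling parameters (mean, min, max, standard deviation) should be estimated on the training data and reused on the validation and test data.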

Transformation means changing the distribution of a variable. It is used when the model makes a distributional assumption (e.g., a regression model assumes the dependent variable is normally distributed), or when it can increase the accuracy of the model, and is achieved using transformation functions.

Box-Cox transformation - this applies a power transformation to the original data. The general form is **Y(lambda) = (X^lambda - 1) / lambda**, where Y(lambda) is the transformed variable, X is the original variable, and lambda is the transformation power. Lambda can take any real value, including zero, and determines the type of transformation applied. To select the optimal lambda, the Box-Cox method uses maximum likelihood estimation (MLE): it tests various values of lambda and chooses the one that maximizes the log-likelihood of the transformed data, with different lambda values producing different transformations. If lambda is estimated as 0, the transformation becomes logarithmic, Y = log(X). The drawback of the Box-Cox method is that it assumes all data values are positive; if the data contains negative values, another method such as the Yeo-Johnson transformation can be used.
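The MLE selection of lambda can be sketched as a simple grid search in numpy (libraries such as SciPy provide this as `scipy.stats.boxcox`); the profile log-likelihood below drops constant terms, and the lognormal sample is synthetic.

```python
import numpy as np

def boxcox_loglik(x, lam):
    # Profile log-likelihood of the Box-Cox transform (constants dropped):
    # -(n/2) * log(var(Y(lambda))) + (lambda - 1) * sum(log(x)).
    y = np.log(x) if lam == 0 else (x**lam - 1) / lam
    n = len(x)
    return -n / 2 * np.log(y.var()) + (lam - 1) * np.log(x).sum()

rng = np.random.default_rng(42)
x = rng.lognormal(mean=0.0, sigma=1.0, size=2000)  # positive, right-skewed

# Grid search: try many lambdas and keep the one with the highest likelihood.
grid = np.linspace(-2, 2, 401)
best_lam = max(grid, key=lambda lam: boxcox_loglik(x, lam))

# Apply the chosen transformation.
y = np.log(x) if best_lam == 0 else (x**best_lam - 1) / best_lam
```

Because the sample is lognormal, the estimated lambda lands near 0, i.e., the method recovers the logarithmic transformation, as described above.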

In this step, the population is divided into training, validation, and testing datasets, typically 60%, 20%, and 20% respectively. On the training dataset the model is trained; on the validation dataset the best-fitted model, with significant variables, is chosen after fine-tuning; and on the testing dataset the model's performance is evaluated to ascertain how well it performs on unseen data.

If there are many significant variables, different models (also called candidate models) can be built on different subsets of them, because building one model with a large number of variables makes it complex and less robust. Choosing the final model among many candidates can be difficult, because a model can outperform the others purely by random chance, which is not a sound basis for selection. K-fold cross-validation is used to avoid this: the data is divided into K equal-sized, non-overlapping subsets called folds. The model is trained K times, each time using K-1 folds for training and one fold for validation, with a different fold serving as the validation set in each iteration. The model is then evaluated on the average performance across all iterations, and the model with the best average is chosen as the final model.

Example of K-fold cross-validation with K = 5: suppose there are 200 observations. Each fold has 200/5 = 40 observations, so in each iteration 160 observations form the training set and the remaining 40 form the validation set, with the validation fold changing in each iteration.

| Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 |
|---|---|---|---|---|
| Training | Training | Training | Training | Validation |
| Training | Training | Training | Validation | Training |
| Training | Training | Validation | Training | Training |
| Training | Validation | Training | Training | Training |
| Validation | Training | Training | Training | Training |

In this case, the model is trained 5 times across the 5 folds, giving 5 performance figures; the average of these 5 performances is taken and compared with the averages of the other candidate models. The best model is the one with the highest average performance. This is K-fold cross-validation.
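The 5-fold scheme above can be sketched with numpy; the data is synthetic, and a one-parameter least-squares slope stands in for a real candidate model.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=200)
y = 2.0 * X + rng.normal(scale=0.5, size=200)  # synthetic target

# 5-fold cross-validation: 200 observations -> 5 folds of 40.
k = 5
indices = rng.permutation(200)
folds = np.array_split(indices, k)

scores = []
for i in range(k):
    val_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    # Toy candidate model: least-squares slope fitted on the K-1 training folds.
    slope = (X[train_idx] @ y[train_idx]) / (X[train_idx] @ X[train_idx])
    # Evaluate on the held-out fold (mean squared error).
    mse = np.mean((y[val_idx] - slope * X[val_idx]) ** 2)
    scores.append(mse)

avg_performance = np.mean(scores)  # compare this average across candidate models
```

Each candidate model gets its own `avg_performance`, and the model with the best average across folds is selected, which removes the luck of a single favorable split.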

It is important to note that overfitting can still occur: the model may perform very well on the validation dataset but much worse on the testing dataset.

Data preparation is a very important part of credit risk modeling. Before putting the data into the model, it is important to understand it through exploratory data analysis: checking the percentage of missing values and outliers for each variable, computing the mean, median, and mode, examining relationships between variables, and treating missing values and outliers. One-hot encoding converts categorical variables with different classes into binary variables. Scaling and transformation are applied where necessary: scaling matters when variables are on different scales and must be brought onto the same scale so that coefficients can be compared meaningfully, while transformation is needed when the model assumes a particular distribution, with the transformation chosen by maximum likelihood. Finally, once the data is ready, it is partitioned into training, validation, and testing datasets, generally with a 60/20/20% rule: the model is trained on the training dataset, the best-fitted model is selected and fine-tuned on the validation dataset, and the final model's performance is evaluated on the unseen testing dataset.