Missing observations must be handled with care, because both the type of missingness and its pattern need to be checked before any treatment is chosen. The treatment should therefore clearly distinguish 'meaningful missing data' (e.g., the value of the collateral when there is no collateral) from 'technical missing data' (e.g., information not provided on the customer's application) and follow a reasonable and conservative approach. This helps to prevent adverse outcomes such as low discriminatory power and biased predictions.
Missing data values are usually split into two categories:
In some cases, the reason why data is missing for a facility or an obligor is known. Such values are interpretable and must be treated differently; they are called 'meaningful missing values'.
In other cases, the data is truly missing and carries no information. Examples include technical errors during data merges, information left blank on the application form, and data withheld for privacy reasons. These values are called 'technical missing data'.
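To make the distinction concrete, the following minimal sketch labels a collateral value as observed, meaningful missing, or technical missing. The field names (`has_collateral`, `collateral_value`), the label strings, and the rule that an unknown collateral flag counts as technical missing are assumptions made for illustration; they are not prescribed by any guideline.

```python
from typing import Optional


def classify_missing(has_collateral: Optional[bool],
                     collateral_value: Optional[float]) -> str:
    """Label one record's collateral value (hypothetical rule set)."""
    if collateral_value is not None:
        return "observed"
    if has_collateral is False:
        # No collateral exists, so the absence of a value is interpretable.
        return "meaningful_missing"
    # Collateral exists (or its existence is unknown) but no value was recorded.
    return "technical_missing"


# Illustrative records only; None stands for a missing entry.
records = [
    {"facility_id": "F1", "has_collateral": True, "collateral_value": 120000.0},
    {"facility_id": "F2", "has_collateral": False, "collateral_value": None},
    {"facility_id": "F3", "has_collateral": True, "collateral_value": None},
    {"facility_id": "F4", "has_collateral": None, "collateral_value": None},
]

labels = {
    r["facility_id"]: classify_missing(r["has_collateral"], r["collateral_value"])
    for r in records
}
```

Keeping the label as a separate field, rather than overwriting the missing value, allows each category to receive its own treatment later.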
'Meaningful missing values' and 'technical missing data' call for different treatments, but that level of detail is outside the scope of this article. If a high proportion of 'technical missing data' is present in the dataset, and it is not limited to specific periods in time, the variable would not be considered as a risk driver. Institutions should support cooperation between modellers and validators by specifying the level of this proportion that is considered unacceptable. Financial institutions should also consider thoroughly, and decide in advance, which treatment will be applied.
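The check described above could be sketched as follows: compute the share of technical missing values per period and exclude a variable only when the breach is not confined to specific periods. The 20% threshold, the function names, and the reading of "not limited to specific periods" as "every period breaches" are all assumptions for illustration; an institution would set its own limit and rule. The sketch also assumes that `None` marks technical missing data only, with meaningful missing values already separated out.

```python
from collections import defaultdict

# Hypothetical limit; the acceptable level is for the institution to define.
MAX_TECHNICAL_MISSING_RATIO = 0.20


def technical_missing_ratio_by_period(rows, variable):
    """Return {period: share of rows where `variable` is None}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        totals[row["period"]] += 1
        if row[variable] is None:
            missing[row["period"]] += 1
    return {p: missing[p] / totals[p] for p in sorted(totals)}


def breaching_periods(ratios, threshold=MAX_TECHNICAL_MISSING_RATIO):
    """Periods whose missing share exceeds the threshold."""
    return [p for p, ratio in ratios.items() if ratio > threshold]


def is_excluded(ratios, threshold=MAX_TECHNICAL_MISSING_RATIO):
    """Assumed rule: exclude only if every period breaches the threshold."""
    return bool(ratios) and len(breaching_periods(ratios, threshold)) == len(ratios)


# Illustrative data only.
incomes = {"2021": [None, 50, 60, 70], "2022": [None, None, 40, 50]}
ages = {"2021": [40, 35, 52, 29], "2022": [None, None, None, 30]}
rows = [
    {"period": period, "income": income, "age": age}
    for period in incomes
    for income, age in zip(incomes[period], ages[period])
]

income_ratios = technical_missing_ratio_by_period(rows, "income")
age_ratios = technical_missing_ratio_by_period(rows, "age")
income_excluded = is_excluded(income_ratios)  # breaches in both periods
age_excluded = is_excluded(age_ratios)        # breach confined to 2022
```

In this toy data `income` breaches the limit in every period and is excluded, whereas `age` breaches only in 2022, which would instead prompt a closer look at that period.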
The governance of missing values is challenging, and records with missing values should be treated with care. For example, the modeller must provide a justification when deciding to remove such data completely. Ratio risk drivers also need to be defined with attention, since they are computed as a numerator divided by a denominator: what happens when the denominator equals zero, for instance, must be specified. It should also be ensured that missing values are not recorded as zero. Imputation (replacement) is another way to deal with missing data; a common approach is to replace missing values with the mean or median of the known values of the risk factor.
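As a rough illustration of these points, the sketch below computes a ratio risk driver with an explicit status for each record, so that a zero denominator and a missing input are never silently stored as zero, and then applies median imputation to the records with missing inputs. The variable names (`debt`, `income`), the status labels, and the choice to leave zero-denominator cases unimputed for separate treatment are assumptions; the appropriate policy is for the institution to define.

```python
from statistics import median


def safe_ratio(numerator, denominator):
    """Return (value, status); value is None unless the ratio is defined."""
    if numerator is None or denominator is None:
        return None, "missing_input"
    if denominator == 0:
        # Undefined ratio: flagged explicitly, never stored as 0.
        return None, "zero_denominator"
    return numerator / denominator, "ok"


# Illustrative data only; None stands for a missing entry.
debt = [50, 10, None, 30, 20]
income = [100, 0, 80, 60, 80]

results = [safe_ratio(n, d) for n, d in zip(debt, income)]

# Median of the known values of the risk factor.
observed = [value for value, status in results if status == "ok"]
fill_value = median(observed)

# Impute only records with missing inputs; keep a flag for auditability.
imputed = [
    fill_value if status == "missing_input" else value
    for value, status in results
]
was_imputed = [status == "missing_input" for _, status in results]
```

Carrying the `was_imputed` flag alongside the data keeps the replacement traceable, which makes it easier to justify the treatment to validators.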