As you may know, one of the most important, or at least most time-consuming, parts of the whole data mining process is data preprocessing. One common task is dealing with missing values. In most databases, they are noted NaN (Not a Number) or simply ?. Before normalizing or standardizing a data set, you should take care of these values.
For handling missing values, several techniques exist:
- Ignore the entire line (problematic when many values are missing)
- Put a value manually (usually very difficult)
- Use the attribute mean or median (what about outliers?)
- Use the class mean or median (need to know the class distribution)
- Predict the value with regression or a decision tree (not straightforward)
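To make three of these options concrete, here is a minimal sketch in plain Python: dropping incomplete rows, filling with the attribute mean or median, and filling with the class mean or median. The toy data, the set of missing-value markers, and all function names are assumptions made for illustration; they do not come from any particular library or data set.

```python
from statistics import mean, median

# Assumed markers for a missing value; adjust to the data set at hand.
MISSING_MARKERS = {"?", "NaN", "nan", ""}

def parse_value(token):
    """Return a float, or None when the token marks a missing value."""
    token = token.strip()
    if token in MISSING_MARKERS:
        return None
    return float(token)

# Hypothetical toy data: (attribute value as text, class label).
raw_rows = [
    ("1.0", "a"), ("?", "a"), ("3.0", "a"),
    ("10.0", "b"), ("NaN", "b"), ("14.0", "b"),
]
rows = [(parse_value(value), label) for value, label in raw_rows]

def drop_missing(rows):
    """Ignore every row whose attribute value is missing."""
    return [(v, c) for v, c in rows if v is not None]

def impute_attribute(rows, stat=mean):
    """Fill missing values with a statistic of all observed values."""
    fill = stat([v for v, _ in rows if v is not None])
    return [(fill if v is None else v, c) for v, c in rows]

def impute_by_class(rows, stat=mean):
    """Fill missing values with a statistic of the same class.

    Raises KeyError if a class has no observed value at all.
    """
    observed = {}
    for v, c in rows:
        if v is not None:
            observed.setdefault(c, []).append(v)
    fills = {c: stat(values) for c, values in observed.items()}
    return [(fills[c] if v is None else v, c) for v, c in rows]

print(drop_missing(rows))
print(impute_attribute(rows))          # attribute mean
print(impute_attribute(rows, median))  # attribute median
print(impute_by_class(rows))           # class mean
```

On this toy data the two classes have very different ranges, so the attribute mean gives both missing entries the same value, while the class mean keeps each filled value close to the other members of its class. Passing median instead of mean is one way to reduce the influence of outliers.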
In many research papers, the technique used to handle missing values is not explicitly mentioned. A lot of data mining research papers use the UCI repository for testing algorithms. However, most of the data sets present on this repository have (several) missing values. So what did the authors do? Did they ignore the records or use a specific method? And what do you use for handling missing values?