Stratification for data mining
One common issue in data mining is the size of the data set. It is often limited. When this is the case, the test of the model is an issue. Usually, 2/3 of the data are used for training and validation and 1/3 for final testing. By chance, the training or the test set may not be representative of the overall data set. Consider for example a data set of 200 samples and 10 classes. It is likely that one of these 10 classes is not represented in the validation or test set.
To avoid this problem, you should take care of the fact that each class should be correctly represented in both the training and testing sets. This process is called stratification. One way to avoid doing stratification, regarding the training phase is to use k-fold cross-validation. Instead of having only one given validation set with a given class distribution, k different validation sets are used. However, this process doesn’t guarantee a correct class distribution among the training and validation sets.
And what about the test set? The test set can only be used once, on the final model. Therefore, no method such as cross-validation can be used. There is no guarantee that the test contains all the classes that are present in the data sets. However, this situation is more likely to happen when the number of samples is small and the number of class is high. In this situation, the stratification process may be crucial. I’m wondering if people usually apply stratification or not and why. Feel free to comment on this issue regarding your personal experience.
More details about stratification can be found in the book Data Mining: Practical Machine Learning tools and techniques, by Witten and Frank (2005).