In certain situations, the data miner has to sample the dataset before applying any algorithm, mainly because there is too much data to mine. In such cases, one possible technique is random sampling. If the classes are uniformly distributed, random sampling can be used before supervised learning.

But what about association rule mining? If you randomly sample the rows before running an association rule algorithm, you may end up finding no rules at all. The reason is that association rule mining analyses the data as transactions: the idea is to find recurrent patterns in a set of transactions, and the rows that make up each transaction are usually stored contiguously. Here is an example:

| Transaction ID | Product |
|---|---|
| 112 | bread |
| 112 | butter |
| 112 | jam |
| 113 | cheese |
| 113 | bread |
| ... | ... |

The issue with random sampling is that it does not take the contiguous sequence of rows into account: rows drawn independently of each other no longer form complete transactions. For association rules, one should instead take a contiguous subset of the data in order to get meaningful rules.
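One possible way to sample while keeping transactions intact is to draw transaction IDs rather than individual rows, and then keep every row of each selected transaction. The sketch below only illustrates that idea: the toy rows and the helper names `sample_transactions` and `basket_sizes` are assumptions made for this example, not part of any particular tool.

```python
import random
from collections import defaultdict

# Toy data in the same "transaction ID / product" layout as the example above.
rows = [
    (112, "bread"), (112, "butter"), (112, "jam"),
    (113, "cheese"), (113, "bread"),
    (114, "bread"), (114, "butter"),
    (115, "jam"), (115, "cheese"), (115, "bread"),
]

def sample_transactions(rows, fraction, seed=0):
    """Keep a random fraction of transaction IDs, with ALL of their rows."""
    rng = random.Random(seed)
    ids = sorted({tid for tid, _ in rows})
    k = max(1, round(fraction * len(ids)))
    kept = set(rng.sample(ids, k))
    return [(tid, item) for tid, item in rows if tid in kept]

def basket_sizes(rows):
    """Count how many rows (items) each transaction has."""
    sizes = defaultdict(int)
    for tid, _ in rows:
        sizes[tid] += 1
    return dict(sizes)

sampled = sample_transactions(rows, fraction=0.5)
print(sampled)
```

Every transaction that survives the sampling still contains its complete basket, which is what an association rule algorithm needs.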

Do you have any other examples where random sampling can’t be used? Other issues with association rule mining? Feel free to comment on this post.

An interesting example involves “points” that are all related to each other by a process, as in the following:

Consider a stochastic process that generates a fractal (like the famous fern).

One of the most challenging problems is to find the affine transformations that describe the underlying process, where every point is generated from the previous one (a memoryless process).

As you can imagine, randomly extracting a set of points in order to find the relations among them doesn’t work.

Indeed, all the precious information lies in consecutive points of the history!

Cheers.

http://textanddatamining.blogspot.com/

Simple random sampling can also break down, it would seem, when one target level is exceptionally rare. In this case, you need to use stratified sampling.
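As a rough illustration of stratified sampling, the sketch below samples each class separately so that a rare class cannot vanish from the sample. The function name `stratified_sample`, the "at least one record per class" rule, and the toy "fraud"/"ok" labels are assumptions chosen for this example.

```python
import random
from collections import defaultdict

def stratified_sample(records, label_of, fraction, seed=0):
    """Sample the same fraction from every class, keeping at least one record per class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for record in records:
        by_label[label_of(record)].append(record)
    sample = []
    for group in by_label.values():
        k = max(1, round(fraction * len(group)))
        sample.extend(rng.sample(group, k))
    return sample

# Toy dataset: 10 rare "fraud" records among 1000.
records = [("fraud", i) for i in range(10)] + [("ok", i) for i in range(990)]
sample = stratified_sample(records, label_of=lambda r: r[0], fraction=0.1)
n_fraud = sum(1 for label, _ in sample if label == "fraud")
print(len(sample), n_fraud)
```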

Any modeling on time series data (assuming a transactional format with multiple rows per observational unit) cannot use simple random sampling of rows; it needs cluster sampling instead, like what you describe for association / sequence analysis.

Any time the records are not independent, you can’t sample randomly. Your example is a good one, and this occurs in other situations as well. For example, if each record is a credit card transaction, with individual customers having regular purchase patterns and dozens to hundreds of transactions each, a simple random split into training/testing causes other problems as well. If we use held-out data to assess models, the assessment will be optimistically biased: the models will appear to be better than they really are because examples from the same customer appear in both the training and testing subsets.
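One way to avoid this kind of leakage is to split by customer instead of by transaction, so that each customer's records land entirely in training or entirely in testing. The following is a minimal sketch under assumed names: the `customer` field, the `split_by_customer` helper and the toy data are all hypothetical.

```python
import random

def split_by_customer(transactions, test_fraction=0.3, seed=0):
    """Assign whole customers (not single transactions) to train or test."""
    rng = random.Random(seed)
    customers = sorted({t["customer"] for t in transactions})
    rng.shuffle(customers)
    n_test = max(1, round(test_fraction * len(customers)))
    test_customers = set(customers[:n_test])
    train = [t for t in transactions if t["customer"] not in test_customers]
    test = [t for t in transactions if t["customer"] in test_customers]
    return train, test

# Toy data: 10 customers with 5 transactions each.
transactions = [
    {"customer": c, "amount": 10 * c + i}
    for c in range(1, 11)
    for i in range(5)
]
train, test = split_by_customer(transactions)

# No customer appears on both sides of the split.
train_ids = {t["customer"] for t in train}
test_ids = {t["customer"] for t in test}
print(sorted(test_ids), train_ids.isdisjoint(test_ids))
```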

Thanks for your examples! Really interesting to see other applications where random sampling is not a good choice. Unbalanced classes are a good example: depending on the sample size and the frequency of the rare class, one may end up with no examples of the rare class at all.
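To get a feel for how likely that is, one can use a simple approximation: if each sampled record is rare with probability p, independently of the others (an assumption that is reasonable when the sample is small relative to the dataset), the chance that n sampled records contain no rare example is (1 - p)^n. The helper name `prob_no_rare` below is made up for this sketch.

```python
def prob_no_rare(rare_rate, sample_size):
    """Probability that a sample contains no rare-class record,
    assuming independent draws with a fixed rare-class rate."""
    return (1.0 - rare_rate) ** sample_size

# A class that makes up 0.1% of the data, and a sample of 1000 records:
p_miss = prob_no_rare(rare_rate=0.001, sample_size=1000)
print(round(p_miss, 3))  # about 0.37
```

In other words, under these assumptions roughly one sample in three would contain no example of the rare class.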

Random sampling for association rules is a well-studied problem. The vignette to the R package ‘arules’ has a quick overview with references; see section 4.3 of http://cran.r-project.org/web/packages/arules/vignettes/arules.pdf . The package implements a sample() function that you might find interesting.

@Allan: thanks for the info and the link!

The basic algorithm for selecting the itemsets (the items of interest in the analysis) is the Apriori algorithm. It also cannot work well on random samples. The main reason is that Apriori looks for itemsets that satisfy an interestingness criterion in the data. If we use a ‘rarity’ criterion, we will surely miss the rare cases when random samples are selected for analysis.

Association rules are formed by studying data for recurrent “if/then” patterns, along with the criteria of support and confidence, in order to find the most significant connections. These rules are important in data mining, particularly for analyzing and predicting consumer behavior. They are significant in product grouping, shopping basket data analysis, store layout and catalog design.
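For readers unfamiliar with the two criteria: the support of an itemset is the fraction of transactions that contain it, and the confidence of a rule "if A then B" is the support of A and B together divided by the support of A. The small sketch below computes both on toy baskets; the data and the function names `support` and `confidence` are illustrative assumptions.

```python
# Toy baskets, one set of products per transaction.
transactions = [
    {"bread", "butter", "jam"},
    {"cheese", "bread"},
    {"bread", "butter"},
    {"jam", "cheese", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of (antecedent and consequent) divided by support of antecedent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # rule never applies in this data
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base

print(support({"bread", "butter"}, transactions))       # 0.5
print(confidence({"butter"}, {"bread"}, transactions))  # 1.0
print(confidence({"bread"}, {"butter"}, transactions))  # 0.5
```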

The Role of Association Rules in Data Mining