A note on correlation

March 13, 2007 by
Filed under: correlation, variable relationship 

Correlation is often used as a preliminary technique to discover relationships between variables. More precisely, the correlation is a measure of the linear relationship between two variables. Pearson’s correlation coefficient is defined as:
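In its usual sample form, for n paired observations (x_i, y_i) with sample means \bar{x} and \bar{y}, the coefficient can be written as follows. The symbols are an assumption chosen for illustration, since the post does not fix a notation:

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The value of r lies between -1 and 1, and it reaches either extreme only when the points fall exactly on a straight line.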

As noted above, the main drawback of correlation is that it only captures linear relationships. A correlation of zero between two variables does not mean they are unrelated: they may be non-linearly related. As pointed out in Tan et al. (2006), x and x^2 have a correlation of zero (when the values of x are spread symmetrically around zero), yet one is a deterministic function of the other. Keep in mind that non-linear does not mean polynomial. Consider for example x and cos(x): over a suitable range their correlation is close to zero, yet they are clearly related.
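The point can be checked numerically. The following is a minimal sketch, not taken from the post or from Tan et al.: the helper name `pearson`, the grid of x values from -3 to 3, and the linear comparison case are all assumptions made for illustration.

```python
import numpy as np


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))


# Assumed illustration grid: evenly spaced points, symmetric around zero.
x = np.linspace(-3.0, 3.0, 601)

r_linear = pearson(x, 2.0 * x + 1.0)  # exact linear relationship
r_square = pearson(x, x**2)           # deterministic, but not linear
r_cosine = pearson(x, np.cos(x))      # deterministic, but not linear

print(f"x vs 2x + 1: {r_linear:.3f}")
print(f"x vs x^2:    {r_square:.3f}")
print(f"x vs cos(x): {r_cosine:.3f}")
```

On this symmetric grid the last two coefficients vanish up to floating-point error, even though x^2 and cos(x) are completely determined by x. Shifting the grid away from zero (for example, 0 to 3) would generally give a non-zero value, so the result depends on the range of x as well as on the shape of the relationship.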

P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2006.

7 Comments on A note on correlation

  1. Kevin Hillstrom on Wed, 14th Mar 2007 1:18 am

    What metric do you recommend using in place of correlation? In other words, which metric is good at telling me that “x” and “x^2” are essentially correlated?

  2. Dean Abbott on Wed, 14th Mar 2007 5:52 am

    Ahh, the ol’ x vs. x^2 example! I use that one all the time in teaching about correlations. I usually use correlations as a first step in eliminating variables that are replicates of one another: if two variables are correlated at 0.95 or above (or -0.95 or below), they bring essentially the same information to the table, and only one is needed to convey that piece of information.

    Regarding Kevin’s question: it is a good one. I actually wrote a paper (not a very good one) for an IEEE Systems, Man, and Cybernetics conference on the topic of “nonlinear” correlations, and proposed an algorithm to find these nonlinear relationships. It basically fit simple nonlinear models of every variable and used a scoring metric (like R^2) to assess how related the variables were to each other nonlinearly. But I have never found anything very clean in this regard, and typically don’t worry about these nonlinear relationships.

    The biggest reason people worry about the linear correlation problem is that collinearity is a destructive problem in regression models. But “nonlinear correlation” is not, and in fact, including both terms in a regression model can be quite a good idea precisely because they are orthogonal (linearly).

    Dean

  3. damien françois on Wed, 14th Mar 2007 4:13 pm

    Indeed, correlation is only able to spot linear dependencies between variables. For nonlinear dependencies, one can consider an order-based version of the correlation, known as rank correlation, which correlates ranks instead of values. This approach is still not suitable for detecting non-monotonic relationships (such as x->x^2 over a domain centered on zero). In that case, mutual information can be used, but it is much more difficult to estimate than correlation.

  4. Amit on Thu, 15th Mar 2007 5:22 am

    Damien has answered that question to an extent. Spearman’s Rank Correlation is one way to measure correlation in monotonic relationships.

    And to add, whether Pearson’s Correlation, described by Sandro, will give good results depends on the two variables in question.

    For example, if you are talking about two non-continuous variables, the story changes and calls for either a Chi-square test or a point-biserial correlation.

  5. Push on Thu, 12th Apr 2007 3:26 am

    Hi,

    I am using Pearson correlation for a movie rank prediction problem. What I am wondering is whether I will get good results even if there is no linear relationship between users who rank movies.

    Thanks,
    Pushkar Raste

  6. Jeff Zanooda on Thu, 24th May 2007 8:52 pm

    In credit scoring, information value is routinely used in univariate analysis.

    Another popular approach is to look at both Spearman’s rank correlation and Hoeffding’s measure of dependence.

  7. Vaishnavi S on Sat, 24th Sep 2016 8:09 am

    So I have two text files with some educational content. What would it signify to find the degree of correlation between the two text files?
