A note on correlation
Correlation is often used as a preliminary technique to discover relationships between variables. More precisely, correlation is a measure of the strength of the linear relationship between two variables. Pearson's correlation coefficient between X and Y is defined as

ρ(X, Y) = cov(X, Y) / (σ_X σ_Y),

i.e., the covariance of the two variables divided by the product of their standard deviations.
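As a minimal sketch, the definition can be computed directly in NumPy and compared against the library's built-in estimate (the helper name `pearson` is mine, not from the post):

```python
import numpy as np

def pearson(x, y):
    # Pearson's r: covariance of x and y divided by the
    # product of their standard deviations.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / (x.std() * y.std())

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])
print(pearson(x, y))  # agrees with np.corrcoef(x, y)[0, 1]
```

Note that the n versus n − 1 normalization cancels between numerator and denominator, so the population and sample versions of the formula give the same coefficient.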
As noted above, the main drawback of correlation is the restriction to linear relationships. If the correlation between two variables is zero, they may still be nonlinearly related. As written in Tan et al. (2006), x and x^2 have a correlation of zero but are nonlinearly related. Note that nonlinear does not mean polynomial. Consider for example x and cos(x): although their correlation is close to zero, they are related.
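This is easy to check numerically. The sketch below (purely illustrative; the symmetric sampling domain is my assumption, which is what makes the correlations vanish) shows that x versus x^2 and x versus cos(x) both have near-zero Pearson correlation, even though each pair is deterministically related:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample x symmetrically around zero: x and x**2 (and x and cos(x))
# are then uncorrelated, even though one is a deterministic
# function of the other.
x = rng.uniform(-1.0, 1.0, 100_000)

r_square = np.corrcoef(x, x**2)[0, 1]
r_cosine = np.corrcoef(x, np.cos(x))[0, 1]
print(r_square, r_cosine)  # both close to zero
```

The effect depends on the symmetry of the domain: over x in [0, 1], for instance, x and x^2 are strongly positively correlated.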
P.N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2006.
Comments
7 Comments on A note on correlation

Kevin Hillstrom on
Wed, 14th Mar 2007 1:18 am

What metric do you recommend using in place of correlation? In other words, what metric do you recommend that is good at telling me that "x" and "x^2" are essentially correlated?

Dean Abbott on
Wed, 14th Mar 2007 5:52 am

Ahh, the ol' x vs. x^2 example! I use that one all the time in teaching about correlations. I usually use correlations as a first step in eliminating variables that are replicates of one another: if two variables are correlated at 0.95 or above (or −0.95 or below), they bring essentially the same information to the table, and only one is needed to convey that piece of information.

Regarding Kevin's question: it is a good one. I actually wrote a paper (not a very good one) for an IEEE Systems, Man, and Cybernetics conference on the topic of "nonlinear" correlations, and proposed an algorithm to find these nonlinear relationships. It basically fit simple nonlinear models of every variable and used a scoring metric (like R^2) to assess how nonlinearly related the variables were to each other. But I have never found anything very clean in this regard, and typically don't worry about these nonlinear relationships.

The biggest reason people worry about the linear correlation problem is that collinearity is a destructive problem in regression models. But "nonlinear correlation" is not, and in fact including both terms in a regression model can be quite a good idea precisely because they are (linearly) orthogonal.

Dean

damien françois on
Wed, 14th Mar 2007 4:13 pm

Indeed, correlation is only able to spot linear dependencies between variables. For nonlinear dependencies, one can consider an order-based version of the correlation, known as rank correlation, which correlates ranks instead of values. This approach is still not suitable for detecting non-monotonic relationships (such as x → x^2 over a domain centered on zero). Then, mutual information can be used, but it is much more difficult to estimate than correlation.

Amit on
Thu, 15th Mar 2007 5:22 am

Damien has answered that question to an extent. Spearman's rank correlation is a way of measuring correlation in monotonic relationships.

And to add, whether Pearson's correlation, described by Sandro, will give good results depends on the two variables in question. For example, if you are talking about two non-continuous variables, the story changes and you need either a chi-square test or a point-biserial correlation.

Push on
Thu, 12th Apr 2007 3:26 am

Hi,
I am using Pearson correlation for a movie rank prediction problem. What I am wondering is whether I will get good results even if there is no linear relationship between users who rank movies.
Thanks,
Pushkar Raste

Jeff Zanooda on
Thu, 24th May 2007 8:52 pm

In credit scoring, information value is routinely used in univariate analysis. Another popular approach is to look at both Spearman's rank correlation and Hoeffding's measure of dependence.

Vaishnavi S on
Sat, 24th Sep 2016 8:09 am

So I have two text files with some educational content. What would it signify to find the degree of correlation between the two text files?