Standardization vs. normalization
Filed under: data preprocessing, normalization, scaling, standardization
In the overall knowledge discovery process, before data mining itself, data preprocessing plays a crucial role. One of the first steps is the normalization of the data. This step is very important when dealing with parameters of different units and scales. For example, some data mining techniques use the Euclidean distance; therefore, all parameters should be on the same scale for a fair comparison between them.
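To make the scale problem concrete, here is a small sketch (the attributes and numbers are invented for illustration): when two attributes live on very different scales, the Euclidean distance is dominated entirely by the larger-scaled one.

```python
import numpy as np

# Two hypothetical customers: (height in cm, income in dollars)
a = np.array([170.0, 30000.0])
b = np.array([180.0, 31000.0])

# Without rescaling, income dominates the distance:
# sqrt(10^2 + 1000^2) is about 1000.05, so the 10 cm height
# difference contributes almost nothing to the result.
dist = np.linalg.norm(a - b)
print(dist)
```

Any distance-based method (k-means, k-NN, etc.) would effectively ignore the height attribute here, which is exactly why rescaling matters.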
Two methods for rescaling data are particularly well known. The first is normalization, which scales all numeric variables into the range [0,1]. One possible formula is given below:

x_new = (x - x_min) / (x_max - x_min)
On the other hand, you can use standardization on your data set. It will transform the data to have zero mean and unit variance, for example using the equation below:

x_new = (x - mu) / sigma

where mu is the mean and sigma the standard deviation of the variable.
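As a quick sketch (toy numbers, plain NumPy rather than any particular library), the two transformations look like this:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # toy feature values

# Normalization (min-max scaling) into [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: zero mean, unit variance
x_std = (x - x.mean()) / x.std()

print(x_norm)  # [0.   0.25 0.5  0.75 1.  ]
print(x_std)   # mean 0, standard deviation 1
```

In practice, scikit-learn's MinMaxScaler and StandardScaler implement these same formulas.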
Both of these techniques have their drawbacks. If you have outliers in your data set, normalization will certainly scale the "normal" data into a very small interval; and in general, most data sets contain outliers. When using standardization, on the other hand, your transformed data are not bounded (unlike with normalization).
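To see the outlier problem concretely, here is a small sketch (the values are invented): a single extreme value pushes all the "normal" observations into a tiny sub-interval of [0,1].

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 1000.0])  # last value is an outlier

# Min-max normalization into [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# The four "normal" values all end up below 0.004,
# squashed together by the single outlier.
print(x_norm)
```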
So my question is: what do you usually use when mining your data, and why?
Note: Thanks to Benny Raphael for fruitful discussions on this topic. Thanks to Jim for a correction on May 26th 2014.
Comments
26 Comments on Standardization vs. normalization

Fay on Thu, 12th Jul 2007 2:47 pm
Sometimes we can take logarithms of the input data when it contains values that differ by orders of magnitude. However, since logarithms are defined for positive values only, we need to take care when the input data may contain zero or negative values.
You did very good work on your blog!

Will Dwinnell on Thu, 12th Jul 2007 11:00 pm
A few points come to mind:
1. Monotonic scaling of the data (assuming that distinct values are not collapsed) will have no effect on the most common logical learning algorithms (tree- and rule-induction algorithms).
2. There are robust alternatives, such as: subtract the median and divide by the IQR, or scale linearly so that the 5th and 95th percentiles meet some standard range.
3. Outliers (and, technically, high-leverage points) present an interesting challenge. One possibility is to winsorize the data after scaling it.

Sandro Saitta on Fri, 13th Jul 2007 3:46 pm
Thanks for your comment Fay. I agree with you on taking the log. I used to work with data in the range 10^6 to 10^12, for example. And thanks for the remark.
Will, your suggestions seem very interesting. I don't know the "winsorize" technique, but it seems it could be used in addition to normalization.

Will Dwinnell on Fri, 13th Jul 2007 6:44 pm
For readers who are not aware of this technique: "winsorizing" data simply means clamping the extreme values.
This is similar to trimming the data, except that instead of discarding data, values greater than the specified upper limit are replaced with the upper limit, and those below the lower limit are replaced with the lower limit. Often, the specified range is indicated in terms of percentiles of the original distribution (like the 5th and 95th percentiles).
This process is sometimes used to make conventional measures more robust, as in the Winsorized variance.

mb on Fri, 30th Nov 2007 1:23 am
Will, can you tell me how I can scale linearly so that the 5th and 95th percentiles meet some standard range?
Can this be done with both negative and positive values?
Another question: if I want to compute an index where not only the units and scales are different, but the input metrics also have different interpretations (specifically, one metric is better if the values are higher and another is better if the values are lower), how can I compute an index that represents all numbers concisely and meaningfully?
Let's say I have Expenses ($), Profits ($) and Turnover (%). Expenses and Turnover are better if lower, but Profits are better if higher.
If comparing two companies on these metrics, and I want to compute one index to show the "best" performing company on these parameters, how can I do this?
Sorry, not strictly data-mining relevant, but I thought someone here might have an answer!
I tried using z-scores and normalizing, but it doesn't work due to the different high-low interpretations.
Eventually I used a reverse rank for Expenses and Turnover so that all metrics have the same order. However, rank does not show the quantity difference between the two companies, just their ranks!
This is a great blog; thanks to all for the helpful comments.

Sandro Saitta on Thu, 6th Dec 2007 10:26 am
First, you can normalize/standardize your data. Alternatively, you can decide to manually fix weights for each of these metrics.
You can, for example, use an objective function. Say you want to maximize a function of the Expenses, Profits and Turnover. In the objective function, give a negative weight to Expenses and Turnover and a positive one to Profits. I don't know if this will work for your problem, but that would be my first guess.

shrawan bhattacharjee on Tue, 14th Sep 2010 12:19 pm
Hi Sandro,
Interesting article and the comments which followed. I am also dealing with analysis of mass spectrometry data. This kind of data suffers significant variation due to instrumental errors and limitations (even if the same sample is analyzed). Presently I am using a log transformation, which is giving satisfactory results. But I am still skeptical about possible false positives, as the data is in the range of 10^3 to 10^6. So what do you suggest as the best method of normalizing this kind of data?
I am also confused by the two terms you mentioned, 'standardization' and 'normalization': which should be used with mass spectrometry data analysis? I have only found research articles mentioning normalization for this kind of data, not standardization, although when I explored the internet, both techniques were referred to on similar grounds.
What are your views regarding this query?

Sandro Saitta on Tue, 4th Jan 2011 6:07 pm
@shrawan: If you know that your data contains several outliers, then I guess you should rather try standardization.

Data Mining And Statistics For Decision Makingプレ勉強会その３ on Wed, 14th Sep 2011 3:24 pm
[...] On the difference between normalization and standardization: http://www.dataminingblog.com/standardizationvsnormalization/ [...]

Mrinal on Wed, 14th Dec 2011 11:05 am
Good discussion. However, the fact is... now I am more confused. LOL!

silvina on Sat, 9th Jun 2012 9:16 pm
Would detrending be standardization, normalization, or neither?

Ravi on Sun, 10th Jun 2012 1:19 am
Can you please give two good, basic examples of the normalization and standardization techniques?

Rahul Kulkarni on Mon, 30th Jul 2012 8:07 am
Are standardization and the z-transform the same thing?

Abraham Aldaco on Fri, 3rd Aug 2012 1:20 am
Is there a scientific journal paper (reference) about this topic, standardization and normalization?

Martin Hlosta on Wed, 26th Sep 2012 10:59 am
Rahul Kulkarni: Yes, it's the same.

Arunava on Mon, 29th Oct 2012 6:39 pm
Hi all,
Apart from standardization and normalization, certain applications require each sample to be divided by its norm to get unit magnitude, mostly in PCA, LDA, sparse coding, etc.
What is the logic behind this? Also, in such cases, do we need to first apply normalization and then divide by the norm?
Thanks,
Arunava

Steven on Thu, 6th Dec 2012 10:46 pm
Great article. I actually have a problem and was wondering if the data should be normalized, and figured I should reach out here.
I have 15 products: a, b, c, d, e, etc. They were all sent to different numbers of customers and received different numbers of positive responses. The setup in Excel would be:

Product   Amount Sent   Positive Response
A         1001          300
B         395           210
C         2399          295
etc.

I was wondering what would be the best way to scale and/or normalize the data so I can draw some conclusions. Any advice? Thanks for the help.

TonyD on Sun, 6th Jan 2013 12:51 am
Just read a great blog post on the distinction between normalization and standardization. I'd used both in the past, but never really understood the distinction.
Hope the link isn't automatically removed:
http://stats.stackexchange.com/questions/10289/whatsthedifferencebetweennormalizationandstandardization

Sandro Saitta on Mon, 7th Jan 2013 9:20 am
Thanks for the link, Tony! Interesting discussion there.

DAVID OMBUNI on Thu, 8th Aug 2013 7:44 am
Please send me the importance of standardizing data.

David Garcia on Fri, 20th Sep 2013 10:23 am
Very interesting thread.
I have a question. If one has a large collection of data with several attributes (let's say 40), each of which follows a different distribution (Gaussian or not), does it make any sense to mix normalization and standardization in the same dataset? That is, standardize those attributes that follow a Gaussian distribution and normalize the rest (after removing possible outliers)?
I presume it may distort the correlation between attributes and therefore corrupt the results. If that is the case, does one have to choose between normalization and standardization for the whole dataset?
Thanks in advance.

Sandro Saitta on Sat, 21st Sep 2013 7:01 pm
@David: Very interesting question! In data mining, normalization/standardization is often used to allow data of different scales (units) to be comparable by a classification/regression algorithm. So, if you mix both of them, you lose this benefit. Attributes that have been standardized may have much more impact than the normalized ones, just because of their ranges.

Data Mining Research in 2013 | Data Mining Research | www.dataminingblog.com on Sat, 4th Jan 2014 7:18 pm
[...] Standardization vs. normalization [...]

Data Mining Research in 2013 | DataMining4U on Tue, 7th Jan 2014 5:43 am
[...] Standardization vs. normalization [...]

Data Mining Research in 2013 | Bugdaddy on Sat, 1st Feb 2014 10:15 am
[...] Standardization vs. normalization [...]

Jim on Mon, 26th May 2014 9:09 pm
The following statement in the original post is not correct: "When using standardization, you make an assumption that your data have been generated with a Gaussian law (with a certain mean and standard deviation)."
The mean and standard deviation are definable for any continuous-variable distribution, and therefore, regardless of the true underlying distribution, standardization will yield data having a mean of 0 and a standard deviation of 1. There is no assumption that the underlying distribution is Gaussian.
Of course, if the data are not Gaussian, then you could not, for example, make the claim that approximately 95% of the standardized values should fall between -2 and +2; this would be true only if the population being sampled is normally distributed.
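As a closing illustration of the robust alternatives discussed in the comments above (median/IQR scaling, and winsorizing by percentile clamping), here is a small NumPy sketch; the numbers are invented for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])  # toy data with one outlier

# Robust scaling: subtract the median and divide by the interquartile range
q25, q75 = np.percentile(x, [25, 75])
x_robust = (x - np.median(x)) / (q75 - q25)

# Winsorizing: clamp everything outside the 5th-95th percentile range
low, high = np.percentile(x, [5, 95])
x_wins = np.clip(x, low, high)

print(x_robust)
print(x_wins)  # the outlier 100.0 is replaced by the 95th percentile value
```

Note that np.clip performs exactly the clamping described above: extreme values are replaced by the percentile limits rather than discarded.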