Standardization vs. normalization

In the overall knowledge discovery process, data preprocessing plays a crucial role before data mining itself. One of the first steps is the normalization of the data. This step is very important when dealing with parameters of different units and scales. For example, many data mining techniques use the Euclidean distance, so all parameters should have the same scale for a fair comparison between them.

Two methods are well known for rescaling data. The first is normalization, which scales all numeric variables into the range [0, 1]. One possible formula is the min-max rescaling given below:

    x_new = (x - x_min) / (x_max - x_min)

On the other hand, you can use standardization on your data set. It will then transform the data to have zero mean and unit variance, for example using the equation below (where μ is the mean and σ the standard deviation):

    x_new = (x - μ) / σ
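
As an illustration, both rescalings can be written in a few lines of Python/NumPy (a minimal sketch, applied column-wise and assuming a 2-D array of samples by features; constant columns would need special handling):

    import numpy as np

    def normalize(X):
        # Min-max normalization: rescale each column to [0, 1].
        x_min, x_max = X.min(axis=0), X.max(axis=0)
        return (X - x_min) / (x_max - x_min)

    def standardize(X):
        # Z-score standardization: zero mean, unit variance per column.
        return (X - X.mean(axis=0)) / X.std(axis=0)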

Both of these techniques have their drawbacks. If you have outliers in your data set, normalizing your data will certainly scale the “normal” data to a very small interval, and most data sets do contain outliers. When using standardization, you assume that your data have been generated by a Gaussian distribution (with a certain mean and standard deviation). This may not be the case in reality.

So my question is: what do you usually use when mining your data, and why?

Note: Thanks to Benny Raphael for fruitful discussions on this topic.


25 Comments on Standardization vs. normalization

  1. Fay on Thu, 12th Jul 2007 2:47 pm

    Sometimes we can take logarithms of the input data when it contains values that differ by orders of magnitude. However, since logarithms are defined for positive values only, we need to take care when the input data may contain zero or negative values.
    You did very good work on your blog! :)
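
    (As a small illustration of Fay's point, assuming NumPy: log1p handles zeros, and a sign factor extends the transform to negative values. A minimal sketch:)

        import numpy as np

        def signed_log(x):
            # log(1 + |x|) with the original sign restored; defined for
            # zero and negative values, unlike a plain logarithm.
            return np.sign(x) * np.log1p(np.abs(x))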

  2. Will Dwinnell on Thu, 12th Jul 2007 11:00 pm

    A few points come to mind:

    1. Monotonic scaling of the data (assuming that distinct values are not collapsed) will have no effect on the most common logical learning algorithms (tree- and rule-induction algorithms).

    2. There are robust alternatives, such as subtracting the median and dividing by the IQR, or scaling linearly so that the 5th and 95th percentiles meet some standard range (see the sketch below).

    3. Outliers (and, technically, high leverage points) present an interesting challenge. One possibility is to Winsorize the data after scaling it.
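
    (A possible NumPy sketch of the two robust alternatives in point 2; the percentile choices and target range are illustrative assumptions:)

        import numpy as np

        def robust_scale(x):
            # Subtract the median and divide by the interquartile range.
            q1, q3 = np.percentile(x, [25, 75])
            return (x - np.median(x)) / (q3 - q1)

        def percentile_scale(x, low=5, high=95, target=(0.0, 1.0)):
            # Linearly map the chosen percentiles onto a standard range;
            # this works for negative as well as positive values.
            p_low, p_high = np.percentile(x, [low, high])
            a, b = target
            return a + (x - p_low) * (b - a) / (p_high - p_low)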

  3. Sandro Saitta on Fri, 13th Jul 2007 3:46 pm

    Thanks for your comment, Fay. I agree with you on taking the log. I used to work with data in the range 10^6 to 10^12, for example. And thanks for the remark :-)

    Will, your suggestions seem very interesting. I don’t know the “Winsorize” technique, but it seems it could be used in addition to normalization.

  4. Will Dwinnell on Fri, 13th Jul 2007 6:44 pm

    For readers who are not aware of this technique: “Winsorizing” data simply means clamping the extreme values.

    This is similar to trimming the data, except that instead of discarding data, values greater than the specified upper limit are replaced with the upper limit, and those below the lower limit are replaced with the lower limit. Often, the specified range is indicated in terms of percentiles of the original distribution (like the 5th and 95th percentiles).

    This process is sometimes used to make conventional measures more robust, as in the Winsorized variance.
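
    (A minimal sketch of Winsorizing with NumPy, clamping at assumed 5th/95th percentiles; SciPy also ships a ready-made scipy.stats.mstats.winsorize:)

        import numpy as np

        def winsorize(x, low=5, high=95):
            # Replace values beyond the chosen percentiles with the
            # percentile values themselves, rather than discarding them.
            lo, hi = np.percentile(x, [low, high])
            return np.clip(x, lo, hi)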

  5. mb on Fri, 30th Nov 2007 1:23 am

    Will, can you tell me how I can scale linearly so that the 5th and 95th percentiles meet some standard range?
    Can this be done with both negative and positive values?

    Another question:
    If I want to compute an index where not only the units and scales differ, but the input metrics also have different interpretations – specifically, one metric is better if its values are higher and another is better if its values are lower – how can I compute an index that represents all the numbers concisely and meaningfully?
    Let’s say I have Expenses ($), Profits ($) and Turnover (%). Expenses and Turnover are better if lower, but Profits are better if higher.
    If I am comparing two companies on these metrics and want to compute one index to show the “best” performing company on these parameters, how can I do this?
    Sorry, not strictly data-mining relevant, but I thought someone here might have an answer!

    I tried using z-scores and normalizing, but it doesn’t work due to the different high-low interpretations.
    Eventually I used a reverse rank for Expenses and Turnover so that all metrics have the same order. However, rank does not show the quantity difference between the two companies, just their ranks!

    This is a great blog; thanks to all for the helpful comments.

  6. Sandro Saitta on Thu, 6th Dec 2007 10:26 am

    First, you can normalize/standardize your data. Or, on the contrary, you may decide to manually assign weights to each of these metrics.

    You can, for example, use an objective function. Let’s say you want to maximize a function of the Expenses, Profits and Turnover. In the objective function, give a negative weight to Expenses and Turnover and a positive one to Profits. I don’t know if this will work for your problem, but that would be my first guess.
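
    (One possible reading of this suggestion as NumPy code; the equal weight magnitudes are an assumption, not a recommendation:)

        import numpy as np

        def zscore(m):
            # Standardize a metric to zero mean and unit variance.
            m = np.asarray(m, dtype=float)
            return (m - m.mean()) / m.std()

        def composite_index(expenses, profits, turnover,
                            weights=(-1.0, 1.0, -1.0)):
            # Negative weights for "lower is better" metrics (Expenses,
            # Turnover), positive for "higher is better" (Profits).
            w_e, w_p, w_t = weights
            return (w_e * zscore(expenses) + w_p * zscore(profits)
                    + w_t * zscore(turnover))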

  7. shrawan bhattacharjee on Tue, 14th Sep 2010 12:19 pm

    Hi Sandro,
    Interesting article and comments. I am dealing with the analysis of mass spectrometry data. This kind of data suffers significant variation due to instrumental errors and limitations (even if the same sample is analyzed). Presently I am using a log transformation, which is giving satisfactory results. But I am still skeptical about the possibility of false positives, as the data is in the range of 10^3 to 10^6. What do you suggest as the best method of normalizing this kind of data?

    I am also confused by the two terms you mentioned, ‘standardization’ and ‘normalization’: which should be used for mass spectrometry data analysis? I have only found research articles mentioning normalization for this kind of data, not standardization, although when I explored the internet both techniques were referred to on similar grounds.
    What are your views on this query?

  8. Sandro Saitta on Tue, 4th Jan 2011 6:07 pm

    @shrawan: If you know that your data contains several outliers, then I guess you should rather try standardization.

  9. [...] On the difference between normalization and standardization: http://www.dataminingblog.com/standardization-vs-normalization/ [...]

  10. Mrinal on Wed, 14th Dec 2011 11:05 am

    Good discussion. However, the fact is… now I am more confused. LOL!

  11. silvina on Sat, 9th Jun 2012 9:16 pm

    Would detrending be standardization, normalization, or neither?

  12. Ravi on Sun, 10th Jun 2012 1:19 am

    Can you please give two good basic references on the normalization and standardization techniques?

  13. Rahul Kulkarni on Mon, 30th Jul 2012 8:07 am

    Are standardization and the z-transform the same thing?

  14. Abraham Aldaco on Fri, 3rd Aug 2012 1:20 am

    Is there a scientific journal paper (reference) about this topic of standardization and normalization?

  15. Martin Hlosta on Wed, 26th Sep 2012 10:59 am

    Rahul Kulkarni: Yes, it’s the same.

  16. Arunava on Mon, 29th Oct 2012 6:39 pm

    Hi all,

    Apart from standardization and normalization, certain applications require each sample to be divided by its norm to get unit magnitude, mostly in PCA, LDA, sparse coding, etc.

    What is the logic behind this? Also, in such cases, do we need to first apply normalization and then divide by the norm?

    Thanks,
    Arunava
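
    (What Arunava describes is often called unit-norm or L2 scaling; a minimal per-sample sketch, assuming rows are samples:)

        import numpy as np

        def unit_norm(X, eps=1e-12):
            # Divide each sample (row) by its Euclidean norm so that
            # every sample has unit magnitude; eps guards against
            # division by zero for all-zero rows.
            norms = np.linalg.norm(X, axis=1, keepdims=True)
            return X / np.maximum(norms, eps)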

  17. Steven on Thu, 6th Dec 2012 10:46 pm

    Great article. I actually have a problem and was wondering if the data should be normalized, so I figured I should reach out here.
    I have 15 products: A, B, C, D, E, etc. They were all sent to different numbers of customers and received different numbers of positive responses. The setup in Excel would be:

        Product   Amount Sent   Positive Responses
        A         1001          300
        B         395           210
        C         2399          295
        ...

    I was wondering what would be the best way to scale and/or normalize the data so I can draw some conclusions. Any advice? Thanks for the help.

  18. TonyD on Sun, 6th Jan 2013 12:51 am

    Just read a great blog post on the distinction between normalization and standardization. I’d used both in the past, but never really understood the distinction.

    Hope the link isn’t automatically removed:
    http://stats.stackexchange.com/questions/10289/whats-the-difference-between-normalization-and-standardization

  19. Sandro Saitta on Mon, 7th Jan 2013 9:20 am

    Thanks for the link, Tony! Interesting discussion there.

  20. DAVID OMBUNI on Thu, 8th Aug 2013 7:44 am

    Please send me information on the importance of standardizing data.

  21. David Garcia on Fri, 20th Sep 2013 10:23 am

    Very interesting thread.

    I have a question. Suppose one has a large collection of data with several attributes (let’s say 40), each of which follows a different distribution (Gaussian or not).

    Does it make any sense to mix normalization and standardization in the same dataset? That is, to standardize those attributes that follow a Gaussian distribution and normalize the rest (after removing possible outliers).

    I presume it may distort the correlation between attributes and therefore corrupt the results. If that’s the case, does one have to choose between normalization and standardization for the whole dataset?

    Thanks in advance.

  22. Sandro Saitta on Sat, 21st Sep 2013 7:01 pm

    @David: Very interesting question! In data mining, normalization/standardization is often used to make data of different scales (units) comparable for a classification/regression algorithm. So if you mix the two, you lose this benefit. Attributes that have been standardized may have much more impact than the normalized ones, simply because of their ranges.
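
    (A quick numeric illustration of the range mismatch Sandro describes; the simulated data are arbitrary:)

        import numpy as np

        rng = np.random.default_rng(0)
        a = rng.normal(size=1000)    # attribute we standardize
        b = rng.uniform(size=1000)   # attribute we normalize

        a_std = (a - a.mean()) / a.std()              # roughly [-3, 3]
        b_norm = (b - b.min()) / (b.max() - b.min())  # exactly [0, 1]

        # The standardized attribute spans about six units, the normalized
        # one exactly one, so the former can dominate distance computations.
        print(a_std.min(), a_std.max(), b_norm.min(), b_norm.max())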

  23. [...] Standardization vs. normalization [...]

  24. Data Mining Research in 2013 | DataMining4U on Tue, 7th Jan 2014 5:43 am

    [...] Standardization vs. normalization [...]

  25. Data Mining Research in 2013 | Bugdaddy on Sat, 1st Feb 2014 10:15 am

    [...] Standardization vs. normalization [...]
