Data standardization vs. normalization for clustering analysis



I'm performing clustering analysis and visualization (hierarchical clustering, PCA, t-SNE, etc.) on a dataset, and I'm a bit confused about how to prepare the data. I understand that the typical options are to standardize, normalize, or log-transform, but there seem to be no hard and fast rules about when to apply one over the other.



With standardization and log transformation, my dataset splits into two clusters under a number of different algorithms. One cluster is large and heterogeneous (which is actually interesting, as this is a biological problem and it makes logical sense). However, if I normalize the data, I get three clusters: the heterogeneous cluster splits into two. This could make sense as well, but it would be a stretch, and the clusters are not as clean. What could be causing this? The non-heterogeneous cluster remains the same, which is reassuring. Is it reasonable to conclude that the "instability" of the second cluster is further evidence of the heterogeneity in the dataset?
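As a point of reference, here is a minimal sketch of the three preparation options named in the question. It assumes that "normalize" means min-max scaling to [0, 1], which the post does not spell out (one of the answers reads it as L2 normalization instead). The function names and the sample values are made up for illustration.

```python
import math
import statistics


def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_normalize(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def log_transform(values):
    """Natural log; only defined for strictly positive values."""
    return [math.log(v) for v in values]


# Made-up feature spanning several orders of magnitude.
feature = [1.0, 10.0, 100.0, 1000.0]

print(standardize(feature))
print(min_max_normalize(feature))
print(log_transform(feature))
```

Standardization and min-max scaling are both linear rescalings of a single feature, so they change how features compare with each other rather than the shape of any one feature. The log transform is nonlinear and changes the shape itself.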







Tags: machine-learning, clustering, pca






asked 8 hours ago by Elicen

          2 Answers

            There cannot be a general rule on what to do.



Any automatic normalization is usually "wrong". Such methods only happen to work better, in most cases, than not weighting features at all, so people commonly use them, in particular on data they don't understand.
But the right way is to weight and scale features such that they have the right, balanced amount of influence on the results. As there is no mathematical way to capture this "right balance" (it's not uniform!), there cannot be an automatic solution. You have to understand your data and scale each feature to give it the desired amount of influence.
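One way to read "scale each feature to give it the desired amount of influence" is to standardize first and then multiply each feature by a weight chosen from domain knowledge. The sketch below is only an illustration of that reading, not the answerer's method; the helper `weighted_features` and the weights are hypothetical.

```python
import statistics


def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def weighted_features(columns, weights):
    """Standardize each column, then multiply it by its analyst-chosen weight."""
    return [[w * z for z in standardize(col)] for col, w in zip(columns, weights)]


# Two made-up features on very different raw scales.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
]
# Hypothetical weights: the second feature should matter half as much.
weights = [1.0, 0.5]

scaled = weighted_features(columns, weights)
for col in scaled:
    print(statistics.pstdev(col))  # each column's spread now equals its weight
```

In a squared Euclidean distance, a feature with weight w contributes in proportion to w squared, so the weights should be picked with that in mind.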







answered 7 hours ago by Anony-Mousse

• Anony-Mousse provides a good answer. I'd add that often you are looking for sensible clusters that help the data tell a story. From what you've said in your question, the easier-to-interpret two-cluster solution would seem better. – zbicyclist, 7 hours ago

• Thank you! I guess my question was also: which component of normalizing/standardizing would give rise to these differences in results? Also, the challenge with my dataset is that it is unlabelled (and impossible to label, as is common for biological data), so we are trying to use unsupervised clustering to figure out how many clusters exist. We are currently not weighting the features, but we do include feature selection. – Elicen, 7 hours ago

• Normalizing is usually much worse because of outliers. Standardization is much more robust. – Anony-Mousse, 6 hours ago
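The outlier remark in the last comment can be illustrated with a small sketch. It assumes "normalizing" means min-max scaling, and the data are made up: one feature with 99 ordinary values plus a single outlier, compared against a clean feature without outliers. The helper `bulk_share` is hypothetical.

```python
import statistics


def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_normalize(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def spread(values):
    return max(values) - min(values)


clean = [float(i) for i in range(100)]              # 0..99, no outlier
with_outlier = [float(i) for i in range(99)] + [1000.0]  # 0..98 plus one outlier


def bulk_share(scale):
    """Spread of the 99 ordinary points in the outlier feature, relative to
    the spread of the clean feature, after applying the same scaling to both."""
    scaled_outlier_feature = scale(with_outlier)
    scaled_clean = scale(clean)
    return spread(scaled_outlier_feature[:99]) / spread(scaled_clean)


minmax_share = bulk_share(min_max_normalize)
zscore_share = bulk_share(standardize)
print(minmax_share, zscore_share)
```

In this made-up data the ordinary points keep roughly a tenth of the clean feature's spread under min-max scaling and closer to three tenths under z-scoring. Both scalings are distorted by the outlier; standardization is simply distorted less here, and the gap tends to widen with more samples because the range grows with the outlier itself while the standard deviation grows more slowly.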












I think standard scaling mostly depends on the model being used, while normalization depends on how the data originated.



Most distance-based models, e.g. k-means, need standard scaling so that large-scale features don't dominate the variation. The same goes for PCA.
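To make the "large-scale features dominate" point concrete, the sketch below measures how much of the squared Euclidean distance between two rows comes from each feature, before and after standard scaling. The data (heights in metres, incomes in dollars) and the helper `height_share` are made up for illustration.

```python
import statistics


def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def height_share(heights, incomes, i, j):
    """Fraction of the squared Euclidean distance between rows i and j
    that is contributed by the height feature."""
    dh = (heights[i] - heights[j]) ** 2
    di = (incomes[i] - incomes[j]) ** 2
    return dh / (dh + di)


# Made-up data: two features on very different scales.
heights_m = [1.60, 1.95, 1.61, 1.80]
incomes = [50000.0, 50500.0, 52000.0, 48000.0]

raw_share = height_share(heights_m, incomes, 0, 1)
scaled_share = height_share(standardize(heights_m), standardize(incomes), 0, 1)
print(raw_share, scaled_share)
```

On the raw values the height difference is invisible next to the income difference, even though rows 0 and 1 differ a lot in height and little in income. After standard scaling, height accounts for most of the distance between those two rows.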



As for normalization, it mostly depends on the data. For example, if you have sensor data (each time step being a variable) with different scaling, you need to L2-normalize the data to bring them onto the same scale. Or, if you are working on customer recommendation and your entries are the number of times each customer bought each item (items being variables), you might need to L2-normalize the item counts if you don't want people who buy a lot to skew the features.
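The customer example can be sketched as follows, assuming the normalization is applied per customer (per row), which is one plausible reading of the paragraph above. The purchase counts and the helper `l2_normalize` are made up for illustration.

```python
import math


def l2_normalize(row):
    """Divide a row by its Euclidean (L2) norm so it has unit length.
    An all-zero row is returned unchanged to avoid dividing by zero."""
    norm = math.sqrt(sum(v * v for v in row))
    if norm == 0.0:
        return list(row)
    return [v / norm for v in row]


# Made-up purchase counts per item for two customers with the same
# preferences but very different purchase volumes.
heavy_buyer = [10, 20, 0]
light_buyer = [1, 2, 0]

print(l2_normalize(heavy_buyer))
print(l2_normalize(light_buyer))
```

After row-wise L2 normalization the two customers map to the same point, so a distance-based method would group them by what they buy rather than by how much they buy.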



Personally, I think that if the variables are well defined, taking their log might cost you interpretability. So if you get good-looking clusters without the log transform, I'd stick with that.







answered 7 hours ago by aghd
