What should be done first, handling missing data or dealing with data types?

In data science, which process should come first: handling missing data or handling data types? I am asking because I run into a problem either way:

1) Handling missing data first, then handling data types: it is difficult to handle missing data in categorical columns.

2) Handling data types first, then handling missing data: if we create dummy variables for the categorical values, then while handling missing data we might assign two different categories to the same row.







      data-cleaning






asked 8 hours ago by Kiran

          2 Answers
Handle data types first, then perform multiple imputation.

Several solid implementations of multiple imputation using chained equations (MICE) permit contingent imputation, where:

• Specific data types produce specific imputation models, so the quality of your imputation depends on handling data types correctly first.

• Interdependence between variables (e.g., mutually exclusive categories) can be modeled explicitly (e.g., using ordered logit or unordered multinomial logit).

• Hard dependencies can be respected (e.g., do not impute both $x$ and $x^2$; instead impute only $x$ using chained equations and simply calculate $x^2$ from the imputed values of $x$, or vice versa).

In this way you can handle missing data for categorical variables alongside continuous or interval variables.
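To make the hard-dependency point concrete, here is a minimal Python sketch. It is not MICE: a plain mean fill stands in for whatever chained-equation model a real implementation would fit, and the toy column and the helper `mean_impute` are made up for illustration. It only shows why $x^2$ should be calculated from the imputed $x$ rather than imputed on its own.

```python
from statistics import mean

# Hypothetical toy column; None marks a missing value.
x = [1.0, 2.0, None, 4.0, None]
x_sq = [v * v if v is not None else None for v in x]


def mean_impute(values):
    """Fill None with the mean of the observed values.

    This is only a stand-in for a real imputation model.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Naive route: impute x and x^2 independently of each other.
x_naive = mean_impute(x)
x_sq_naive = mean_impute(x_sq)

# Passive route: impute x only, then derive x^2 from the imputed x.
x_imp = mean_impute(x)
x_sq_passive = [v * v for v in x_imp]

# Rows where the stored square no longer matches the square of x.
inconsistent_rows = [
    i for i, (a, b) in enumerate(zip(x_naive, x_sq_naive))
    if abs(a * a - b) > 1e-9
]
print(inconsistent_rows)  # [2, 4]
```

In the naive route the two imputed rows end up with $x \approx 2.33$ but $x^2 = 7$, which is impossible. In the passive route every row satisfies the relationship by construction.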



                  References



                  Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40–49.



                  White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399.







answered 7 hours ago by Alexis (edited 3 hours ago)


























I don't think it matters which one you do first, as long as your methods are valid. Yes, imputing missing categorical data is difficult. However, if you create your dummy variables and then try to impute, you haven't made your problem any easier, because you now have C variables (one per category) to determine.
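A small Python sketch of the problem raised in the question, where filling dummy columns one at a time can put two categories in the same row. The toy column, the 0.5 threshold, and the mode fill are assumptions chosen only to make the effect visible. They are not a recommended imputation method.

```python
from collections import Counter

# Hypothetical toy column; None marks a missing value.
colors = ["red", "blue", None, "blue", "red"]
levels = sorted({c for c in colors if c is not None})  # ['blue', 'red']


def one_hot(value):
    """Encode one category as 0/1 dummies, one per level."""
    return [1 if value == level else 0 for level in levels]


# Route A: impute the categorical column (mode fill), then create dummies.
mode = Counter(c for c in colors if c is not None).most_common(1)[0][0]
route_a = [one_hot(mode if c is None else c) for c in colors]

# Route B: create dummies first, then fill each dummy column on its own.
route_b = [
    one_hot(c) if c is not None else [None] * len(levels) for c in colors
]
for j in range(len(levels)):
    observed = [row[j] for row in route_b if row[j] is not None]
    # Illustrative rule: fill with 1 when at least half the observed are 1.
    fill = 1 if sum(observed) / len(observed) >= 0.5 else 0
    for row in route_b:
        if row[j] is None:
            row[j] = fill

print([sum(row) for row in route_a])  # [1, 1, 1, 1, 1]
print([sum(row) for row in route_b])  # [1, 1, 2, 1, 1]
```

Route A always yields exactly one category per row. In route B the imputed row is flagged as both blue and red, because nothing ties the C dummy columns together.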



Personally, I would take the first approach and impute what I can (if imputation is the route you are taking). MICE is a very flexible method for imputing data of different types.
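As a rough illustration of imputing columns of different types before any dummy coding, here is a sketch in plain Python. It performs single imputation with a mean or mode fill, which is far cruder than MICE. The column names, the `kinds` mapping, and the `impute` helper are hypothetical.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy data; None marks a missing value.
data = {
    "age": [34.0, None, 51.0, 28.0],
    "city": ["Oslo", "Lima", None, "Lima"],
}
# Assumed declaration of each column's type.
kinds = {"age": "numeric", "city": "categorical"}


def impute(column, kind):
    """Fill None using a rule matched to the column type."""
    observed = [v for v in column if v is not None]
    if kind == "numeric":
        fill = mean(observed)  # stand-in for a regression model
    else:
        # Most frequent level, a stand-in for a logit model.
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in column]


clean = {name: impute(col, kinds[name]) for name, col in data.items()}
print(clean["city"])  # ['Oslo', 'Lima', 'Lima', 'Lima']
```

The categorical column stays a single column throughout, so dummy variables created afterwards describe exactly one category per row.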







answered 8 hours ago by Emma Jean























