What should be done first, handling missing data or dealing with data types?

In data science, which should come first: handling missing data or handling data types? I am asking because I run into a problem either way:

1) Handling missing data first, then data types: it is difficult to handle missing data in categorical columns.

2) Handling data types first, then missing data: if we create dummy variables for categorical values, imputing the missing entries might assign two different categories to the same row.
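The concern in case 2 can be reproduced with a small sketch. Everything here is an illustrative assumption rather than anything from the post: the toy `pets` column, and the deliberately naive rule that fills each dummy column on its own with 1 when at least half of the observed values are 1. It is not a recommended imputation method, only a way to show how column-by-column filling can break the "exactly one category per row" constraint.

```python
from collections import Counter

# Hypothetical toy column; None marks a missing category.
LEVELS = ["cat", "dog"]
pets = ["cat", "dog", "cat", "dog", None]


def one_hot(value):
    """Dummy-encode one value; a missing value yields missing dummies."""
    if value is None:
        return [None] * len(LEVELS)
    return [1 if value == level else 0 for level in LEVELS]


def impute_columns_independently(rows):
    """Naive rule (assumption): fill each dummy column on its own,
    using 1 if at least half of its observed values are 1."""
    fills = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if row[j] is not None]
        share = sum(observed) / len(observed)
        fills.append(1 if share >= 0.5 else 0)
    return [[fills[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]


def invalid_rows(rows):
    """Indices of rows that do not carry exactly one category."""
    return [i for i, row in enumerate(rows) if sum(row) != 1]


# Case 2: encode first, then impute the dummies column by column.
encoded_then_imputed = impute_columns_independently([one_hot(p) for p in pets])

# Case 1: impute the category itself (most frequent level), then encode.
most_common_pet = Counter(p for p in pets if p is not None).most_common(1)[0][0]
imputed_then_encoded = [one_hot(p if p is not None else most_common_pet)
                        for p in pets]

print(invalid_rows(encoded_then_imputed))  # [4]
print(invalid_rows(imputed_then_encoded))  # []
```

In the first ordering the last row ends up as `[1, 1]`, i.e. both "cat" and "dog" at once; imputing the category before encoding cannot produce such a row.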

asked 8 hours ago

Kiran

data-cleaning

2 Answers
Handle data types first, then perform multiple imputation.

Several solid implementations of multiple imputation using chained equations (MICE) that I can think of permit contingent imputation, where:

- Specific data types produce specific imputation models, so the quality of your imputation depends on handling data types first.

- Interdependence between variables (e.g., mutually exclusive categories) can be modeled explicitly (e.g., using ordered logit or unordered multinomial logit).

- Hard dependencies can be respected (e.g., do not impute both $x$ and $x^2$; instead impute only $x$ using chained equations and simply calculate $x^2$ from the imputed values of $x$, or vice versa).

In this way you can handle missing data for categorical variables alongside continuous or interval variables.
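A minimal sketch of the ordering described above, under loud assumptions: it uses a single mean/mode fill as a stand-in for the per-type models, so it is not MICE and not multiple imputation, and the `rows`, `schema`, `fill_value`, and `impute` names are hypothetical. It only shows the two structural ideas: the declared type selects the imputation rule, and the derived column $x^2$ is computed from the imputed $x$ instead of being imputed separately.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy records; None marks a missing value.
rows = [
    {"x": 1.0, "group": "a"},
    {"x": None, "group": "b"},
    {"x": 3.0, "group": None},
    {"x": 2.0, "group": "b"},
]

# Step 1: settle the data types before touching the missing values.
schema = {"x": "numeric", "group": "categorical"}


def fill_value(values, kind):
    """Pick a fill rule from the declared type (simplified stand-in for
    a per-type imputation model): mean for numeric, mode for categorical."""
    observed = [v for v in values if v is not None]
    if kind == "numeric":
        return mean(observed)
    return Counter(observed).most_common(1)[0][0]


def impute(rows, schema):
    """Step 2: fill each column with the rule its type calls for."""
    fills = {col: fill_value([r[col] for r in rows], kind)
             for col, kind in schema.items()}
    return [{col: (fills[col] if r[col] is None else r[col]) for col in schema}
            for r in rows]


completed = impute(rows, schema)

# Step 3: compute the derived column from the imputed x (hard dependency).
for r in completed:
    r["x_sq"] = r["x"] ** 2

# For contrast: filling x_sq on its own, from the mean of observed squares,
# would give a value that is not the square of the imputed x.
naive_x_sq = mean([r["x"] ** 2 for r in rows if r["x"] is not None])

print(completed[1])  # {'x': 2.0, 'group': 'b', 'x_sq': 4.0}
```

Here the imputed `x` is 2.0, so the derived `x_sq` is 4.0, whereas the separately filled `naive_x_sq` is 14/3 and would contradict the imputed `x`.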

References

Azur, M. J., Stuart, E. A., Frangakis, C., & Leaf, P. J. (2011). Multiple imputation by chained equations: What is it and how does it work? International Journal of Methods in Psychiatric Research, 20(1), 40–49.

White, I. R., Royston, P., & Wood, A. M. (2011). Multiple imputation using chained equations: Issues and guidance for practice. Statistics in Medicine, 30(4), 377–399.

edited 3 hours ago

answered 7 hours ago

Alexis


I don't think it matters which one you do first, as long as your methods are valid. Yes, imputing missing categorical data is difficult. However, if you create your dummy variables and then try to impute, you haven't made your problem any easier, because you now have C variables (where C is the number of categories) to determine.

I think I would personally take the first approach and impute what I can (if imputation is the route you are taking). MICE is a very flexible method for imputing data of different types.

answered 8 hours ago

Emma Jean

