How do I deal with a large amount of missing values in a data set without dropping them?
I am trying to build a binary classification model that predicts whether a patient will be infected with a certain disease by the end of the hospital stay. The features are the results of various standard medical tests. The problem is that almost all of these features have around 60% to 80% missing values, because not every test is relevant for every patient. How do I deal with the missing values, given that dropping them is not an option here?
Also, since the medical test results lie on a scale ranging from low to high, should I convert them to categorical variables with the levels High, Medium, Low and Null (for missing data), based on the standard medical test ranges? Or would it be helpful to replace the missing values with a measure of central tendency?
classification python missing-data feature-engineering
asked 8 hours ago by Krantikari (new contributor)
2 Answers
Because NA values are informative in your dataset, you don't want to drop NAs or impute values. If a patient doesn't get an X-ray, they probably didn't break a bone.
So you want to learn from the NA values. A common approach is to add an indicator column for NA values.
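The indicator-column idea can be sketched in a few lines of pandas. This is only an illustration under assumptions: the column names `crp` and `wbc`, the values, and the helper `add_missing_indicators` are all made up, and NaN is taken to mean "test not ordered".

```python
import numpy as np
import pandas as pd

# Hypothetical lab results; NaN means the test was not ordered.
labs = pd.DataFrame(
    {
        "crp": [12.0, np.nan, 3.5, np.nan],
        "wbc": [np.nan, 7.2, np.nan, 11.4],
    }
)


def add_missing_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with one 0/1 '<col>_missing' column per original column."""
    out = df.copy()
    for col in df.columns:
        # 1 where the test result is absent, 0 where it was measured.
        out[f"{col}_missing"] = df[col].isna().astype(int)
    return out


with_flags = add_missing_indicators(labs)
print(with_flags)
```

The original columns are left untouched here, so the NaN values still have to be handled separately before most models will accept the data.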
OK, but how do I replace the NA values in the original column? – Krantikari, 8 hours ago
You don't want to use values in the range of possible test results. I'm guessing -999 will work, but this is data-dependent. – damerdji, 7 hours ago
I should add that if you make that change, only nonlinear models will learn effectively. – damerdji, 7 hours ago
Tree-based models should work fine, right? – Krantikari, 7 hours ago
Yup! Tree-based models are nonlinear. – damerdji, 7 hours ago
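A minimal sketch of the sentinel replacement discussed in the comments, assuming pandas and made-up lab columns. The helper `fill_with_sentinel` is hypothetical. It checks the sentinel only against the observed range, which is just a proxy for the range of possible test results, so the check is a safeguard rather than a guarantee.

```python
import numpy as np
import pandas as pd


def fill_with_sentinel(df: pd.DataFrame, sentinel: float = -999.0) -> pd.DataFrame:
    """Replace NaN with a sentinel, refusing sentinels that fall inside the observed data range."""
    lo, hi = df.min().min(), df.max().max()  # NaN is skipped by default
    if lo <= sentinel <= hi:
        raise ValueError(f"sentinel {sentinel} lies inside the observed range [{lo}, {hi}]")
    return df.fillna(sentinel)


# Hypothetical lab results; NaN means the test was not ordered.
labs = pd.DataFrame({"crp": [12.0, np.nan, 3.5], "wbc": [np.nan, 7.2, 11.4]})
filled = fill_with_sentinel(labs)
print(filled)
```

A tree-based learner can isolate the filled rows with a single threshold placed between the sentinel and the real values, whereas a single linear coefficient would treat -999 as an extremely low test result, which is presumably the reason for the remark about nonlinear models.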
A linear mixed-effects model would allow you to include individuals with missing data without converting everything to categories. Whenever you have a continuous variable, use it as a continuum if at all possible.
Here is a link to a paper that explains more about why. It is not just for psychologists; the same applies here because the arguments are based on math, not opinion. https://www.researchgate.net/publication/282351876_The_problem_with_categorical_thinking_by_psychologists
If you have data on a set of known cases that you can use to build the model, use a logistic generalized linear mixed-effects model (logistic GLMM). In R it is available in the lme4 package through the `glmer()` function (generalized linear mixed-effects regression). You may also want to look into signal detection theory, as it may help you here. With a logistic GLMM you can put an individual patient's information into the model, and it will give you the odds of that patient having or not having the outcome. Just be careful to add only relevant variables to your model: with too many predictors, the model will not generalize well to new patients who were not used to fit it. To guard against this, if you have enough data, split it at random into two data sets, fit the model on one, and then see how well it predicts the other by comparing the Akaike Information Criterion and the Bayesian Information Criterion. Bootstrapping may also help.
GLMMs and LMMs in general deal very well with missing data. Unlike a traditional logistic regression, LMMs do not assume equal cell sizes. Don't be fooled if someone says that ANOVA/regression is robust to violations of its assumptions, especially when the cell sizes are unequal. They haven't done their homework and are just parroting what they heard in grad school. The math on that is clear.
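The random split into a fitting set and a hold-out set suggested above could look roughly like this in Python, using only the standard library. The helper `split_rows` and the patient identifiers are hypothetical. The sketch splits by patient, on the assumption that all rows belonging to one patient should end up on the same side of the split.

```python
import random


def split_rows(rows, test_fraction=0.5, seed=0):
    """Shuffle a copy of rows and split it into (fit, holdout) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(round(len(shuffled) * test_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Hypothetical patient identifiers standing in for full records.
patients = [f"patient_{i:02d}" for i in range(10)]
fit_set, holdout_set = split_rows(patients, test_fraction=0.5, seed=42)
print(len(fit_set), len(holdout_set))
```

The model would then be fitted on `fit_set` only and evaluated on `holdout_set`; the model fitting itself is left out of this sketch.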
Answer 1 (NA indicator column): answered 8 hours ago by damerdji
Answer 2 (mixed-effects models): answered 7 hours ago by Thomas Wukitsch, edited 5 hours ago