How do I deal with a large amount of missing values in a data set without dropping them?


I am trying to build a binary classification model that predicts whether a patient will be infected with a certain disease by the end of their hospital stay. The features are the results of various standard medical tests. The problem is that almost all of these results have around 60% to 80% missing values, because not all tests are relevant for all patients. How do I deal with the missing values, given that dropping them is not an option? Also, since the test results lie on a scale ranging from low to high, should I convert them to categorical variables with levels High, Medium, Low, and Null (for missing data), based on the standard medical test ranges? Or would it be helpful to replace the missing values with a measure of central tendency?

asked 8 hours ago

Krantikari

161 bronze badge

New contributor

classification python missing-data feature-engineering


2 Answers

Because NA values are informative in your dataset, you don't want to drop NAs or impute values. If a patient doesn't get an X-ray, they probably didn't break a bone.

So you want to learn from the NA values. A common approach is to add an indicator column for each feature that flags whether its value is NA.
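
As a minimal sketch of the indicator-column idea, assuming the data sits in a pandas DataFrame: the column names `crp` and `wbc`, the values, and the helper `add_missing_indicators` are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd


def add_missing_indicators(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy of df with one 0/1 '<col>_missing' flag per listed column."""
    out = df.copy()
    for col in columns:
        # 1 where the test result is absent, 0 where it was recorded.
        out[f"{col}_missing"] = out[col].isna().astype(int)
    return out


# Hypothetical lab results; NaN means the test was never ordered.
labs = pd.DataFrame({
    "crp": [5.0, np.nan, 12.5, np.nan],
    "wbc": [np.nan, 7.2, np.nan, 11.0],
})

flagged = add_missing_indicators(labs, ["crp", "wbc"])
```

The original columns are left untouched here, so the NaN entries still need a placeholder or a model that accepts NaN before fitting.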

answered 8 hours ago

damerdji

714 bronze badges

New contributor

OK but how do I replace the NA values in the original column?
– Krantikari
8 hours ago

You don't want to use values in the range of possible test results. I'm guessing -999 will work, but this is data-dependent.
– damerdji
7 hours ago

I should add that if you make that change, only nonlinear models will learn effectively.
– damerdji
7 hours ago

Tree based models should work fine right?
– Krantikari
7 hours ago

Yup! Tree-based models are nonlinear.
– damerdji
7 hours ago
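
The point in this exchange about sentinel values and linear models can be illustrated with a small sketch. The data below are made up, and -999 is assumed to lie far outside the range of real test results. A straight-line fit stands in for a linear model, and a single threshold comparison stands in for one split of a tree.

```python
import numpy as np

SENTINEL = -999.0  # assumed to lie far outside the range of real test results

# Hypothetical data: the outcome rises with the test value when the test was
# run, and is low when the test was never ordered (value missing).
x = np.array([1.0, 2.0, 3.0, 4.0, np.nan, np.nan])
y = np.array([2.0, 4.0, 6.0, 8.0, 0.0, 0.0])

x_filled = np.where(np.isnan(x), SENTINEL, x)

# Linear fit on the observed rows only: recovers the slope of 2.
observed = ~np.isnan(x)
slope_observed, _ = np.polyfit(x[observed], y[observed], 1)

# Linear fit on the sentinel-filled column: the slope is dragged toward zero
# because the line must also pass near the far-away sentinel points.
slope_filled, _ = np.polyfit(x_filled, y, 1)

# A single threshold split, the building block of a tree, separates the
# sentinel rows from the real measurements exactly.
is_sentinel_side = x_filled < SENTINEL / 2
```

With the sentinel in place, the fitted line no longer reflects the relationship among the measured values, while a threshold split can set the sentinel rows aside and keep splitting on the real measurements.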


A linear mixed-effects model would let you include individuals with missing data without converting everything to categories. Whenever you have a continuous variable, treat it as continuous if at all possible.
Here is a link to a paper that explains more about why. It is not just for psychologists; the same applies here because the arguments are based on math, not opinion. https://www.researchgate.net/publication/282351876_The_problem_with_categorical_thinking_by_psychologists

If you have data on a set of known cases that you can use to build the model, use a logistic generalized linear mixed-effects model, also known as a logistic GLMM. In R it is in the lme4 package and is fitted with the glmer function (generalized linear mixed-effects regression). You may also want to look into signal detection theory, as it may help you here. With a logistic GLMM you can enter an individual patient's information into the model and it will give you the odds of them having or not having the outcome. Just be careful to add only relevant variables to your model. With too many predictors, the model will not generalize well to new patients who were not used to fit it. To remedy this, if you have enough data, split it at random into two data sets, fit the model on one, and then see how well it predicts the other by comparing the Akaike Information Criterion and the Bayesian Information Criterion. Bootstrapping may also help.

GLMMs and LMMs in general deal very well with missing data. Unlike a traditional logistic regression, LMMs do not assume equal cell sizes. Don't be fooled if someone says that ANOVA/regression is robust to violations of its assumptions, especially if the cell sizes are unequal. They haven't done their homework and are just parroting what they heard in grad school. The math on that is clear.
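
A minimal sketch of the random split and the two information criteria mentioned above, written in Python rather than R. The helper names `random_split`, `aic`, and `bic` are hypothetical. The model itself would be fitted elsewhere (for example with glmer), and the log-likelihood and parameter count passed to the helpers are assumed to be whatever that fit reports.

```python
import math

import numpy as np


def random_split(n_rows: int, train_fraction: float = 0.5, seed: int = 0):
    """Return disjoint (train_idx, test_idx) index arrays chosen at random."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    cut = int(n_rows * train_fraction)
    return order[:cut], order[cut:]


def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Split ten hypothetical patients into two halves at random.
train_idx, test_idx = random_split(10)
```

Lower values of either criterion indicate a better trade-off between fit and number of parameters; BIC penalizes extra predictors more heavily once the sample has more than a handful of observations.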

edited 5 hours ago

answered 7 hours ago

Thomas Wukitsch

565 bronze badges
