Missing answers

Published

September 10, 2022

Amaya’s company surveyed a large group of people. After collecting and digitizing all of the data, she noticed that many participants didn’t answer one particular question.

It was clear to Amaya that not disclosing the answer to the question was as important as the answer itself, so she wanted to consider this when preparing the data.

Amaya is ready to build a machine learning model but must first deal with the missing answers.

Which of the following should be the best strategy that Amaya should pursue?

Amaya should remove the column that contains the missing values.
Amaya should add a new column to the dataset to flag rows with missing values and then replace those values with a reasonable answer.
Amaya should replace the missing values with a reasonable answer.
Amaya should predict the missing values using a separate machine learning model.

Expand to see the answer

Amaya knows she must do something with the missing answers before training a machine learning model, and imputing these values seem like a good solution. For example, she could replace the missing values with their mean, median, or mode.

But this approach has a problem: Amaya thinks having many participants not disclose the answer to the question is as important as the answer itself. If she replaces the missing values, she will lose that information.

A great approach to avoid losing that information is to add a new binary column to the dataset. This column will flag every participant that didn’t answer the question. Assuming avoiding to answer is not a random event, this new column will help with the predictive performance of Amaya’s model.

Therefore, the best strategy for Amaya is to add this new column right before imputing missing values.

Recommended reading

“Handling Missing Values” is a Kaggle notebook that goes over a few strategies to handle missing values including adding an extra column to flag these rows.
Check “Imputation of missing values” for some information about how Scikit-Learn handles missing values.
“How to Handle Missing Data with Python” is an excellent post discussing different strategies in Python.