I started learning about data using SPSS and working with University questionnaires from both undergraduate and post-graduate projects. Questionnaires were filled on paper and I had to hardcode the responses into SPSS.
NB — Hardcoding is manually inputting the response into SPSS. For instance, what is your gender? Male or Female. So wherever the respondent tick Male, I type in 1 and if it is Female, I type in 2
Then comes the mountain of problems. Respondents leave some questions empty.
What do I do?
Reached out to my tutor, and his response was to replace them with the most occurring value i.e. the mode. The way it worked was that as I was hard coding into SPSS, subconsciously I'm storing which category I typed in more times i.e. have I typed more male than female?
Is this the best approach? No
What is the best approach?
Below is my approach to handling missing values
NB — variable refers to the column
Step 1 — Type of variable
The first question I ask myself is what type of variable am i dealing with. A continuous or categorical variable. Understanding the type of variable will help me understand what approach to use in handling the missing values
Step 2 — Importance of variable
The next question is, is the variable containing missing values important for my analysis? If the answer is No, I delete that column. if the answer is Yes, I go to Step 3
Step 3 — Percentage of missing values
What is the percentage of missing values in the whole data? My benchmark is 5%. Note that this 5% is not industry standard and there is no industry standard. If the percentage is greater than 5% of my data, I never delete missing values. Also, I don’t just delete missing values if it’s less than 5%. Deleting missing values is also dependent on the volume of data I have. If the volume of data is small, that 5% might be very significant
We’ve identified the columns with missing values and they are important for our analysis, hence the columns can’t be dropped. Deleting the missing values is also not an option because of our threshold. What manner of headache is this? Why can’t they collect clean data? Why can’t they use Data validation or ensure all questions must be answered in their Google or Microsoft forms? I just want to analyze data.
Brighten up. The light at the end of the tunnel is here
Since columns can’t be dropped and rows can’t be deleted, the next thing I do is take the following approach based on the type of variable I’m dealing with.
If the column where I have missing values is a continuous variable, I do either one of the following 1. Replace the missing values with the median or mean. 2. Create a regression model to predict the missing values. This is quite stressful as I would have to tweak my parameters to ensure my RMSE is as small as possible. Really stressful
Replacing missing values with median or mean is quite straightforward. You just need to note one very important thing. How is your data (i.e. column) is distributed?
If the data is either negatively or positively skewed, I replace missing values with the median.
If the distribution is a normal distribution I replace the missing values with either the mean or median. That’s dependent on my mood.
There are 3 conventional approaches to dealing with a categorical variable having missing values. 1. Replace with the mode i.e. if “female ”occurs twice, replace all missing values with “female”. I don’t like this approach. For instance, all respondents who didn’t fill their gender can’t be all females. This approach can lead to you saying things the data isn’t saying. 2. Prediction by creating a classification model. Can be stressful also as you might have to tune parameters. 3. Assign a new category. This is my favorite approach. I just replace missing values with either “Not specified”, “Unknown” etc. With this, I’m certain I’m not saying what the data isn’t saying.
That’s a peak view into how I deal with missing values in my day-to-day life as an analyst.