Junior suggestions

Published

October 20, 2022

Amaya leads the team responsible for maintaining a machine learning model that helps run their company’s marketing budget. The model has been running for a long time, and Amaya’s team makes periodic updates and improvements.

But they want more.

The team aims to find some breakthroughs to improve the model considerably. Surprisingly, the most junior person on the team was the one coming up with two different ideas:

  1. Take each row separately, and count the missing values across all columns. Then add a new column to the dataset with that number.
  2. Replace a categorical feature on their dataset with the number of times each value appears across all samples.

Which of the following would be your recommendation for Amaya regarding these two ideas?

  1. Amaya shouldn’t consider any of these techniques because they aren’t valid forms of feature engineering.

  2. Amaya should only consider the first technique. The second one is not a valid form of feature engineering.

  3. Amaya should only consider the second technique. The first one is not a valid form of feature engineering

  4. Amaya should consider both methods.

4

Amaya should experiment with both techniques to see whether they improve their model. The two suggestions are examples of meta-features based on the rows and columns of the team’s dataset.

The first suggestion, for example, will help the model understand which rows have an excess of missing values and which are complete. Of course, there’s no guarantee this information is helpful, but that’s something Amaya will have to decide based on her experiments.

The second suggestion is an example of frequency encoding. It will make evident to the model the importance of an individual row based on how prominently it appears on the data. For instance, we could replace a job position title with the number of employees with that title.

Recommended reading