Helping with the curse of dimensionality.
So, we looked at what the curse of dimensionality is; now let’s see some techniques to mitigate it. There are several ways to approach this, and in this blog we’ll take a look at a few of them.
In this part we will look at feature selection, which helps us by keeping only the most relevant variables from the original dataset. The techniques that we will use here are:
- Missing Value Ratio
- Low Variance Filter
- High Correlation Filter
The dataset that I will be using is the Titanic dataset. OK, let's get started.
Missing Value Ratio
As we know, when we start working with a dataset we first explore the data before doing anything else. But what if we have missing values in our dataset? What would be the best approach: impute them or drop them?
What about when more than 50% of the values in a column are missing? In that case it’s usually preferable to drop the column, since there is too little data to work with. That doesn’t mean 50% is the only possible cutoff: we can set any threshold value, and if the percentage of missing values in a variable is above that threshold, we drop the variable.
We will load the dataset and then check the percentage of missing values in each variable, using .isnull() as our filter.
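Here is a minimal sketch of that check. To keep it runnable without the data file, it uses a tiny made-up DataFrame as a stand-in for the Titanic data (the values and the choice of columns are assumptions for illustration), so the percentages it prints will not match the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are made up for illustration.
# With the real file you would load it instead, e.g. pd.read_csv("train.csv")
# (the file name is an assumption).
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan, 54.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
})

# isnull() marks every missing cell as True. The mean of a boolean column is
# the fraction of True values, so multiplying by 100 gives a percentage.
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```

Taking the mean of the boolean mask is just a shortcut for dividing the count of missing cells by the number of rows.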
As you can see in the above table, there aren’t too many missing values. We can impute the columns using appropriate methods, or we can set a threshold of, say, 20% and remove the columns that have more than 20% missing values.
There we go! We just removed the columns that have more than 20% missing values.
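The threshold approach could look something like the sketch below. It again runs on a small made-up stand-in for the Titanic data, and the names threshold and df_reduced are my own; only the 20% value comes from the text.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are made up for illustration.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan, 54.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
})

threshold = 20  # percent of missing values we are willing to tolerate

# Percentage of missing values per column.
missing_pct = df.isnull().mean() * 100

# Keep the columns at or below the threshold; everything above it is dropped.
columns_to_keep = missing_pct[missing_pct <= threshold].index
df_reduced = df[columns_to_keep]

print(list(df_reduced.columns))
```

In this toy data only Cabin is above the threshold, so it is the only column removed.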
Low Variance Filter
Consider a variable in our dataset where all the observations have the same value, say 1. If we use this variable, do you think it can improve the model we will build? The answer is no, because this variable will have zero variance.
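So, we need to calculate the variance of each variable we are given, then drop the variables that have a very low variance compared to the other variables in our dataset. The reason for doing this, as I mentioned above, is that a variable with very low variance is nearly constant and so carries little information for predicting the target variable. Keep in mind that variance depends on the scale of a variable, so variances are only directly comparable when the variables are on a similar scale.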
Now let’s calculate the variance of all the numerical variables.
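A rough sketch of this step is shown below. The DataFrame is a made-up stand-in for the Titanic data, the Constant column is a hypothetical addition to show what a zero-variance variable looks like, and min_variance is an arbitrary cutoff chosen for illustration.

```python
import pandas as pd

# Toy stand-in for the Titanic data; the values are made up for illustration.
df = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
    "Constant": [1, 1, 1, 1, 1, 1],  # hypothetical column with a single value
    "Name": ["A", "B", "C", "D", "E", "F"],
})

# Variance is only defined for numbers, so select the numerical columns first.
numeric = df.select_dtypes(include="number")

# DataFrame.var() returns the sample variance (ddof=1) of each column.
variances = numeric.var()
print(variances.sort_values())

# Flag the columns whose variance falls below an illustrative cutoff.
min_variance = 0.01
low_variance_cols = variances[variances < min_variance].index.tolist()
print(low_variance_cols)
```

Because the cutoff is compared against raw variances, it only makes sense when the columns are on comparable scales; on unscaled data it mainly catches constant or near-constant columns.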
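Wow! We can see that PassengerId has incredibly high variance, but that is only because it is a running row identifier. It tells us nothing about the passenger, so we can definitely get rid of that column. We also see that Age and Fare have very high variance, but we will not get rid of them: the low variance filter only removes variables with low variance, and we know that we can reduce the variance of those columns by cleaning them up.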
High Correlation Filter
High correlation between two variables means they have similar trends and are likely to carry similar information. This can bring down the performance of some models drastically (linear and logistic regression models, for instance). We can calculate the correlation between the independent variables that are numerical in nature. If the absolute value of the correlation coefficient for a pair crosses a certain threshold, we can drop one of the two variables (which one to drop is highly subjective and should always be decided with the domain in mind).
As a general guideline, we should keep those variables which show a decent or high correlation with the target variable.
We will drop the dependent variable and then check for correlation.
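One way to sketch this check is below. The data is a made-up stand-in, Survived is assumed to be the dependent variable, and highly_correlated_columns along with the 0.9 cutoff are hypothetical choices of mine, not something fixed by the method.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are made up for illustration.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0, 54.0],
    "SibSp": [1, 1, 0, 1, 0, 0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
})


def highly_correlated_columns(features, threshold):
    """Return columns that correlate above `threshold` with an earlier column."""
    numeric = features.select_dtypes(include="number")
    # Absolute Pearson correlation between every pair of numerical columns.
    corr = numeric.corr().abs()
    # Keep only the part above the diagonal so each pair is looked at once
    # and a column is never compared with itself.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Drop the dependent variable (assumed to be Survived) before checking.
features = df.drop(columns="Survived")
print(features.corr())
print(highly_correlated_columns(features, threshold=0.9))
```

An empty list means no pair crossed the threshold. When a pair does cross it, this sketch simply flags the later column of the pair; in practice you would pick which of the two to drop based on domain knowledge.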
Look at that! Our data looks great, with no highly correlated variables. We have reduced the dimensions by cleaning our data, making it better for us to start modeling. Next we will look at more complex methods for larger datasets.