Dimension Reduction Techniques
Helping with the curse of dimensionality, part 2.
So, in my previous article we saw a couple of dimension reduction techniques. In this part we will look at some other, more involved forms of feature selection. The techniques we will use here are:
- Random Forest
- Backward Feature Elimination
- Forward Feature Selection
I will be working with the same dataset as before. Let’s get started!
Random Forest
Random Forest is a very popular algorithm for feature selection. It comes with a built-in feature importance measure, so there is no need to break your head coding your own.
Random Forest can help us even with smaller datasets. One thing to keep in mind is that using Random Forest for feature selection is very similar to building a model: all of the input data has to be numerical, so we need to make sure our dataset is cleaned and encoded accordingly.
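Just as a rough sketch (assuming the dataset from the previous article lives in a CSV called train.csv and has a target column named target — both hypothetical names), the cleaning step could look something like this:

```python
import pandas as pd

# Assumption: same CSV as the previous article; the file name is hypothetical
df = pd.read_csv("train.csv")

# Keep only the numerical columns, since that is what we feed the forest
numeric_df = df.select_dtypes(include="number")

# Fill missing values with the column median (a simple, common choice)
numeric_df = numeric_df.fillna(numeric_df.median())

X = numeric_df.drop(columns=["target"])  # features
y = numeric_df["target"]                 # target variable
```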
Now that our data is clean we can fit our model. There is no need to use train_test_split() here, since we are not building a predictive model; we only want the feature importances of our data.
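A minimal sketch, continuing from the X and y above and assuming a regression target (swap in RandomForestClassifier if yours is a classification problem):

```python
from sklearn.ensemble import RandomForestRegressor

# Fit on the full dataset -- no train/test split, we only want the importances
model = RandomForestRegressor(n_estimators=100, random_state=1)
model.fit(X, y)

# One importance score per feature; the scores sum to 1
importances = model.feature_importances_
```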
Now that we’ve done that, let’s plot it!
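Something like this should do the trick (continuing the sketch above):

```python
import matplotlib.pyplot as plt
import numpy as np

# Sort features so the most important ones end up at the top of the chart
order = np.argsort(importances)
plt.barh(X.columns[order], importances[order])
plt.xlabel("Feature importance")
plt.title("Random Forest feature importances")
plt.tight_layout()
plt.show()
```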
As we can see, we get results similar to our previous feature selection methods. Now we can either pick the features we want to keep manually, or use the SelectFromModel() method, which selects features based on their importance weights.
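A quick sketch of how that could look with the fitted forest from above (by default, SelectFromModel keeps the features whose importance is above the mean importance):

```python
from sklearn.feature_selection import SelectFromModel

# prefit=True tells SelectFromModel that the forest is already fitted
selector = SelectFromModel(model, prefit=True)
X_selected = selector.transform(X)

# get_support() is a boolean mask of the features that were kept
selected_features = X.columns[selector.get_support()]
print(selected_features.tolist())
```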
Backward Feature Elimination
Now let’s look at Backward Feature Elimination. These are the steps needed to complete this technique:
- We first take all the n variables present in our dataset and train the model using them
- We then calculate the performance of the model
- Now, we compute the performance of the model after eliminating each variable (n times), i.e., we drop one variable every time and train the model on the remaining n-1 variables
- We identify the variable whose removal has produced the smallest (or no) change in the performance of the model, and then drop that variable
- Repeat this process until no variable can be dropped
This technique can be used when building Linear Regression or Logistic Regression models. Here I’ll be using a different dataset: the housing prediction dataset from SKLearn. Let’s take a look:
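The snippet below is only a sketch: I’m assuming scikit-learn’s California housing data (fetch_california_housing) as the housing dataset, a plain LinearRegression as the model, and RFE (Recursive Feature Elimination) as the implementation of the backward elimination idea:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# Assumption: California housing stands in for "the housing dataset"
housing = fetch_california_housing(as_frame=True)
X = housing.data    # 8 numerical features
y = housing.target  # median house value

# RFE repeatedly drops the weakest feature until only
# n_features_to_select remain (4 is an arbitrary choice here)
rfe = RFE(estimator=LinearRegression(), n_features_to_select=4)
rfe.fit(X, y)
```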
Now that we’ve done that, let’s take a look at the features and how they were ranked!
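Continuing the sketch, ranking_ and support_ show the order in which features were eliminated and which ones survived:

```python
# ranking_ assigns 1 to the selected features; features eliminated
# earlier in the process get higher (worse) ranks
for feature, rank in zip(X.columns, rfe.ranking_):
    print(f"{feature}: rank {rank}")

# support_ is a boolean mask of the surviving features
print(X.columns[rfe.support_].tolist())
```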
Forward Feature Selection
This is the opposite process of the Backward Feature Elimination we saw above. Instead of eliminating features, we try to find the best features which improve the performance of the model. This technique works as follows:
- We start with a single feature. Essentially, we train the model n times, using each feature separately
- The variable giving the best performance is selected as the starting variable
- Then we repeat this process and add one variable at a time. The variable that produces the highest increase in performance is retained
- We repeat this process until no significant improvement is seen in the model’s performance
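As a sketch (reusing the housing features X and target y from the previous section), the scores referred to below can be obtained with f_regression from sklearn.feature_selection:

```python
from sklearn.feature_selection import f_regression

# One F-value and one p-value per feature
f_values, p_values = f_regression(X, y)
```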
This returns an array containing the F-values of the variables and the p-values corresponding to each F-value. Refer to this link to learn more about F-values. For our purposes, we will select the variables with an F-value greater than 10:
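Continuing the sketch above, that selection is just a filter on the F-values:

```python
# Keep only the features whose F-value is greater than 10
selected = [col for col, f in zip(X.columns, f_values) if f > 10]
print(selected)
```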
This gives us the topmost variables based on the forward feature selection algorithm.
Keep in mind that both Backward Feature Elimination and Forward Feature Selection are time-consuming and computationally expensive. They are practically only used on datasets with a small number of input variables.
There we go! We saw some simple methods to do feature elimination and reduce the curse of dimensionality, so go and check them out!