Feature selection

The purpose of this post is to explore feature selection techniques, namely:

* Correlation based
* K-best
* In-built features importance of Random Forest
* Permutation importances

Data exploration

The dataset is taken from Kaggle: https://www.kaggle.com/c/santander-customer-transaction-prediction

Load and inspect the data: check dimensions, data types, missing values and the target variable

import pandas as pd
import numpy as np
df_train = pd.read_csv('../input/train.csv')
print(df_train.shape)
df_train.head()
(200000, 202)
ID_code target var_0 var_1 var_2 var_3 var_4 var_5 var_6 var_7 ... var_190 var_191 var_192 var_193 var_194 var_195 var_196 var_197 var_198 var_199
0 train_0 0 8.9255 -6.7863 11.9081 5.0930 11.4607 -9.2834 5.1187 18.6266 ... 4.4354 3.9642 3.1364 1.6910 18.5227 -2.3978 7.8784 8.5635 12.7803 -1.0914
1 train_1 0 11.5006 -4.1473 13.8588 5.3890 12.3622 7.0433 5.6208 16.5338 ... 7.6421 7.7214 2.5837 10.9516 15.4305 2.0339 8.1267 8.7889 18.3560 1.9518
2 train_2 0 8.6093 -2.7457 12.0805 7.8928 10.5825 -9.0837 6.9427 14.6155 ... 2.9057 9.7905 1.6704 1.6858 21.6042 3.1417 -6.5213 8.2675 14.7222 0.3965
3 train_3 0 11.0604 -2.1518 8.9522 7.1957 12.5846 -1.8361 5.8428 14.9250 ... 4.4666 4.7433 0.7178 1.4214 23.0347 -1.2706 -2.9275 10.2922 17.9697 -8.9996
4 train_4 0 9.8369 -1.4834 12.8746 6.6375 12.2772 2.4486 5.9405 19.2514 ... -1.4905 9.5214 -0.1508 9.1942 13.2876 -1.5121 3.9267 9.5031 17.9974 -8.8104

5 rows × 202 columns

def check_missing(df):
    return df.isnull().sum().sum()
print(f'Missing values {check_missing(df_train)}')
Missing values 0
df_train.dtypes.value_counts()
float64    200
object       1
int64        1
dtype: int64
df_train.target.sum() / df_train.shape[0]
0.10049

It's quite a large dataset with 200 numerical features, no missing values and anonymized variable names; the target class is imbalanced (roughly 10% positives).

Sampling

In real-life scenarios we would want to keep as much information as possible given resource constraints, and probably use class weights instead of downsampling. However, for the purposes of this post, let's downsample the majority class to a 1:1 ratio, which will both speed up exploration and simplify accuracy measurement.
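
For reference, the class-weight alternative mentioned above might look roughly like this (a sketch only, not used in this post; X_full / y_full stand for the full, imbalanced data):

from sklearn.ensemble import RandomForestClassifier

# class_weight='balanced' reweights classes inversely to their frequencies,
# which is an alternative to downsampling the majority class
weighted_rf = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                                     n_jobs=-1, random_state=42)
# weighted_rf.fit(X_full, y_full)  # hypothetical full, imbalanced training data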

# Drop ID column as it doesn't contain information
df_train.drop('ID_code', axis=1, inplace=True)
df_positive_class = df_train[df_train.target == 1]
df_negative_class = df_train[df_train.target == 0].sample(df_positive_class.shape[0], replace=False, random_state=42)
sample_df = pd.concat([df_positive_class, df_negative_class], axis=0)
sample_df.target.value_counts()
1    20098
0    20098
Name: target, dtype: int64

Separate X and y, add a random feature as an importance reference, and change dtypes to float32 to speed up processing

np.random.seed(42)
X = (sample_df.drop('target', axis=1)
     .apply(lambda x: x.astype('float32'))
     .assign(rand=lambda x: np.random.rand(x.shape[0])))
y = sample_df.target

Create train test splits

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

The benchmark

A typical framework for such experiments is to build the simplest solution first and then try different techniques to improve it. Let's use the usual Random Forest classifier, which is a go-to choice for prototyping, idea testing and experimentation. We'll fit it on all features to get a benchmark and then try to reduce the number of features while keeping accuracy as high as possible. Besides the accuracy score, let's keep track of prediction time (a pretend scenario where real-time speed matters), which adds more motivation for dimensionality reduction. Remember, simple = good :)

import time
def score_and_time(model, X):
    # Time prediction on X in milliseconds; the predictions themselves aren't used,
    # since scoring relies on the out-of-bag estimate from training
    ts = time.time()
    model.predict(X)
    elapsed = int((time.time() - ts) * 1000)
    print(f'Score {model.oob_score_}, predicted in {elapsed}')
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

baseline = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=-1, random_state=42)
baseline.fit(X_train, y_train)
score_and_time(baseline, X_test)
Score 0.7065244433387238, predicted in 209

In-built feature importance

Random Forest was chosen for another nice property: it exposes in-built feature importances after fitting, which saves a considerable amount of coding. Let's use the already fitted benchmark to get the important features.

importances = pd.Series(baseline.feature_importances_, index=X_train.columns)
important_cols = importances[importances > importances['rand']].index.tolist()
print(f'Number of features with greater than random column importance {len(important_cols)}')
importances.sort_values().plot()
Number of features with greater than random column importance 178
[Plot: feature importances sorted in ascending order]
reduced_rf = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=-1, random_state=42)
reduced_rf.fit(X_train[important_cols], y_train)
score_and_time(reduced_rf, X_test[important_cols])
Score 0.7095720860803583, predicted in 205

OK, so we improved the score somewhat and prediction time is similar, but dimensionality is still quite high: 178 of the 200 original features, almost 90% of the input. Of course, if that were an issue, we could sacrifice some accuracy and choose fewer top features instead of taking everything with above-random importance. Also, the importance plot shows an exponential decay, so the 80/20 principle applies here and cutting the tail shouldn't hurt accuracy too much.
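
For illustration, a quick sketch of that idea, keeping an arbitrary top 40 features by impurity importance (not run here, so no score is reported):

# Keep only the 40 highest-importance features (40 is an arbitrary cut-off)
top40_cols = importances.sort_values(ascending=False).head(40).index.tolist()
rf_top40 = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=-1, random_state=42)
rf_top40.fit(X_train[top40_cols], y_train)
score_and_time(rf_top40, X_test[top40_cols])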

Importance based on correlation to target

Another, more traditional way to perform variable selection, related to linear models, is to pick the features most correlated with the target variable. Since most real-world relationships are not linear, we'll use Spearman's rank correlation instead of Pearson's, because it's more robust and better captures monotonic non-linearities. Also, even though pandas has an in-built correlation method, let's use scipy's implementation, since it's more efficient.

Some practitioners suggest investigating relationships between the features first, because multicollinearity affects importances: correlated variables will both seem important while carrying nearly identical information. But let's leave it for now, because it would unnecessarily complicate the implementation if we wanted to automate the process.
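
For completeness, such a check might look roughly like this (a sketch, not applied in the rest of the post; the 0.9 threshold is an arbitrary choice):

import numpy as np
import scipy.stats

# Flag feature pairs with a high absolute Spearman correlation
feature_corr = np.abs(scipy.stats.spearmanr(X_train).correlation)
pairs = np.argwhere(np.triu(feature_corr, k=1) > 0.9)
correlated_pairs = [(X_train.columns[i], X_train.columns[j]) for i, j in pairs]
print(f'Number of highly correlated pairs {len(correlated_pairs)}')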

import scipy.stats
# Absolute Spearman correlation of every column (including 'rand') to the target
train_with_target = pd.concat([X_train, y_train], axis=1)
cor = pd.DataFrame(np.abs(scipy.stats.spearmanr(train_with_target).correlation),
                   columns=train_with_target.columns, index=train_with_target.columns)['target']
non_rand_corr = cor[cor > cor['rand']].shape[0]
print(f'Number of variables with correlation to target higher than random {non_rand_corr}')
cor[cor.index != 'target'].sort_values().plot()
Number of variables with correlation to target higher than random 199
[Plot: absolute Spearman correlation to the target, sorted in ascending order]

Most of the variables have a higher absolute correlation with the target than the random column, so this cut-off alone doesn't give a useful selection strategy. Also, all variables have very low correlations with the target (about 0.1 at most), so it's not obvious how many top variables to select or what correlation threshold to set. Let's try taking the top 20% most correlated variables and see how the score changes.

top20_cols = cor[cor.index != 'target'].sort_values()[-int(cor.shape[0] * .2):].index.tolist()
rf_corr = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=-1, random_state=42)
rf_corr.fit(X_train[top20_cols], y_train)
print(f'Reduced number of columns {len(top20_cols)}')
score_and_time(rf_corr, X_test[top20_cols])
Reduced number of columns 40
Score 0.7037877845503172, predicted in 104

OK, so the score decreased slightly, but we reduced the input dimension by 80% and halved the prediction time! Of course, the top 20% was chosen arbitrarily; we could take a grid-search approach, get scores for different numbers of top variables and see what the trade-off landscape between accuracy and complexity looks like.
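
A rough sketch of that sweep (not executed here), reusing the correlation ranking from above:

# Score the model for several fractions of the most correlated columns
# to map the accuracy/complexity trade-off (fractions chosen arbitrarily)
sorted_cols = cor[cor.index != 'target'].sort_values().index
for frac in (0.1, 0.2, 0.4, 0.8):
    cols = sorted_cols[-int(len(sorted_cols) * frac):].tolist()
    rf = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=-1, random_state=42)
    rf.fit(X_train[cols], y_train)
    print(f'{frac:.0%} of columns ({len(cols)}): OOB score {rf.oob_score_:.4f}')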

In general, in-built selection is preferable, because the importances are calculated directly while fitting the trees, which saves computation and extra lines of code. Still, it's good to know alternative methods for use with algorithms that don't provide in-built feature importances.
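
As an aside, the K-best approach from the list at the top of this post is exactly such an alternative and is available out of the box as scikit-learn's SelectKBest; a minimal sketch with the ANOVA F-test scorer (k=40 chosen to match the correlation experiment, not tuned):

from sklearn.feature_selection import SelectKBest, f_classif

# Univariate selection: score each feature against the target with the ANOVA
# F-test and keep the k highest-scoring ones
skb = SelectKBest(score_func=f_classif, k=40)
skb.fit(X_train, y_train)
kbest_cols = X_train.columns[skb.get_support()].tolist()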

Permutation importance

The final technique I'm going to consider in this post is permutation importance, which, as argued in the scikit-learn documentation (https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance.html#sphx-glr-auto-examples-inspection-plot-permutation-importance-py), is superior to the default impurity-based feature importance.

Additionally, I'll wrap the feature selection part into a scikit-learn compatible estimator, so that it can be integrated into an end-to-end Pipeline.

The selector fits the base estimator, computes permutation importances to pick the columns that beat the random feature, and then refits the estimator on the selected columns:

# scikit-learn==0.22.2 or higher required
from sklearn.inspection import permutation_importance
from sklearn.base import BaseEstimator, ClassifierMixin

class FeatureSelector(BaseEstimator, ClassifierMixin):
    
    # This can be tuned to accept kwargs for pi
    def __init__(self, estimator):
        self.estimator = estimator
    
    def fit(self, X, y=None):
        # Fit on all columns first to compute permutation importances,
        # then refit the estimator on the selected subset only
        self.estimator.fit(X, y)
        self.important_cols = self._get_important_cols(X, y)
        self.estimator.fit(X[self.important_cols], y)
        return self

    def _get_important_cols(self, X, y):
        # n_repeats=1 keeps this fast; more repeats give more stable estimates
        pi = permutation_importance(self.estimator, X, y, n_repeats=1, n_jobs=-1, random_state=42)
        importances = pd.Series(pi.importances_mean, index=X.columns)
        # Keep columns whose permutation importance beats the random feature
        return importances[importances > importances['rand']].index.tolist()
    
    def predict(self, X):
        return self.estimator.predict(X[self.important_cols])
    
    @property
    def oob_score_(self):
        return self.estimator.oob_score_
from sklearn.pipeline import Pipeline

clf = RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=-1, random_state=42)
fs = FeatureSelector(clf)
fs.fit(X_train, y_train)
print(f'Number of selected features {len(fs.important_cols)}')
score_and_time(fs, X_test)
Number of selected features 21
Score 0.6743065057843015, predicted in 105
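
As a final note, the Pipeline import above can be put to use: since FeatureSelector follows the estimator API, it is a valid pipeline step. A minimal single-step sketch (any upstream steps would need to preserve the DataFrame's column names, because the selector indexes by column):

pipe = Pipeline([
    ('selector', FeatureSelector(
        RandomForestClassifier(n_estimators=100, oob_score=True, n_jobs=-1, random_state=42))),
])
# pipe.fit(X_train, y_train); pipe.predict(X_test)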

Conclusion

As you can see, using permutation importances didn't improve accuracy. It reduced the number of columns down to 21, which seems a little low; that could be due to correlated inputs, see https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance_multicollinear.html#sphx-glr-auto-examples-inspection-plot-permutation-importance-multicollinear-py for more details. Still, it was interesting to implement newer scikit-learn methods, although in practice I would use the in-built importances for dimensionality reduction and keep permutation importances or LIME / SHAP for model interpretation. Additionally, it would be worthwhile to play around with automated multicollinearity reduction techniques in the future.