Feature Selection or Variable Selection or Attribute Selection or Variable Subset Selection

selection - is the process of selecting a subset of relevant features (variables, predictors) for use in model construction
importance/relevance - refers to techniques that assign a score to input features based on how useful they are at predicting a target variable

3 Strategies

filter strategy (e.g. information gain)
wrapper strategy (e.g. search guided by accuracy)
embedded strategy (features are added or removed while building the model, based on prediction errors)

Feature Importance - Map

TODO

resources: https://towardsdatascience.com/6-types-of-feature-importance-any-data-scientist-should-master-1bfd566f21c9


Feature importance is a fundamental concept for Machine Learning practitioners.

Due to its simplicity and intuitiveness, this indicator is not only constantly monitored by data scientists, but often communicated to the non-technical stakeholders of a predictive model.

But, despite being intuitive, the idea of “feature importance” is also somewhat vague. In fact, there are many different ways to calculate it. So it’s important to know them, along with their pros and cons, to make sure that we are answering exactly the question that we want to answer.

The purpose of this article is to shed some light on the various approaches that may be taken to calculate feature importance.

But what do we mean by “feature importance”?

Before delving into the topic, we should agree on what we mean by “feature importance”.

Suppose that we have a dataframe, called X, where each row represents an “observation” and each column represents a characteristic (or “feature”). We also have a phenomenon (or “target variable”) that we would like to predict about each observation.

Feature importance is a score between 0 and 100 assigned to each column (or feature), indicating how powerful that feature is in predicting the target variable.

Note that we also require that the importance scores of all features sum to 100. So we are implicitly assuming that we have observed all the characteristics that may be useful to explain the target variable.
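As a minimal sketch of this convention, the snippet below rescales raw scores so that they sum to 100. The helper name normalize_to_100 and the raw scores are made up for illustration; any non-negative scoring method could feed into it.

```python
def normalize_to_100(scores):
    """Rescale non-negative raw scores so that they sum to 100."""
    total = sum(scores.values())
    if total == 0:
        raise ValueError("all scores are zero; importance is undefined")
    return {name: 100.0 * value / total for name, value in scores.items()}


# Hypothetical raw scores for three features (illustrative numbers only).
raw_scores = {"age": 3.0, "height": 1.0, "income": 1.0}
importance = normalize_to_100(raw_scores)
print(importance)
```

The same "divide by the sum, multiply by 100" step is what the fimpo lines do in the pandas snippets throughout this page.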

A taxonomy of feature importance

Let’s start with a conceptual map of the types of feature importance.

A conceptual map of feature importance categories.

The first distinction that we encounter is between univariate and multivariate importance. The difference between the two is that:

Univariate importance considers each feature individually.

Multivariate importance measures the contribution of each feature conditionally to all the other features.

These two types of importance convey a very different meaning, we could say “orthogonal”, to each other. This is why I personally prefer to provide 2 types of importance: one univariate and one multivariate. In this way, the final user can get a more complete picture.

1. Univariate

The goal of univariate importance is to get a score of how much the target variable depends on each feature, ignoring all the other features. The score can be obtained from a closed formula (e.g. Pearson correlation) or from a predictive model (e.g. the area under the ROC curve of a random forest fitted on that feature alone). Then, the scores of all the features are normalized to sum to 100.

In general, these are the main pros and cons of univariate importance:

Pros: It is easy to grasp, since we humans tend to reason in univariate terms. Moreover, it doesn’t suffer from correlated features.

Cons: It may oversimplify reality. In fact, it doesn’t take into account how the features interact with one another.

Let me explain what I mean when I say that univariate importance doesn’t suffer from correlated features.

Suppose you have 2 features, height and age, and you use a random forest to predict your target variable. This is their importance:

Feature importance (2 features)

They are equally important, both close to 50%.

However, what happens if you have features that are highly correlated with one another? If, for instance, you have not only age in years, but also age in days and age in months, and you try to calculate multivariate importance, this is what you get:

Multivariate feature importance

Since the 3 features about age carry the same information, the overall importance of age (roughly 50%) is practically shared among them.

But if we use univariate importance, we get a different picture: all features get approximately the same level of importance, which reflects the truth more accurately:

Univariate feature importance

Now that we have seen the pros and cons of univariate importance, let’s see some methods that belong to this category. Many algorithms can be used, but, as an example, we will see 3 relevant methods implemented in Python.

- F-statistic

The simplest type of univariate importance is given by the F-statistic (from one-way ANOVA) for categorical target variables. In this case, you want to predict a target variable consisting of 2 or more classes, based on the value of a numeric feature. The F-statistic is then calculated as the ratio of the between-group variability to the within-group variability of the feature, where the groups are the classes of the target.

Note that, for a continuous target variable, this is equivalent to using Pearson correlation, since correlation can be “converted” to F-statistic through a simple formula.

Scikit-learn has two functions for the categorical and the continuous case, respectively f_classif and f_regression, so you can use them through:
import pandas as pd
from sklearn.feature_selection import f_classif
f = pd.Series(f_classif(X, y)[0], index = X.columns)
fimpo = f / f.sum() * 100

if y is categorical, or
from sklearn.feature_selection import f_regression
f = pd.Series(f_regression(X, y)[0], index = X.columns)
fimpo = f / f.sum() * 100

if y is continuous.
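To make the between-group / within-group ratio concrete, here is a dependency-free sketch that computes the one-way ANOVA F-statistic by hand on a tiny made-up sample. It also checks the correlation-to-F conversion, assuming the usual formula F = r² / (1 − r²) · (n − 2); with only two classes the two numbers coincide. The helper names and the data are illustrative, not part of Scikit-learn.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F: between-group variance over within-group variance."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def f_from_r(r, n):
    """Convert a correlation into an F-statistic (assumed formula, see lead-in)."""
    return r ** 2 / (1 - r ** 2) * (n - 2)


# Made-up numeric feature and a two-class target.
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 7.0]
target = [0, 0, 0, 1, 1, 1]

f_anova = anova_f(feature, target)
f_corr = f_from_r(pearson_r(feature, target), len(feature))
print(f_anova, f_corr)
```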

- Maximal Information Coefficient

F-statistic or Pearson correlation are very simple. In fact, the first one addresses only differences between means and the second one only linear relationships.

A more robust approach would be to use Mutual Information (MI), which can be thought of as the reduction in uncertainty about one random variable given knowledge of another. Note that this also accounts for non-linear relationships.

However, the main drawback of MI is that its basic estimate requires the variables to be discrete, which is often not the case. The Maximal Information Coefficient (MIC) is designed to overcome this issue: it automatically bins the continuous features for us, in such a way that the MI between them is maximal.
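For intuition, mutual information between two discrete variables can be estimated directly from counts. The sketch below is a plain plug-in estimate in bits, not the MIC algorithm itself (no automatic binning), and the toy data are made up. The dependent target has zero Pearson correlation with x, yet MI still detects the relationship.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in bits for two discrete sequences."""
    n = len(x)
    count_x, count_y = Counter(x), Counter(y)
    count_xy = Counter(zip(x, y))
    return sum(
        (c / n) * log2(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )


x = [0, 1, 2, 3, 0, 1, 2, 3]
# Non-linear dependence: 1 only for the middle values of x
# (the Pearson correlation between x and this target is zero).
y_dependent = [1 if v in (1, 2) else 0 for v in x]
# A target that is independent of x in this sample.
y_unrelated = [0, 0, 0, 0, 1, 1, 1, 1]

mi_dep = mutual_information(x, y_dependent)
mi_unrel = mutual_information(x, y_unrelated)
print(mi_dep, mi_unrel)
```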

In Python, you can find an implementation of MIC in the library Minepy (which you can install through pip install minepy). Supposing that X is a Pandas DataFrame and y is a Pandas Series, you can obtain the MIC of each feature with:
from minepy import MINE
def get_mic(x, y):
    mine = MINE()
    mine.compute_score(x, y)
    return mine.mic()
f = X.apply(lambda feature: get_mic(feature, y))
fimpo = f / f.sum() * 100

- Predictive Power Score

Since we want to estimate the relationship between one feature and one target variable, why not fit a predictive model, such as a decision tree, on that single feature? This is the idea behind Predictive Power Score (PPS), which was first presented in this Towards Data Science article.

The trick is that the model is non-linear, so it is also able to capture non-linear relationships. Moreover, PPS has some other convenient properties, such as being asymmetric, and being implemented for each possible combination of categorical/numerical feature/target.

The dedicated Python library can be installed via pip install ppscore.
import ppscore
# column_target is the name of the target column (e.g. y.name)
f = ppscore.predictors(pd.concat([X, y], axis = 1), column_target).set_index('x')['ppscore']
fimpo = f / f.sum() * 100
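The idea of "one model per feature" and the asymmetry of the score can be illustrated without the library. The sketch below is not ppscore's implementation: it swaps the decision tree for a simple lookup table (the median target per distinct feature value), skips cross-validation, and assumes the regression-style normalization 1 − MAE_model / MAE_naive, where the naive model always predicts the median. All names and data are illustrative.

```python
from collections import defaultdict
from statistics import mean, median


def mae(pred, actual):
    """Mean absolute error."""
    return mean(abs(p - a) for p, a in zip(pred, actual))


def pps_sketch(feature, target):
    """Toy predictive power of `feature` for `target`, between 0 and 1."""
    by_value = defaultdict(list)
    for x, t in zip(feature, target):
        by_value[x].append(t)
    # Stand-in for a fitted single-feature model: median target per feature value.
    lookup = {x: median(ts) for x, ts in by_value.items()}
    mae_model = mae([lookup[x] for x in feature], target)
    mae_naive = mae([median(target)] * len(target), target)
    if mae_naive == 0:
        return 0.0
    return max(0.0, 1.0 - mae_model / mae_naive)


x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]

x_to_y = pps_sketch(x, y)  # x determines y exactly
y_to_x = pps_sketch(y, x)  # y leaves the sign of x unknown
print(x_to_y, y_to_x)
```

Knowing x tells us y exactly, while knowing y does not help to recover x beyond the naive guess, which is why the two scores differ.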

2. Multivariate

Multivariate importance seeks to answer the question: what is the overall importance of a feature, considering also what we know about all the other features? In other words, it considers also interactions among features.

Let’s see the pros and cons of multivariate importance.

Pros: It is more complete, because it takes into account interactions. Moreover, it is useful to demystify predictive models.

Cons: As shown in the paragraph above, it may be misleading when features are highly correlated.

Let’s borrow an example from this paper by Lawrence Hamilton to explain why interactions are important. Education level may not say much about how worried a person is about rising sea level. But, if you combine this feature with political ideology, it suddenly becomes relevant. Therefore, the univariate importance of education level for this topic is low, whereas its multivariate importance is high.

An example of interaction between features (education and political ideology) in predicting concern about sea level rise. [Hamilton, Who Cares about Polar Regions? Results from a Survey of U.S. Public Opinion, 2008]
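A purely synthetic illustration of the same effect (the numbers below are made up and are not taken from Hamilton's survey): when the target is the XOR of two binary features, neither feature alone predicts it better than a coin flip, but the two together predict it perfectly. The helper best_lookup_accuracy is a hypothetical stand-in for "the best model one could fit on these columns".

```python
from collections import Counter, defaultdict


def best_lookup_accuracy(rows, target):
    """Accuracy of predicting the majority target within each distinct row."""
    by_key = defaultdict(list)
    for key, t in zip(rows, target):
        by_key[key].append(t)
    correct = sum(Counter(ts).most_common(1)[0][1] for ts in by_key.values())
    return correct / len(target)


# Synthetic binary features; the target is their XOR.
education = [0, 0, 1, 1]
ideology = [0, 1, 0, 1]
concern = [e ^ i for e, i in zip(education, ideology)]

acc_education = best_lookup_accuracy([(e,) for e in education], concern)
acc_ideology = best_lookup_accuracy([(i,) for i in ideology], concern)
acc_both = best_lookup_accuracy(list(zip(education, ideology)), concern)
print(acc_education, acc_ideology, acc_both)
```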

Within multivariate importance, one can differentiate between two sub-categories:

Fit-time: Feature importance is available as soon as the model is trained.

Predict-time: Feature importance is available only after the model has scored on some data.

Let’s see each of them separately.

3. Fit-time

With fit-time methods, feature importance can be computed at the end of the training phase. So, it reflects what the model has learnt on training data. Simplifying a bit, the more a feature has been used during the training phase, the more important that feature is.

This is also called “intrinsic” importance, because the calculation method is model-specific.

Pros: It is fast, since it can be read directly from the trained model, so nothing is required besides training the model. It is also convenient because it is built into some Python models. For instance, for tree-based models in Scikit-learn, it’s enough to read the model.feature_importances_ attribute.

Cons: It may assign a high importance to features that do not work well on unseen data. It is also model-dependent: for some models it is difficult, if not impossible, to calculate. Finally, impurity-based feature importances for trees are strongly biased in favor of high-cardinality features (see Scikit-learn documentation).

Since fit-time importance is model-dependent, we will see just examples of methods that are valid for tree-based models, such as random forest or gradient boosting, which are the most popular ones.

- Impurity Reduction

In decision trees, splits are chosen in order to reduce a measure of impurity (such as Gini, Entropy or MSE) within groups of observations. Therefore, it is natural to consider the features that are responsible for the greatest decrease in impurity as the most important ones.

Thus, the importance of feature f is calculated as the total impurity decrease over all nodes that split on f, where each decrease is weighted by the share of observations that reach that node. For an ensemble, the result is averaged over the trees and normalized to sum to 1.

This is the default importance that you get from model.feature_importances_ for tree-based Scikit-learn models.

For instance,
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier().fit(X, y)
fimpo = pd.Series(rf.feature_importances_ * 100, index = X.columns)
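To see what a single split contributes, here is a small hand computation of the weighted Gini decrease for one node. It is a sketch of the quantity described above on made-up labels, not Scikit-learn's internal code; the function names are illustrative.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a group of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of rows in the node."""
    n = len(parent)
    child_impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
    return (n / n_total) * (gini(parent) - child_impurity)


# Made-up labels reaching a root node, and the two children after a split.
parent = [0, 0, 0, 1, 1, 1, 1, 1]
left, right = [0, 0, 0, 1], [1, 1, 1, 1]

decrease = weighted_impurity_decrease(parent, left, right, n_total=len(parent))
print(decrease)
```

Summing this quantity over every node that splits on a given feature gives that feature's raw (unnormalized) importance within one tree.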

- Split Count

Another approach, described here, is simply to count the number of times a feature has been used to split across all the trees. Intuitively, a feature that has been used 10 times is twice as important as a feature that has been used only 5 times.

This method is natively available in the XGBoost library:
from xgboost import XGBClassifier
xgb = XGBClassifier().fit(X, y)
f = pd.Series(xgb.get_booster().get_score(importance_type='weight'))
fimpo = f / f.sum() * 100

- Coverage

Counting the number of splits may be misleading. For instance, some splits may concern just a few observations, so they are not actually that relevant. To overcome this issue, one can weight each split by its coverage, that is, the number of observations affected by the split.

This method is natively available in the XGBoost library:
from xgboost import XGBClassifier
# 'cover' is the average coverage per split; 'total_cover' sums it over all splits
f = pd.Series(xgb.get_booster().get_score(importance_type='cover'))
fimpo = f / f.sum() * 100
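The difference between counting splits and weighting them by coverage shows up clearly on a toy list of splits. The (feature, number of observations) pairs below are invented for illustration and do not come from a real XGBoost model; coverage is taken here simply as the number of rows reaching the split, as described above.

```python
from collections import Counter

# Hypothetical splits collected from all trees: (feature, rows reaching the split).
splits = [
    ("age", 1000), ("age", 400),
    ("fare", 600), ("fare", 20), ("fare", 10), ("fare", 5),
]


def to_percent(scores):
    """Normalize raw scores so that they sum to 100."""
    total = sum(scores.values())
    return {name: 100.0 * value / total for name, value in scores.items()}


# Split count: every split is worth 1.
split_count = Counter(feature for feature, _ in splits)

# Coverage: every split is worth the number of rows it affects.
coverage = Counter()
for feature, n_obs in splits:
    coverage[feature] += n_obs

count_importance = to_percent(split_count)
cover_importance = to_percent(coverage)
print(count_importance)
print(cover_importance)
```

Here "fare" wins by split count because it is used often, but mostly in tiny nodes, so "age" wins once coverage is taken into account.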

4. Predict-time

The main drawback of fit-time importance is that it’s model-dependent. Thus, for some models, such as logistic regression or neural networks, it would be very troublesome to calculate.

To overcome this issue, there exists a set of methods that can be calculated at predict-time, i.e. after the model has completed the training phase. This also implies that these methods can be applied on datasets other than the training dataset.

Pros: It is model-agnostic. Moreover, since it is done at predict-time, it can be calculated for different datasets, which proves very useful in real applications.

Cons: It can be slow to compute, because it may require many repeated predictions (permutation importance) or complex methods (SHAP).

Within the category of Predict-time methods, there are two main types of algorithms:

Target required: besides the input data for the model, you also need to know the target variable;

Target not required: you can use these methods even if you don’t know the target variable.

Let’s see each of them separately.

5. Target required

Within this category, the main algorithm is “Permutation Importance”. These are its pros and cons:

Pros: It is model-agnostic. Moreover, since it is done at predict-time, it can be calculated for different datasets, which proves very useful in real applications.

Cons: It cannot be used when the target variable is not available. It also suffers from highly correlated features: when you randomly shuffle a feature, the model can still rely on the other features that are correlated with the one you have shuffled. This results in an underestimate of the importance of all the correlated features.

- Permutation Importance

Suppose you have already trained a predictive model M.

As the name implies, this algorithm randomly shuffles one feature at a time, and makes the prediction of model M on the dataset containing the shuffled column. Then, it calculates the performance score (for instance, the area under the ROC curve) on the prediction.

This is repeated for all the features in the dataset. The idea is that the worsening in the performance score (compared to the performance of M on the “unspoilt” dataset) is proportional to the importance of the feature. Note that this procedure is carried out on the test dataset, not on the dataset on which the model was trained.

Why shuffle the feature rather than, for instance, impute a fixed value? Because permutation preserves the original distribution of the feature. This is safer, since we are using a model that has been trained on the original feature.

Permutation importance is readily available in Scikit-learn:
from sklearn.inspection import permutation_importance
# X and y should be held-out (test) data, as noted above
f = permutation_importance(model, X, y)['importances_mean']
fimpo = f / f.sum() * 100
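The shuffle-and-rescore loop described above fits in a few lines of plain Python. This is a hand-written sketch, not Scikit-learn's implementation: the "model" is a hypothetical fixed rule that only looks at age, the data are synthetic, and accuracy is used as the performance score.

```python
import random


def accuracy(model, rows, targets):
    """Share of rows for which the model predicts the right target."""
    return sum(model(r) == t for r, t in zip(rows, targets)) / len(rows)


def permutation_importance_sketch(model, rows, targets, n_repeats=5, seed=0):
    """Average drop in accuracy after shuffling one feature at a time."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, targets)
    importances = {}
    for feature in rows[0]:
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[feature] for r in rows]
            rng.shuffle(shuffled)
            # Replace only the shuffled column; all other columns stay intact.
            permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, targets))
        importances[feature] = sum(drops) / n_repeats
    return importances


def model(row):
    """Hypothetical already-trained model: it uses age and ignores noise."""
    return int(row["age"] > 30)


# Synthetic held-out data.
rows = [{"age": 20 + i, "noise": (7 * i) % 5} for i in range(20)]
targets = [int(r["age"] > 30) for r in rows]

importances = permutation_importance_sketch(model, rows, targets)
print(importances)
```

Shuffling "noise" never changes the predictions, so its importance is exactly zero, whereas shuffling "age" breaks the link with the target and the score drops.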

6. Target not required

In real applications, the whole point of machine learning is to make predictions on instances for which we still haven’t observed the target variable. So, it is particularly useful to calculate feature importance on these instances. This is the main characteristic of this set of methods.

Pros: It can also be applied to datasets for which we don’t have the target variable. This is vital for real-world applications, when we want to compute feature importance for the latest data on which we are making predictions.

Cons: It may be very slow to compute.

The methods that we will see are based on Shap values. Shap values can be calculated through the Python library of the same name. Supposing that you have already trained a model, this is how you can get the respective Shap values:
import shap
explainer = shap.Explainer(model)
# assumes a single-output model, so that .values is 2-dimensional (rows x features)
shap_values = pd.DataFrame(explainer(X).values, columns = X.columns)

- Absolute Importance

Since each Shap value tells how much that feature’s value “moves” the final prediction up or down, the sum of the absolute Shap values of a feature is a natural proxy for the relevance of that feature on the chosen dataset. The absolute value is needed before summing because negative effects are as important as positive effects.
f = shap_values.abs().sum()
fimpo = f / f.sum() * 100
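For a feel of what these numbers look like, the sketch below builds Shap-like values by hand for a hypothetical linear model, assuming the standard result that, with independent features, the Shap value of feature j on a row is weight_j * (x_j − mean(x_j)). The weights and rows are made up, and no call to the shap library is involved.

```python
from statistics import mean

# Hypothetical linear model: prediction = bias + sum(weight * feature value).
weights = {"age": 2.0, "fare": -0.5}
bias = 1.0
rows = [
    {"age": 1.0, "fare": 4.0},
    {"age": 3.0, "fare": 0.0},
    {"age": 2.0, "fare": 2.0},
]


def predict(row):
    return bias + sum(weights[f] * row[f] for f in weights)


# Shap values of a linear model under the independence assumption.
feature_means = {f: mean(r[f] for r in rows) for f in weights}
shap_values = [
    {f: weights[f] * (r[f] - feature_means[f]) for f in weights} for r in rows
]

# Absolute importance: sum of absolute Shap values per feature, scaled to 100.
abs_total = {f: sum(abs(sv[f]) for sv in shap_values) for f in weights}
total = sum(abs_total.values())
fimpo = {f: 100.0 * v / total for f, v in abs_total.items()}
print(fimpo)
```

On every row the Shap values add up to the difference between that row's prediction and the average prediction, which is what "moving the prediction up or down" means.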

- Main Factor

Rather than the overall effects, we may be interested in knowing which is the single most relevant feature for each observation. This approach is interesting, because feature importance can be interpreted directly as a percentage, so it is much easier to understand. For example, if the importance of “Age” was 25%, this would mean that for 25% of the observations “Age” was the most important feature.

To obtain this quantity, it is enough to take, for each row, the feature with the highest absolute Shap value, and then count how often each feature comes out on top. Supposing that Shap values are contained in a Pandas DataFrame:
fimpo = shap_values.abs().idxmax(axis = 1).value_counts(normalize=True) * 100

- Main Factor with Sign

As we have seen, Shap values may be positive or negative. Positive values contribute to increase the final prediction, while negative values contribute to lower the final prediction. Thus, it is sometimes interesting to know which features impact most in either of the two directions.

This corresponds to calculating “Main Factor” importance that we have seen in the previous paragraph, just after having removed negative or positive values, depending on which sign you are interested in.

For example, if you want to know which features are most relevant in raising the final prediction, then you should put a floor at 0:
fimpo = shap_values.clip(lower = 0).abs().idxmax(axis = 1).value_counts(normalize = True) * 100

On the contrary, if you want to know which features are most relevant in lowering the final prediction, then you should put a cap at 0:
fimpo = shap_values.clip(upper = 0).abs().idxmax(axis = 1).value_counts(normalize = True) * 100
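The three "main factor" variants can be traced by hand on a tiny table of Shap values. The sketch below uses plain dictionaries and invented numbers. One design choice differs from the one-liners above and is my own assumption: a row with no effect in the requested direction is skipped, whereas (as far as I can tell) idxmax on an all-zero row would attribute it to the first column.

```python
from collections import Counter

# Invented Shap values: one dict per observation.
shap_values = [
    {"age": 0.5, "sex": -1.2, "fare": 0.1},
    {"age": -0.3, "sex": 0.9, "fare": 0.2},
    {"age": 0.8, "sex": -0.1, "fare": -0.4},
    {"age": -0.2, "sex": -0.7, "fare": -0.1},
]


def main_factor(rows, sign=None):
    """Share (in %) of rows in which each feature has the largest effect.

    sign=None ranks by absolute value; "positive" or "negative" keeps only
    one direction, and rows with no effect in that direction are skipped.
    """
    winners = Counter()
    for row in rows:
        if sign == "positive":
            scores = {f: max(v, 0.0) for f, v in row.items()}
        elif sign == "negative":
            scores = {f: max(-v, 0.0) for f, v in row.items()}
        else:
            scores = {f: abs(v) for f, v in row.items()}
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            winners[best] += 1
    n = sum(winners.values())
    if n == 0:
        return {}
    return {f: 100.0 * c / n for f, c in winners.items()}


overall = main_factor(shap_values)
raising = main_factor(shap_values, sign="positive")
lowering = main_factor(shap_values, sign="negative")
print(overall, raising, lowering)
```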

Comparing importance

Let’s see the outcome of the methods we have seen above on the most well-known dataset of all time:

Feature importance methods compared on Titanic dataset. [Image by Author]

As you can see, the estimates are pretty different across the various methods.

Sex is surprisingly high in permutation and main-factor importance, consistently higher than all the univariate methods. This suggests that Sex has probably many interactions with the other features.

It is also interesting to notice how different “main factor positive” and “main factor negative” are from “main factor”. For example, looking at Sex, the fact that “main factor negative” is higher than “main factor positive” suggests that Sex is more often a decisive factor in “condemning” a passenger’s life, rather than saving it.

In general, the takeaway is that it is useful to compare different types of importance (generally at least one univariate and one multivariate) to get insights about both the phenomenon and the predictive model.

／var／log marcus chiu
