Machine Learning Notes: Mutual Information

1、 Summary

When you first encounter a new dataset, an important first step is to build a ranking with a feature utility metric, a function that measures the association between a feature and the target. You can then choose a small set of the most useful features to develop first.

The metric we use is called "mutual information". Mutual information is much like correlation in that it measures a relationship between two quantities. Its advantage is that it can detect any kind of relationship, whereas correlation only detects linear relationships.
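To make the contrast with correlation concrete, here is a minimal sketch on synthetic data. The quadratic relationship, the sample size, and the seeds are all assumptions chosen for illustration; only `mutual_info_regression` comes from the example later in these notes.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)           # fixed seed, chosen arbitrarily
x = rng.uniform(-1.0, 1.0, size=2000)    # synthetic feature
y = x ** 2                               # target depends on x, but not linearly

# Pearson correlation only captures the linear part of the relationship.
pearson_r = np.corrcoef(x, y)[0, 1]

# mutual_info_regression expects a 2-D feature matrix, hence the reshape.
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson r: {pearson_r:.3f}")
print(f"MI score:  {mi:.3f}")
```

The correlation should come out close to zero because the parabola is symmetric around zero, while the MI score should be clearly positive because knowing x determines y.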

Mutual information is a good general-purpose metric. It is especially useful at the start of feature development, when you may not yet know which model you will use.

Mutual information is easy to use and interpret, computationally efficient, theoretically well-founded, resistant to overfitting, and able to detect any kind of relationship.

2、 Mutual information and what it measures

Mutual information describes relationships in terms of uncertainty. The mutual information (MI) between two quantities is a measure of how much knowing one quantity reduces uncertainty about the other. If you knew the value of a feature, how much more confident would you be about the target?

Here is an example from the Ames Housing data. The figure shows the relationship between the exterior quality of a house and its sale price. Each point represents a house.

Figure: Knowing the exterior quality of a house reduces uncertainty about its sale price.

As the figure shows, knowing the value of ExterQual should make you more certain about the corresponding SalePrice: each category of ExterQual tends to concentrate SalePrice within a certain range. The mutual information between ExterQual and SalePrice is the average reduction in uncertainty about SalePrice, taken over the four values of ExterQual. Because Fair occurs less often than Typical, for example, Fair carries less weight in the MI score.

What we call uncertainty is measured with a quantity from information theory known as "entropy". The entropy of a variable means roughly: "How many yes-or-no questions would you need, on average, to describe an occurrence of that variable?" The more questions you have to ask, the more uncertain you are about the variable. Mutual information is how many questions about the target you expect the feature to answer.
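The "yes-or-no questions" picture and the "average reduction in uncertainty" definition can both be sketched in a few lines of plain Python. The counts below are invented for illustration and are not taken from the Ames data; the function name `entropy_bits` is likewise just a label for this sketch.

```python
from math import log2

def entropy_bits(probs):
    """Shannon entropy in bits: the average number of yes/no questions."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Eight equally likely outcomes take three yes/no questions to pin down.
print(entropy_bits([1 / 8] * 8))   # 3.0

# Invented toy counts (NOT from the Ames data): rows are feature categories,
# columns are "low" / "high" target bands.
counts = {
    "Fair":    [9, 1],
    "Typical": [40, 20],
    "Good":    [6, 24],
}
total = sum(sum(row) for row in counts.values())
n_bands = 2

# Uncertainty about the target before looking at the feature.
target_marginal = [
    sum(row[j] for row in counts.values()) / total for j in range(n_bands)
]
h_target = entropy_bits(target_marginal)

# Average uncertainty left after seeing the feature, with each category
# weighted by how often it occurs (rare categories count for less).
h_given_feature = sum(
    (sum(row) / total) * entropy_bits([c / sum(row) for c in row])
    for row in counts.values()
)

mi_bits = h_target - h_given_feature
print(f"H(target) = {h_target:.3f} bits, MI = {mi_bits:.3f} bits")
```

In this toy table the rare Fair category is very informative on its own, yet it contributes little to the total because it accounts for only 10 of the 100 rows.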

3、 Interpreting mutual information scores

The least possible mutual information between two quantities is 0.0. When MI is zero, the quantities are independent: neither can tell you anything about the other. Conversely, there is in theory no upper bound on MI. In practice, though, values above about 2.0 are uncommon. (Mutual information is a logarithmic quantity, so it increases very slowly.)
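Both claims, zero MI under independence and slow logarithmic growth, can be checked exactly on small probability tables. This is a sketch in plain Python; it uses natural logarithms on the assumption that you want numbers comparable to scikit-learn's scores, which are reported in nats. The helper names are made up for this example.

```python
from math import log

def mutual_information(joint):
    """MI in nats of a joint probability table given as a list of rows."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

# Independent quantities: every cell equals the product of its marginals.
independent = [[0.25, 0.25], [0.25, 0.25]]
print(mutual_information(independent))   # 0.0

def perfect_dependence(k):
    """Feature and target always agree, with k equally likely values."""
    return [[1 / k if i == j else 0.0 for j in range(k)] for i in range(k)]

# With perfect dependence, MI equals log(k): it grows very slowly with k.
for k in (2, 8, 64):
    print(k, round(mutual_information(perfect_dependence(k)), 3))
```

Going from 2 to 64 perfectly resolved outcomes raises MI only from about 0.69 to about 4.16. A score of 2.0 corresponds to a feature that perfectly resolves roughly e^2, or about 7.4, equally likely target values, which is why scores above 2.0 are rare in real data.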

The following figure shows how MI values correspond to the kind and degree of association between a feature and the target.

Figure, left: Mutual information increases as the dependence between feature and target becomes tighter. Right: Mutual information can capture any kind of association (not just linear, like correlation).

Here are some things to remember when applying mutual information:

MI can help you understand the relative potential of a feature as a predictor of the target, considered by itself.

A feature may be very informative when interacting with other features, but not very informative on its own. MI cannot detect interactions between features. It is a univariate metric.

The actual usefulness of a feature depends on the model you use it with. A feature is only useful to the extent that its relationship with the target is one your model can learn. A high MI score does not mean your model will be able to do anything with that information. You may need to transform the feature first to expose the association.
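The point about interactions can be checked on a tiny made-up example: two binary features whose XOR is the target. The data and the helper `mutual_information_bits` are assumptions for this sketch; the helper is a simple plug-in estimate for discrete values, not the estimator scikit-learn uses.

```python
from collections import Counter
from itertools import product
from math import log2

def mutual_information_bits(xs, ys):
    """Plug-in MI estimate in bits from two equal-length sequences of discrete values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

# Toy data: every combination of two binary features; the target is their XOR.
rows = list(product([0, 1], repeat=2))
a = [r[0] for r in rows]
b = [r[1] for r in rows]
target = [x ^ y for x, y in rows]

print(mutual_information_bits(a, target))      # 0.0: a alone says nothing
print(mutual_information_bits(b, target))      # 0.0: b alone says nothing
print(mutual_information_bits(rows, target))   # 1.0: together they determine the target
```

Each feature scores zero by itself even though the pair determines the target completely, so a univariate MI ranking would place both at the bottom.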

4、 Example - 1985 Automobiles

The Automobile dataset consists of 193 cars from the 1985 model year. The goal for this dataset is to predict a car's price (the target) from 23 of the car's features, such as make, body style, and horsepower. In this example, we will rank the features with mutual information and investigate the results through data visualization.

Automobile Dataset | Kaggle: https://www.kaggle.com/toramky/automobile-dataset

The following code imports some libraries and loads the dataset.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


df = pd.read_csv("../input/fe-course-data/autos.csv")

The scikit-learn algorithm for MI treats discrete features differently from continuous features, so you need to tell it which are which. As a rule of thumb, anything that must have a float dtype is not discrete. Categorical features (object or categorical dtype) can be treated as discrete by giving them a label encoding.

X = df.copy()
y = X.pop("price")

# Label encoding for categoricals
for colname in X.select_dtypes("object"):
    X[colname], _ = X[colname].factorize()

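# All discrete features should now have integer dtypes (double-check this before using MI!)
# is_integer_dtype matches any integer width, unlike comparing against the platform-dependent `int`
discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]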
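To see what the label encoding produces, here is a small sketch on a hypothetical three-row frame. The rows are invented and are not taken from autos.csv; the column names are borrowed from the example only for readability, and the mask is built with `pd.api.types.is_integer_dtype`.

```python
import pandas as pd

# Hypothetical three-row frame, not the real autos.csv.
toy = pd.DataFrame({
    "fuel_type": ["gas", "diesel", "gas"],
    "horsepower": [111, 56, 154],
    "bore": [3.47, 3.01, 3.19],
})

# factorize() returns (integer codes, unique values);
# codes are assigned in order of first appearance.
codes, uniques = toy["fuel_type"].factorize()
print(codes.tolist())   # [0, 1, 0]
print(list(uniques))    # ['gas', 'diesel']

toy["fuel_type"] = codes

# Integer columns are flagged as discrete, float columns as continuous.
discrete_mask = [pd.api.types.is_integer_dtype(t) for t in toy.dtypes]
print(discrete_mask)    # [True, True, False]
```

Note that the integer-valued horsepower column is flagged as discrete along with the encoded category. That is a consequence of the dtype rule of thumb, and it is the kind of case worth double-checking before computing MI.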

Scikit-learn has two mutual information metrics in its feature_selection module: one for real-valued targets (mutual_info_regression) and one for categorical targets (mutual_info_classif). Our target, price, is real-valued. The next cell computes the MI scores for our features and wraps them in a labeled pandas Series.

from sklearn.feature_selection import mutual_info_regression

def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores

mi_scores = make_mi_scores(X, y, discrete_features)
mi_scores[::3]  # show a few features with their MI scores
curb_weight          1.486440
highway_mpg          0.950989
length               0.607955
bore                 0.489772
stroke               0.380041
drive_wheels         0.332973
compression_ratio    0.134799
fuel_type            0.048139
Name: MI Scores, dtype: float64
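For a categorical target, the counterpart `mutual_info_classif` is called the same way. The following is only a sketch on synthetic data: the two features, the class labels, and the seeds are assumptions made up for this illustration. Passing `random_state` makes the scores repeatable, since the estimator involves some randomness.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)                   # arbitrary fixed seed
n = 1000
y = rng.integers(0, 2, size=n)                   # binary class labels
informative = y + rng.normal(scale=0.3, size=n)  # shifts with the class
noise = rng.normal(size=n)                       # unrelated to the class
X = np.column_stack([informative, noise])

# Both columns are continuous, so no feature is marked as discrete.
scores = mutual_info_classif(X, y, discrete_features=False, random_state=0)

for name, score in zip(["informative", "noise"], scores):
    print(f"{name:12s} {score:.3f}")
```

The informative column should score well above the noise column, whose score should be close to zero.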

Here is a bar chart that makes the comparison easier:

def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")

plt.figure(dpi=100, figsize=(8, 5))
plot_mi_scores(mi_scores)  # draw the ranked scores on the figure created above

Data visualization is a good follow-up to a utility ranking. Let's take a closer look at a couple of these features.

As we might expect, the high-scoring curb_weight feature shows a strong relationship with price, the target.

sns.relplot(x="curb_weight", y="price", data=df);

The fuel_type feature has a fairly low MI score, but as the figure shows, it clearly separates two price populations with different trends within the horsepower feature. This indicates that fuel_type contributes an interaction effect and may not be unimportant after all. Before deciding from its MI score that a feature is unimportant, it is best to investigate any possible interaction effects. Domain knowledge can offer a lot of guidance here.

sns.lmplot(x="horsepower", y="price", hue="fuel_type", data=df);

Data visualization is an important addition to your feature engineering toolbox. Along with utility metrics like mutual information, visualizations like these can help you discover important relationships in your data.


