Machine learning notes: mutual information
2022-07-04 22:04:00 【Sit and watch the clouds rise】
1、 Overview
When you encounter a new data set, an important first step is to build a ranking with a feature utility metric: a function that measures the association between a feature and the target. You can then choose a smaller set of the most useful features to develop initially.
The metric we use is called “mutual information”. Mutual information is a lot like correlation in that it measures a relationship between two quantities. The advantage of mutual information is that it can detect any kind of relationship, whereas correlation only detects linear relationships.
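To make the contrast with correlation concrete, here is a minimal sketch using only the standard library. The helper names `pearson` and `discrete_mi` are made up for this illustration, and the MI here is computed directly from empirical frequencies of discrete values, which is simpler than the estimator scikit-learn uses for continuous features.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def discrete_mi(xs, ys):
    """Mutual information (in nats) of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

xs = [-2, -1, 0, 1, 2]
ys = [x ** 2 for x in xs]  # y is fully determined by x, but not linearly

corr = pearson(xs, ys)
mi = discrete_mi(xs, ys)
print(f"correlation = {corr:.3f}, MI = {mi:.3f} nats")
```

The correlation comes out as zero because the parabola is symmetric, while the mutual information is clearly positive: knowing x removes all uncertainty about y.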
Mutual information is a good general-purpose metric, and it is especially useful at the start of feature development, when you may not yet know which model you will use.
Mutual information is easy to use and interpret, computationally efficient, theoretically well-founded, resistant to overfitting, and able to detect any kind of relationship.
2、 Mutual information and what it measures
Mutual information describes relationships in terms of uncertainty. The mutual information (MI) between two quantities is a measure of how much knowledge of one quantity reduces uncertainty about the other. If you knew the value of a feature, how much more confident would you be about the target?
Here is an example from the Ames Housing data. The figure shows the relationship between the exterior quality of a house and the price it sold for. Each point represents a house.

As we can see from the figure, knowing the value of ExterQual should make you more certain about the corresponding SalePrice: each category of ExterQual tends to concentrate SalePrice within a certain range. The mutual information between ExterQual and SalePrice is the average reduction in the uncertainty of SalePrice, taken over the four values of ExterQual. For example, because Fair occurs less often than Typical, Fair gets less weight in the MI score.
What we are calling uncertainty is measured with a quantity from information theory known as “entropy”. The entropy of a variable means roughly: “how many yes-or-no questions you would need, on average, to describe an occurrence of that variable.” The more questions you have to ask, the more uncertain you are about the variable. Mutual information is how many questions you expect the feature to answer about the target.
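The “average reduction in uncertainty” can be sketched as entropy minus conditional entropy. The snippet below is only an illustration: the eight houses are invented toy data (not the Ames data), the grade labels and price bands are hypothetical, and the function names are my own.

```python
import math
from collections import Counter, defaultdict

def entropy_bits(values):
    """Shannon entropy in bits: average number of yes/no questions needed."""
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())

def conditional_entropy_bits(feature, target):
    """H(target | feature): entropy inside each feature group, weighted by group frequency."""
    groups = defaultdict(list)
    for f, t in zip(feature, target):
        groups[f].append(t)
    n = len(target)
    return sum(len(g) / n * entropy_bits(g) for g in groups.values())

# Invented toy data: a quality grade and a coarse price band for 8 houses.
quality = ["Fa", "TA", "TA", "TA", "Gd", "Gd", "Gd", "Ex"]
price = ["low", "low", "low", "mid", "mid", "high", "high", "high"]

h_price = entropy_bits(price)                                     # uncertainty before seeing quality
h_price_given_quality = conditional_entropy_bits(quality, price)  # uncertainty that remains afterwards
mi_bits = h_price - h_price_given_quality                         # questions answered by the feature

print(f"H(price) = {h_price:.3f} bits")
print(f"H(price | quality) = {h_price_given_quality:.3f} bits")
print(f"MI = {mi_bits:.3f} bits")
```

Note how the rare grades ("Fa" and "Ex", one house each) enter the weighted sum with a weight of only 1/8, which mirrors the remark that infrequent categories count for less in the MI score.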
3、 Interpreting mutual information scores
The least possible mutual information between quantities is 0.0. When MI is zero, the quantities are independent: neither can tell you anything about the other. Conversely, in theory there is no upper bound on MI. In practice, though, values above about 2.0 are uncommon. (Mutual information is a logarithmic quantity, so it increases very slowly.)
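Both ends of the scale can be checked on small discrete examples. This is a sketch with a hand-rolled `discrete_mi` helper (a name chosen here, not a library function) that works in nats, the unit I assume the scores in this article are reported in.

```python
import math
from collections import Counter
from itertools import product

def discrete_mi(xs, ys):
    """Mutual information (in nats) of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# Independent case: every (x, y) combination occurs equally often.
pairs = list(product([0, 1, 2], [0, 1]))
mi_independent = discrete_mi([x for x, _ in pairs], [y for _, y in pairs])

# Perfectly dependent case: y equals x, with k equally likely values.
mi_by_k = {k: discrete_mi(list(range(k)), list(range(k))) for k in (2, 4, 8, 16)}

print(f"independent: {mi_independent:.3f}")
for k, value in mi_by_k.items():
    print(f"k = {k:2d} categories, perfectly predicted: MI = {value:.3f} (ln k = {math.log(k):.3f})")
```

Each doubling of the number of perfectly predicted categories adds only ln 2 (about 0.69) to the score, which is why large MI values are rare: a score of 2.0 nats already corresponds to more than seven equally likely outcomes being predicted without error.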
The following figure shows how MI values correspond to the kind and degree of association between a feature and the target.

Here are some things to remember when applying mutual information:
MI can help you understand the relative potential of a feature as a predictor of the target, considered by itself.
A feature may be very informative when it interacts with other features, yet not very informative on its own. MI cannot detect interactions between features. It is a univariate metric.
The actual usefulness of a feature depends on the model you use it with. A feature is only useful to the extent that its relationship with the target is one your model can learn. A high MI score does not guarantee that your model will be able to do anything with that information. You may need to transform the feature first to expose the association.
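The univariate limitation is easiest to see on an extreme, artificial case. In this sketch the target is the XOR of two binary features (an example chosen purely for illustration, with a hand-written `discrete_mi` helper): each feature alone has zero MI with the target, while the pair together determines it completely.

```python
import math
from collections import Counter
from itertools import product

def discrete_mi(xs, ys):
    """Mutual information (in nats) of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

rows = list(product([0, 1], repeat=2))   # all four combinations of two binary features
a = [r[0] for r in rows]
b = [r[1] for r in rows]
target = [x ^ y for x, y in rows]        # target depends only on the interaction

mi_a = discrete_mi(a, target)            # feature a considered alone
mi_b = discrete_mi(b, target)            # feature b considered alone
mi_joint = discrete_mi(rows, target)     # both features treated as one combined value

print(f"MI(a; target) = {mi_a:.3f}")
print(f"MI(b; target) = {mi_b:.3f}")
print(f"MI((a, b); target) = {mi_joint:.3f}")
```

A ranking based on per-feature MI would place both features at the bottom here, even though together they predict the target perfectly.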
4、 Example: 1985 automobiles
The Automobile data set consists of 193 cars from the 1985 model year. The goal for this data set is to predict a car's price (the target) from 23 of its features, such as make, body style, and horsepower. In this example, we will rank the features with mutual information and investigate the results through data visualization.
Automobile Dataset | Kaggle: https://www.kaggle.com/toramky/automobile-dataset
The following code imports some libraries and loads the data set.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
plt.style.use("seaborn-whitegrid")  # on matplotlib >= 3.6 this style is named "seaborn-v0_8-whitegrid"
df = pd.read_csv("../input/fe-course-data/autos.csv")
df.head()
The scikit-learn algorithm for MI treats discrete features differently from continuous features, so you need to tell it which are which. As a rule of thumb, anything that must have a float dtype is not discrete. Categoricals (object or categorical dtype) can be treated as discrete by giving them a label encoding.
X = df.copy()
y = X.pop("price")
# Label encoding for categoricals
for colname in X.select_dtypes("object"):
    X[colname], _ = X[colname].factorize()
# All discrete features should now have integer dtypes (double-check this before using MI!)
discrete_features = X.dtypes == int
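The comparison `X.dtypes == int` relies on the platform's default integer width matching the dtype of the encoded columns, which may not hold everywhere. A more defensive variant could look like the sketch below; the small frame `X_demo` and its values are invented stand-ins for the car data, and the variable names are my own.

```python
import pandas as pd

# Invented stand-in for a few columns of the car data.
X_demo = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "volvo"],
    "num_of_doors": [2, 4, 4, 4],
    "curb_weight": [2337.0, 2395.5, 2844.0, 3062.0],
})

# Label-encode every non-numeric column (object, string, or categorical dtype).
for colname in X_demo.columns:
    if not pd.api.types.is_numeric_dtype(X_demo[colname]):
        X_demo[colname], _ = X_demo[colname].factorize()

# Flag a column as discrete when its dtype is any integer type, whatever its width.
discrete_mask = [pd.api.types.is_integer_dtype(dtype) for dtype in X_demo.dtypes]
print(dict(zip(X_demo.columns, discrete_mask)))
```

The resulting list of booleans can be passed as `discrete_features` in the same way as the mask built above. It is still worth printing the mask and checking it by eye before computing MI.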
Scikit-learn has two mutual information metrics in its feature_selection module: one for real-valued targets (mutual_info_regression) and one for categorical targets (mutual_info_classif). Our target, price, is real-valued. The next cell computes the MI scores for our features and wraps them in a labeled pandas Series.
from sklearn.feature_selection import mutual_info_regression
def make_mi_scores(X, y, discrete_features):
    mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features)
    mi_scores = pd.Series(mi_scores, name="MI Scores", index=X.columns)
    mi_scores = mi_scores.sort_values(ascending=False)
    return mi_scores
mi_scores = make_mi_scores(X, y, discrete_features)
mi_scores[::3] # show a few features with their MI scores
curb_weight 1.486440
highway_mpg 0.950989
length 0.607955
bore 0.489772
stroke 0.380041
drive_wheels 0.332973
compression_ratio 0.134799
fuel_type 0.048139
Name: MI Scores, dtype: float64
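For comparison, here is a small sketch of both scikit-learn functions on synthetic data, since the car example only exercises mutual_info_regression. Everything about the data is an assumption made for illustration: one column drives the target and the other is pure noise. The estimator for continuous features is randomized, so `random_state` is fixed to make runs repeatable.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

rng = np.random.default_rng(0)
n = 500
informative = rng.uniform(-1, 1, size=n)   # drives both targets below
noise = rng.uniform(-1, 1, size=n)         # unrelated to either target
X_syn = np.column_stack([informative, noise])

y_reg = informative ** 2 + rng.normal(scale=0.05, size=n)  # real-valued target
y_clf = (informative > 0).astype(int)                       # categorical target

# Both columns are continuous, so no feature is flagged as discrete.
mi_reg = mutual_info_regression(X_syn, y_reg, discrete_features=False, random_state=0)
mi_clf = mutual_info_classif(X_syn, y_clf, discrete_features=False, random_state=0)

print("regression target:    ", dict(zip(["informative", "noise"], mi_reg.round(3))))
print("classification target:", dict(zip(["informative", "noise"], mi_clf.round(3))))
```

In both cases the informative column should score well above the noise column, even though its relationship with the regression target is not linear.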
And now a bar chart to make the comparison easier:
def plot_mi_scores(scores):
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")
plt.figure(dpi=100, figsize=(8, 5))
plot_mi_scores(mi_scores)
Data visualization is a good follow-up to a utility ranking. Let's take a closer look at a couple of these features.
As we might expect, the high-scoring curb_weight feature has a strong relationship with price, the target.
sns.relplot(x="curb_weight", y="price", data=df);
The fuel_type feature has a fairly low MI score, but as we can see from the figure, it clearly separates two price populations with different trends within the horsepower feature. This indicates that fuel_type contributes an interaction effect and may not be unimportant after all. Before deciding from its MI score that a feature is unimportant, it is a good idea to investigate any possible interaction effects; domain knowledge can offer a lot of guidance here.
sns.lmplot(x="horsepower", y="price", hue="fuel_type", data=df);
Data visualization is an important addition to the feature engineering toolbox. Along with utility metrics such as mutual information, visualizations like these can help you discover important relationships in your data.