当前位置：网站首页>Data processing and visualization of machine learning [iris data classification | feature attribute comparison]

Data processing and visualization of machine learning [iris data classification | feature attribute comparison]

2022-06-11 23:03:00 【Hua Weiyun】

One , Preface

1.1 This paper is based on the principle

Most machine learning models deal with features , The feature is usually the numerical representation of the input variable that can be used for the model .
In most cases , The collected data needs to be processed before it can be used by the algorithm . Usually , There are many different characteristics in a dataset , Some of them may be redundant or irrelevant to the value we want to predict , It can be filtered through data processing and visualization .
The necessity of feature selection technology is also reflected in the simplified model 、 Reduce training time 、 Avoid dimension explosion and promote generalization to avoid over fitting .

1.2 Purpose

1. Familiar with data processing and visualization methods of machine learning
2. Use data processing and visualization methods to analyze data characteristics

1.3 Objectives and contents

1. install scikit-learn Machine learning and its related python package ;
2. Download the iris data set in the program ;
3. Use matplotlib Compare and draw the characteristics of iris data set ;
4. Analyze the characteristics of the drawn iris visual map to clearly distinguish the categories of iris ;

1.4 This paper is based on the environment

1.PC machine
2.Windows10
3.Scikit-learn Installation package
4.jupyter Editor or pycharm etc. python Editor

Two , Experimental process

2.1 install scikit-learn Machine learning related modules

The installation process is a little bit , Direct installation scikit-learn modular , Domestic image installation can be adopted , It saves time .
Input

pip show scikit-learn

Check whether the local environment is successfully installed 【scikit-learn】 This module .

2.2 Download the iris data set in the program

We use load_iris Data sets , In total, including 150 rows , The first four columns are calyx length , Calyx width , Petal length , Petal width 4 An attribute that identifies iris ,‘sepal_len’,‘sepal_wid’,‘petal_len’,‘petal_wid’.
The first 5 In the category of iris （ Include Setosa,Versicolour,Virginica Three types of ）
The code is as follows

1.import matplotlib.pyplot as plt2.from sklearn.datasets import load_iris3.iris = load_iris()4.X = iris.data5.print(X.shape, X)

We output X Take a look at this 150 Group data ：

2.3 Use matplotlib Compare and draw the characteristics of iris data set

Because we will use figure Method , Let's define the size first , Give Way 16 Subgraphs can be output appropriately . The following code ：

plt.figure(figsize=(44,44))

We need output 16 Subtext , Set the variable to 4, Traverse twice .

feature_max_num=4

Traverse twice , as follows ：

for feature in range(feature_max_num):    for feature_other in range(feature_max_num):

You can imagine ：
Namely 0-0,0-1,0-2,0-3,1-0,1-1……
Yes 16 Combinations of , It is also necessary to take the characteristic value .

We need to set the position of each subgraph , You can draw these subgraphs in turn , The advantage is simplicity , The disadvantage is that it is a little troublesome .
The following code ：

plt.subplot(feature_max_num,feature_max_num,feature*feature_max_num+feature_other +1,frame_on= True)

We need to think about , If 0-0,1-1,2-2, This is a special case , Let's deal with it separately .
plt.scatter We need to understand the properties of ： as follows

matplotlib.pyplot.scatter(x, y, s=None, c=None, marker=None, cmap=None, norm=None, vmin=None,
vmax=None, alpha=None, linewidths=None, verts=None, edgecolors=None, hold=None, data=None, **kwargs)

x, y → The coordinates of the scatter point
s → The area of the scatter
c → Scattered color （ The default is blue ,‘b’, Other colors are the same as plt.plot( )）
marker→ Scatter style （ The default value is filled circle ,‘o’, Other styles are the same as plt.plot( )） alpha → Scatter transparency （[0,1] Number between ,0 Indicates full transparency ,1 Is completely opaque ）
linewidths → Edge lineweight of scatter points edgecolors → The edge color of the scatter

if feature==feature_other:  # A special case

If ,feature==feature_other, If the traversal values are the same ,x, y → The coordinates of the scatter points are the same , It's not very intuitive , Let's go straight to x The coordinates of the scatter point set a self increasing variable , Let it come from 0 To 49 Self increasing .

 plt.scatter([i for i in range(50)],X[0:50,feature],color='green',marker='o',label='setosa') ... ...

In other cases ：x, y → The coordinates of the scatter points are different , You can draw normally

else:	plt.scatter(X[0:50,feature],X[0:50,feature_other],color='green',marker='o',label='setosa')	...	...

Above code explanation ：

X[0:50,feature],X[0:50,feature_other]

Represent the x, y → The coordinates of the scatter point , Because we have 150 Group target data , We get the target data set from the data set according to different characteristic values . Perform drawing processing .
Need to understand grammar ：
a[:,1] The meaning of , You can understand .

Now we need to set X Axis and Y The label of the shaft . The grammar is as follows ：

xlabel(xlabel, fontdict=None, labelpad=None, *, loc=None, **kwargs)

xlabel： Type is string , The text of the label .
fontdict： dict, A dictionary is used to control the font style of labels
labelpad： The type is floating point number , The default value is None, That is, the distance between the label and the coordinate axis .
loc： The value range is {‘left’, ‘center’, ‘right’}, The default value is rcParams[“xaxis.labellocation”]（‘center’）, The location of the label .
**kwargs：Text Object key attribute , Used to control the appearance properties of text , Like typeface 、 Text color, etc .

 plt.xlabel(iris.feature_names[feature]) plt.vlabel(iris.feature_names[feature_other])

Finally, set the legend position , Output image .

		plt.legend(loc='best')plt.show()

The renderings are as follows :

2.4 Analyze the characteristics of the drawn iris visual map to clearly distinguish the categories of iris

According to the figure 0-2 ,1-3 distinct .
The length of sepals and petals can be seen , The characteristics of sepal width and petal width can clearly distinguish Iris species .

3、 ... and , The source code involved in this article is attached

The source code involved in this paper is as follows , It can run directly ：

import matplotlib.pyplot as pltfrom sklearn.datasets import load_irisiris = load_iris()X = iris.dataprint(X.shape, X)plt.figure(figsize=(44,44))feature_max_num=4for feature in range(feature_max_num):    for feature_other in range(feature_max_num):        plt.subplot(feature_max_num,feature_max_num,feature*feature_max_num+feature_other +1,frame_on= True)        if feature==feature_other:            plt.scatter([i for i in range(50)],X[0:50,feature],color='green',marker='o',label='setosa')            plt.scatter([i for i in range(50)],X[50:100,feature],color='blue',marker='x',label='versicolor')            plt.scatter([i for i in range(50)],X[100:,feature],color='red',marker='+',label='Virginica')        else:            plt.scatter(X[0:50,feature],X[0:50,feature_other],color='green',marker='o',label='setosa')            plt.scatter(X[50:100,feature],X[50:100,feature_other],color='blue',marker='x',label='versicolor')            plt.scatter(X[100:,feature],X[100:,feature_other],color="red",marker='+',label='Virginica')        plt.xlabel(iris.feature_names[feature])        plt.vlabel(iris.feature_names[feature_other])        plt.legend(loc='best')plt.show()