Data processing and visualization of machine learning [iris data classification | feature attribute comparison]
2022-06-12 10:07:00 【Shangjin vegetable pig】
One, Preface
1.1 Underlying principle
Most machine learning models operate on features. A feature is usually a numerical representation of an input variable that the model can use.
In most cases, collected data must be processed before an algorithm can use it. A dataset usually contains many different features, some of which may be redundant or irrelevant to the value we want to predict. Data processing and visualization help to filter these out.
Feature selection is also valuable because it simplifies the model, shortens training time, avoids the curse of dimensionality, and improves generalization by reducing overfitting.
1.2 Purpose
1. Become familiar with data processing and visualization methods for machine learning
2. Use data processing and visualization methods to analyze the characteristics of the data
1.3 Objectives and contents
1. Install scikit-learn and the related Python packages
2. Load the iris dataset in the program
3. Use matplotlib to plot the features of the iris dataset against each other
4. Analyze the resulting plots to find the features that clearly distinguish the iris classes
1.4 Environment
1. A PC
2. Windows 10
3. The scikit-learn package
4. Jupyter, PyCharm, or another Python editor
Two, Experimental process
2.1 Install scikit-learn and the related modules
The installation steps are omitted here: install the scikit-learn module directly with pip. Using a domestic mirror can save time.
Enter
pip show scikit-learn
to check whether the scikit-learn module has been installed successfully in the local environment.
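The same check can also be done from inside Python. The following is a minimal sketch using the standard library's importlib.metadata; the helper name installed_version is made up for this illustration and is not part of any library.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Report the packages this article relies on.
for name in ("scikit-learn", "matplotlib", "numpy"):
    version = installed_version(name)
    print(name, version if version is not None else "not installed")
```

If any line reports "not installed", install that package with pip before continuing.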
2.2 Load the iris dataset in the program
We use the dataset returned by load_iris. It contains 150 rows, and its four columns are sepal length, sepal width, petal length, and petal width, the 4 attributes used to identify an iris ('sepal_len', 'sepal_wid', 'petal_len', 'petal_wid').
The class of each iris (Setosa, Versicolour, or Virginica) acts as a fifth column; load_iris stores it separately in iris.target.
The code is as follows:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
print(X.shape, X)
Printing X shows its shape, (150, 4), followed by the 150 rows of data.
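To see how the loaded object is organized, a short sketch such as the following may help. It only reads attributes of the object returned by load_iris; the variable names y, class_index, and rows are chosen here for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data          # feature matrix, shape (150, 4)
y = iris.target        # integer class labels 0, 1, 2, shape (150,)

print(iris.feature_names)   # names of the four measured features
print(iris.target_names)    # names of the three classes
print(X.shape, y.shape)

# The samples are stored grouped by class: rows 0-49, 50-99 and 100-149.
for class_index, class_name in enumerate(iris.target_names):
    rows = (y == class_index).nonzero()[0]
    print(class_name, "rows", rows.min(), "to", rows.max())
```

Because the rows are grouped by class, slices such as X[0:50] and X[50:100] pick out one class at a time, which is what the plotting code in this article relies on.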
2.3 Use matplotlib to plot the features of the iris dataset against each other
Because we will use the figure method, we first define the figure size so that 16 subplots can be displayed comfortably. The code is as follows:
plt.figure(figsize=(44,44))
We need 16 subplots, so we set a variable to 4 and loop over it twice.
feature_max_num=4
The two nested loops are as follows:
for feature in range(feature_max_num):
    for feature_other in range(feature_max_num):
You can picture the feature pairs as
0-0, 0-1, 0-2, 0-3, 1-0, 1-1, ...
There are 16 combinations, and for each one the corresponding feature values must be taken from the data.
We set the position of each subplot explicitly and draw the subplots one after another. This approach is simple, but somewhat tedious.
The code is as follows:
plt.subplot(feature_max_num,feature_max_num,feature*feature_max_num+feature_other +1,frame_on= True)
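The third argument of plt.subplot is a 1-based position that counts across each row from left to right and then moves down to the next row. The sketch below reproduces only the index arithmetic from the call above, without any plotting; the list name positions is made up for this illustration.

```python
feature_max_num = 4

# Collect (feature, feature_other, subplot index) for every feature pair.
positions = []
for feature in range(feature_max_num):
    for feature_other in range(feature_max_num):
        # Row number times row length, plus column number, plus 1 (1-based).
        index = feature * feature_max_num + feature_other + 1
        positions.append((feature, feature_other, index))

for feature, feature_other, index in positions:
    print(f"pair {feature}-{feature_other} -> subplot {index}")
```

The 16 pairs map to positions 1 through 16, so the pair 0-0 lands in the top-left subplot and 3-3 in the bottom-right one.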
The pairs 0-0, 1-1, 2-2, and 3-3 are a special case, which we handle separately.
First, we need to understand the parameters of plt.scatter:
matplotlib.pyplot.scatter(x, y, s=None, c=None, marker=None, cmap=None, norm=None, vmin=None,
vmax=None, alpha=None, linewidths=None, *, edgecolors=None, plotnonfinite=False, data=None, **kwargs)
(Older matplotlib releases also accepted verts and hold; both have since been removed.)
- x, y → the coordinates of the scatter points
- s → the area of each scatter point
- c → the color of the scatter points (by default taken from the current color cycle, whose first color is blue; color names work the same way as in plt.plot())
- marker → the marker style (the default is a filled circle, 'o'; other styles are the same as in plt.plot())
- alpha → the transparency of the scatter points (a number in [0, 1]; 0 is fully transparent, 1 is fully opaque)
- linewidths → the line width of the marker edges
- edgecolors → the color of the marker edges
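The parameters listed above can be tried out on a handful of points. This is a minimal sketch with made-up coordinates; it assumes an off-screen backend ("Agg") so that it runs without a display, which is a choice made here rather than something the article requires.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is needed
import matplotlib.pyplot as plt

# Made-up sample coordinates, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.5, 3.5, 3.0]

fig, ax = plt.subplots(figsize=(4, 3))
points = ax.scatter(
    x, y,
    s=80,                 # marker area, in points squared
    c="green",            # fill color
    marker="o",           # filled circle
    alpha=0.6,            # 0 is fully transparent, 1 is fully opaque
    linewidths=1.5,       # width of the marker edge
    edgecolors="black",   # color of the marker edge
    label="demo",
)
ax.legend(loc="best")
fig.canvas.draw()         # render the figure in memory
```

In an interactive session, dropping the matplotlib.use("Agg") line and calling plt.show() displays the figure instead.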
if feature==feature_other: # special case
When feature == feature_other, both axes would show the same feature and every point would fall on the diagonal, which is not informative. Instead, we use an increasing index from 0 to 49 as the x coordinate.
    plt.scatter([i for i in range(50)],X[0:50,feature],color='green',marker='o',label='setosa')
...
...
In all other cases the x and y coordinates come from different features, so the points can be plotted normally:
else:
    plt.scatter(X[0:50,feature],X[0:50,feature_other],color='green',marker='o',label='setosa')
...
...
Explanation of the code above:
X[0:50,feature],X[0:50,feature_other]
These are the x and y coordinates of the scatter points. The dataset holds 150 samples, 50 per class in consecutive rows, so the row slice 0:50 selects the setosa samples and the column index selects the feature to plot.
The slicing syntax is worth understanding:
a[:,1] selects every row of column 1 of the array a.
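The slicing rule can be checked on a small array. The values below are made up and simply stand in for a tiny feature matrix.

```python
import numpy as np

# A made-up 4 x 3 array standing in for a small feature matrix.
a = np.array([
    [1, 10, 100],
    [2, 20, 200],
    [3, 30, 300],
    [4, 40, 400],
])

print(a[:, 1])      # every row, column 1         -> [10 20 30 40]
print(a[0:2, 1])    # rows 0 and 1, column 1      -> [10 20]
print(a[2:, 0])     # rows 2 to the end, column 0 -> [3 4]
```

X[0:50, feature] follows the same pattern: the part before the comma selects rows, the part after it selects the column.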
Next we set the labels of the X axis and the Y axis. The signature is as follows:
xlabel(xlabel, fontdict=None, labelpad=None, *, loc=None, **kwargs)
- xlabel: a string, the text of the label.
- fontdict: a dict that controls the font style of the label.
- labelpad: a float, the spacing between the label and the axis. It defaults to None, in which case rcParams["axes.labelpad"] is used.
- loc: the position of the label, one of {'left', 'center', 'right'}. The default is rcParams["xaxis.labellocation"] ('center').
- **kwargs: Text properties that control the appearance of the label, such as typeface and text color.
plt.xlabel(iris.feature_names[feature])
plt.ylabel(iris.feature_names[feature_other])
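The optional arguments of the label functions can be sketched as follows. The font settings and padding are arbitrary example values, the loc argument assumes matplotlib 3.3 or newer, and the "Agg" backend is again an assumption made so that the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is needed
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter([5.1, 4.9, 4.7], [1.4, 1.4, 1.3])  # a few sample values

# fontdict sets Text properties, labelpad the spacing in points, loc the alignment.
plt.xlabel(
    "sepal length (cm)",
    fontdict={"fontsize": 12, "color": "gray"},
    labelpad=8.0,
    loc="right",
)
plt.ylabel("petal length (cm)", labelpad=8.0)
fig.canvas.draw()
```

plt.xlabel and plt.ylabel act on the current Axes, which is why the article's loop calls them right after plt.subplot selects a subplot.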
Finally, set the legend position and display the figure.
plt.legend(loc='best')
plt.show()
The result is a 4 x 4 grid of scatter plots, one for each pair of features.
2.4 Analyze the plots to find the features that clearly distinguish the iris classes
In the figure, the subplots 0-2 and 1-3 show the clearest separation.
In other words, sepal length plotted against petal length, and sepal width plotted against petal width, clearly distinguish the iris species.
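The visual impression can be cross-checked numerically. The sketch below is not part of the original experiment; it prints the minimum and maximum of every feature for each class, using iris.target to select the rows. The helper name class_range is made up for this illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target


def class_range(feature_index, class_index):
    """Return (min, max) of one feature over the samples of one class."""
    values = X[y == class_index, feature_index]
    return values.min(), values.max()


for feature_index, feature_name in enumerate(iris.feature_names):
    print(feature_name)
    for class_index, class_name in enumerate(iris.target_names):
        low, high = class_range(feature_index, class_index)
        print(f"  {class_name:<10} {low:.1f} to {high:.1f}")
```

A feature whose per-class ranges do not overlap separates those classes on its own. In this dataset the petal measurements of setosa sit entirely below those of the other two classes, while the sepal ranges overlap.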
Three, Source code
The complete source code used in this article is as follows and can be run directly:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
print(X.shape, X)

plt.figure(figsize=(44,44))
feature_max_num=4
for feature in range(feature_max_num):
    for feature_other in range(feature_max_num):
        # Select the subplot for this feature pair (positions are 1-based).
        plt.subplot(feature_max_num,feature_max_num,feature*feature_max_num+feature_other +1,frame_on= True)
        if feature==feature_other:
            # Same feature on both axes: plot it against the sample index 0..49.
            plt.scatter([i for i in range(50)],X[0:50,feature],color='green',marker='o',label='setosa')
            plt.scatter([i for i in range(50)],X[50:100,feature],color='blue',marker='x',label='versicolor')
            plt.scatter([i for i in range(50)],X[100:,feature],color='red',marker='+',label='Virginica')
        else:
            # Different features: plot one against the other, one call per class.
            plt.scatter(X[0:50,feature],X[0:50,feature_other],color='green',marker='o',label='setosa')
            plt.scatter(X[50:100,feature],X[50:100,feature_other],color='blue',marker='x',label='versicolor')
            plt.scatter(X[100:,feature],X[100:,feature_other],color="red",marker='+',label='Virginica')
        plt.xlabel(iris.feature_names[feature])
        plt.ylabel(iris.feature_names[feature_other])
        plt.legend(loc='best')
plt.show()
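As a possible variation, the three scatter calls per branch can be folded into a loop over the classes. The sketch below is an alternative arrangement rather than the article's own code: it assumes the "Agg" backend so that it runs without a display, uses plt.subplots with a smaller figure size, selects rows through iris.target instead of fixed slices, and labels the diagonal x axis as a sample index. The names styles, mask, and xs are chosen here for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is needed
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
n_features = X.shape[1]

# One (color, marker) style per class, in the order of iris.target_names.
styles = [("green", "o"), ("blue", "x"), ("red", "+")]

fig, axes = plt.subplots(n_features, n_features, figsize=(16, 16))
for feature in range(n_features):
    for feature_other in range(n_features):
        ax = axes[feature, feature_other]
        for class_index, (color, marker) in enumerate(styles):
            mask = y == class_index            # rows belonging to this class
            if feature == feature_other:
                xs = np.arange(mask.sum())     # sample index 0..49 on the diagonal
            else:
                xs = X[mask, feature]
            ax.scatter(xs, X[mask, feature_other], color=color, marker=marker,
                       label=iris.target_names[class_index])
        if feature == feature_other:
            ax.set_xlabel("sample index within class")
        else:
            ax.set_xlabel(iris.feature_names[feature])
        ax.set_ylabel(iris.feature_names[feature_other])
        ax.legend(loc="best")
fig.canvas.draw()  # render in memory; use fig.savefig(...) or plt.show() as needed
```

Selecting rows with a boolean mask avoids hard-coding the 0:50, 50:100, and 100: boundaries, at the cost of one extra inner loop.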