Data processing and visualization of machine learning [iris data classification | feature attribute comparison]
2022-06-12 10:07:00 【Shangjin vegetable pig】
1. Preface
1.1 Underlying principle
Most machine learning models operate on features. A feature is usually a numerical representation of an input variable that the model can use.
In most cases, collected data must be processed before an algorithm can use it. A dataset usually contains many different features, some of which may be redundant or irrelevant to the value we want to predict. Data processing and visualization help to filter these out.
Feature selection is also needed to simplify the model, reduce training time, avoid the curse of dimensionality, and improve generalization by reducing overfitting.
1.2 Purpose
1. Become familiar with data processing and visualization methods for machine learning.
2. Use these methods to analyze the features of a dataset.
1.3 Objectives and contents
1. Install scikit-learn and its related Python packages.
2. Load the iris dataset in the program.
3. Use matplotlib to plot pairwise comparisons of the iris features.
4. Analyze the resulting plots to find the features that clearly distinguish the iris classes.
1.4 Environment
1. PC
2. Windows 10
3. scikit-learn package
4. Jupyter, PyCharm, or another Python editor
2. Experimental process
2.1 Install scikit-learn and related modules
The installation steps are omitted here: install the scikit-learn module directly with pip. Using a domestic PyPI mirror can save download time.
Run
pip show scikit-learn
to check whether the scikit-learn module is installed in the local environment.
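The same check can also be made from inside Python. The following is a minimal sketch that uses only the standard library; the helper name installed_version is hypothetical and is not part of scikit-learn.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("scikit-learn")
    print("scikit-learn:", found if found is not None else "not installed")
```

Returning None instead of raising keeps the check easy to use at the top of a notebook before any plotting code runs.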
2.2 Load the iris dataset in the program
We use the load_iris dataset, which contains 150 rows. The four feature columns are sepal length, sepal width, petal length, and petal width ('sepal_len', 'sepal_wid', 'petal_len', 'petal_wid'), the four attributes used to identify an iris.
The class of each sample (Setosa, Versicolour, or Virginica) is the fifth column in the classic tabular form of the data. With load_iris it is stored separately in iris.target, not in iris.data.
The code is as follows:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
print(X.shape, X)
Printing X shows its shape, (150, 4), followed by the 150 rows of data.
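Before plotting, it can help to look at the metadata that comes with the dataset. The sketch below assumes scikit-learn and NumPy are installed; the variable names y and counts are illustrative only. Note that the feature names stored with the dataset are spelled out in iris.feature_names and differ from the short names used above.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)                 # shape of the feature matrix
print(iris.feature_names)      # names of the four feature columns
print(iris.target_names)       # class names, indexed by the values in y
counts = np.bincount(y)        # number of samples for each class label
print(counts)
```

The samples are ordered by class in three blocks of 50, which is what makes the fixed row ranges 0:50, 50:100, and 100: in the plotting code work.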
2.3 Use matplotlib to compare and plot the features of the iris dataset
Because we draw with plt.figure, we first set the figure size so that all 16 subplots are displayed properly:
plt.figure(figsize=(44,44))
We need 16 subplots, so we set a variable to 4 and loop over it in two nested loops.
feature_max_num=4
The nested loops are:
for feature in range(feature_max_num):
    for feature_other in range(feature_max_num):
This yields the pairs
0-0, 0-1, 0-2, 0-3, 1-0, 1-1, ...
There are 16 combinations, and each index in a pair selects the corresponding feature column.
We set the position of each subplot explicitly so that the subplots are drawn in turn. This approach is simple, but somewhat verbose.
The code is:
plt.subplot(feature_max_num,feature_max_num,feature*feature_max_num+feature_other +1,frame_on= True)
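To see how the third argument of plt.subplot places each pair on the grid, the index arithmetic can be checked on its own, without any plotting. This is a small illustrative sketch; the dictionary name positions is an assumption made for the example.

```python
feature_max_num = 4

# Map every (feature, feature_other) pair to its 1-based subplot index,
# using the same row-major formula as the plt.subplot call.
positions = {}
for feature in range(feature_max_num):
    for feature_other in range(feature_max_num):
        index = feature * feature_max_num + feature_other + 1
        positions[(feature, feature_other)] = index

for pair, index in positions.items():
    print(pair, "->", index)
```

The first index of a pair therefore selects the row of the grid and the second selects the column.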
The pairs 0-0, 1-1, 2-2, and 3-3 are a special case, so we handle them separately.
To use plt.scatter we need to know its parameters:
matplotlib.pyplot.scatter(x, y, s=None, c=None, marker=None, cmap=None, norm=None, vmin=None,
vmax=None, alpha=None, linewidths=None, verts=None, edgecolors=None, hold=None, data=None, **kwargs)
(This signature comes from an older matplotlib release; verts and hold have since been removed.)
- x, y → the coordinates of the points
- s → the area of each marker
- c → the marker color (blue, 'b', by default in older releases; current releases take the next color from the color cycle; colors are specified as in plt.plot())
- marker → the marker style (a filled circle, 'o', by default; other styles are specified as in plt.plot())
- alpha → the marker transparency (a number in [0, 1]; 0 is fully transparent, 1 is fully opaque)
- linewidths → the line width of the marker edges
- edgecolors → the color of the marker edges
if feature==feature_other: # special case
When feature == feature_other, the x and y coordinates of every point would be identical, which is not very informative. Instead, we use a running index from 0 to 49 as the x coordinate.
    plt.scatter([i for i in range(50)],X[0:50,feature],color='green',marker='o',label='setosa')
...
...
Otherwise the x and y coordinates come from different features and can be plotted directly:
else:
    plt.scatter(X[0:50,feature],X[0:50,feature_other],color='green',marker='o',label='setosa')
...
...
Explanation of the code above:
X[0:50,feature],X[0:50,feature_other]
These are the x and y coordinates of the points. The dataset holds 150 samples, ordered by class in blocks of 50, so the row range selects one class (0:50 is setosa) and the column index selects the feature to plot.
The slicing syntax used here:
a[:,1] selects every row of column 1 of a two-dimensional array a.
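The slicing rule can be tried on a small array first. The sketch below assumes NumPy is installed and uses a made-up 6 × 3 array in place of the iris data.

```python
import numpy as np

# A small stand-in array: 6 rows (samples) and 3 columns (features).
a = np.arange(18).reshape(6, 3)

col = a[:, 1]          # every row, column 1
block = a[0:2, 1]      # rows 0 and 1 only, column 1
rest = a[4:, 2]        # rows 4 to the end, column 2

print(col)
print(block)
print(rest)
```

X[0:50, feature] in the plotting code follows the same pattern: a row range before the comma and a single column index after it.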
Next we set the labels of the x axis and the y axis. The syntax is:
xlabel(xlabel, fontdict=None, labelpad=None, *, loc=None, **kwargs)
- xlabel: str, the text of the label.
- fontdict: dict, controls the font style of the label.
- labelpad: float, default None; the spacing between the label and the axis.
- loc: one of {'left', 'center', 'right'}, default rcParams["xaxis.labellocation"] ('center'); the position of the label.
- **kwargs: Text properties that control the appearance of the label, such as font and text color.
plt.xlabel(iris.feature_names[feature])
plt.ylabel(iris.feature_names[feature_other])
Finally, set the legend position and display the figure.
plt.legend(loc='best')
plt.show()
The result is a 4 × 4 grid of scatter plots, one for each pair of features.
2.4 Analyze the plots to find the features that clearly distinguish the iris classes
In the figure, subplots 0-2 and 1-3 show the clearest separation.
That is, sepal length plotted against petal length, and sepal width plotted against petal width, clearly distinguish the iris species.
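A visual impression of this kind can be cross-checked numerically by comparing the range of each feature within each class: if the ranges of two classes do not overlap for some feature, that feature alone already separates them. The following is a supplementary sketch, not part of the original experiment; it assumes scikit-learn is installed, and the dictionary name ranges is illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# For each feature, record and print the value range within each class.
ranges = {}
for col, feature_name in enumerate(iris.feature_names):
    for label, class_name in enumerate(iris.target_names):
        values = X[y == label, col]
        ranges[(feature_name, str(class_name))] = (values.min(), values.max())
        print(f"{feature_name:18s} {str(class_name):11s} "
              f"min={values.min():.1f} max={values.max():.1f}")
```

Range overlap is a coarse criterion, since two classes can overlap on each single feature and still be separable on a pair of features, which is what the scatter plots show.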
3. Source code
The complete source code used in this article is given below and can be run directly:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the data: 150 samples, 4 features, ordered by class in blocks of 50.
iris = load_iris()
X = iris.data
print(X.shape, X)

plt.figure(figsize=(44,44))
feature_max_num=4
for feature in range(feature_max_num):
    for feature_other in range(feature_max_num):
        # Row `feature`, column `feature_other` of the 4 x 4 grid.
        plt.subplot(feature_max_num,feature_max_num,feature*feature_max_num+feature_other +1,frame_on= True)
        if feature==feature_other:
            # Special case: plot the feature against a running index 0..49.
            plt.scatter([i for i in range(50)],X[0:50,feature],color='green',marker='o',label='setosa')
            plt.scatter([i for i in range(50)],X[50:100,feature],color='blue',marker='x',label='versicolor')
            plt.scatter([i for i in range(50)],X[100:,feature],color='red',marker='+',label='Virginica')
        else:
            # General case: plot one feature against another.
            plt.scatter(X[0:50,feature],X[0:50,feature_other],color='green',marker='o',label='setosa')
            plt.scatter(X[50:100,feature],X[50:100,feature_other],color='blue',marker='x',label='versicolor')
            plt.scatter(X[100:,feature],X[100:,feature_other],color="red",marker='+',label='Virginica')
        # Label the axes and add a legend to the current subplot.
        plt.xlabel(iris.feature_names[feature])
        plt.ylabel(iris.feature_names[feature_other])
        plt.legend(loc='best')
plt.show()
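The three repeated scatter calls can also be written as a loop over the classes. The following is one possible refactoring, not the author's version: it swaps in matplotlib's object-oriented interface (plt.subplots) and boolean masks on the labels instead of fixed row ranges. The function name plot_feature_grid, the STYLES table, the size parameter, and the off-screen Agg backend are all assumptions made for this sketch.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Colors and markers per class, in the same order as the script above.
STYLES = [("green", "o"), ("blue", "x"), ("red", "+")]


def plot_feature_grid(X, y, feature_names, target_names, size=4.0):
    """Draw an n x n grid of scatter plots, one subplot per feature pair."""
    X, y = np.asarray(X), np.asarray(y)
    n = X.shape[1]
    fig, axes = plt.subplots(n, n, figsize=(size * n, size * n), squeeze=False)
    for feature in range(n):
        for feature_other in range(n):
            ax = axes[feature, feature_other]
            for label, name in enumerate(target_names):
                rows = X[y == label]  # all samples of this class
                color, marker = STYLES[label % len(STYLES)]
                if feature == feature_other:
                    # Diagonal: plot the feature against a running sample index.
                    xs = np.arange(len(rows))
                else:
                    xs = rows[:, feature]
                ys = rows[:, feature_other]
                ax.scatter(xs, ys, color=color, marker=marker, label=str(name))
            ax.set_xlabel(feature_names[feature])
            ax.set_ylabel(feature_names[feature_other])
            ax.legend(loc="best")
    return fig
```

With the iris data this would be called as plot_feature_grid(iris.data, iris.target, iris.feature_names, iris.target_names), and the returned figure can then be saved with fig.savefig("iris_grid.png"), where the file name is only an example.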