Data processing and visualization of machine learning [iris data classification | feature attribute comparison]
2022-06-11 23:03:00 【Hua Weiyun】
I. Preface
1.1 Principle behind this article
Most machine learning models operate on features. A feature is usually a numerical representation of an input variable that the model can consume.
In most cases, collected data must be processed before an algorithm can use it. A dataset typically contains many features, and some of them may be redundant or irrelevant to the value we want to predict. Data processing and visualization help identify and filter them out.
Feature selection also matters because it simplifies the model, shortens training time, avoids the curse of dimensionality, and improves generalization by reducing overfitting.
1.2 Purpose
1. Become familiar with data processing and visualization methods for machine learning.
2. Use data processing and visualization methods to analyze the features of a dataset.
1.3 Objectives and contents
1. Install scikit-learn and its related Python packages;
2. Load the iris dataset in the program;
3. Use matplotlib to plot pairwise comparisons of the iris features;
4. Analyze the resulting plots to find which features clearly separate the iris classes.
1.4 Environment used in this article
1. A PC
2. Windows 10
3. The scikit-learn package
4. Jupyter, PyCharm, or another Python editor
II. Experimental procedure
2.1 Install scikit-learn and related modules
The installation steps are omitted here. Install the scikit-learn module directly; using a domestic mirror can save time.
Run
pip show scikit-learn to check whether the scikit-learn module is installed in the local environment.
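The same check can also be made from inside Python. The following is a minimal sketch that uses the standard-library importlib.metadata module (Python 3.8 or later); the helper name installed_version is hypothetical and not part of the original article.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("scikit-learn")
if version is None:
    print("scikit-learn is not installed; try: pip install scikit-learn")
else:
    print("scikit-learn version:", version)
```

This is convenient in a notebook, where the interpreter in use may differ from the one that pip on the command line installs into.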
2.2 Load the iris dataset in the program
We use the load_iris dataset. It contains 150 rows. The first four columns are sepal length, sepal width, petal length, and petal width, the four attributes used to identify an iris: 'sepal_len', 'sepal_wid', 'petal_len', 'petal_wid'.
The fifth column is the iris class (one of three: Setosa, Versicolour, Virginica). In the object returned by load_iris, the four attributes are stored in iris.data and the class labels in iris.target.
The code is as follows:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
print(X.shape, X)
Printing X displays its shape, (150, 4), followed by the 150 samples.
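To see how the returned object is organized, the sketch below prints the shape, the column names, the class names, and the number of samples per class. The variable names y and class_counts are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data    # feature matrix, one row per flower
y = iris.target  # integer class label per row (0, 1 or 2)

print(X.shape)                 # rows and columns of the feature matrix
print(iris.feature_names)      # names of the four feature columns
print(iris.target_names)       # names of the three classes
class_counts = np.bincount(y)  # number of samples in each class
print(class_counts)
```

Each class contributes the same number of rows, which is what later allows the data to be split with fixed row ranges.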
2.3 Use matplotlib to plot pairwise comparisons of the iris features
Because we will use the figure method, we first set the figure size so that all 16 subplots can be displayed properly:
plt.figure(figsize=(44,44))
We need 16 subplots, so we set a variable to 4 and loop over it twice, one loop nested in the other.
feature_max_num=4
The nested loop looks like this:
for feature in range(feature_max_num):
    for feature_other in range(feature_max_num):
It visits the pairs
0-0, 0-1, 0-2, 0-3, 1-0, 1-1, ...
for 16 combinations in total. Each index is also used to select the corresponding feature column.
We need to set the position of each subplot so that the subplots are drawn one after another. The advantage of this approach is that it is simple; the disadvantage is that it is somewhat tedious.
The code is as follows:
plt.subplot(feature_max_num,feature_max_num,feature*feature_max_num+feature_other +1,frame_on= True)
The pairs 0-0, 1-1, 2-2, and 3-3 are a special case, which we handle separately.
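The third argument of plt.subplot is a 1-based position counted row by row. The following sketch, which needs no plotting at all, shows how each feature pair maps to a position; the helper name subplot_index is hypothetical.

```python
def subplot_index(row, col, n_features):
    """Map a (row, col) grid position to matplotlib's 1-based subplot index."""
    return row * n_features + col + 1


feature_max_num = 4
pairs = [(feature, feature_other)
         for feature in range(feature_max_num)
         for feature_other in range(feature_max_num)]
indices = [subplot_index(r, c, feature_max_num) for r, c in pairs]

for (r, c), idx in zip(pairs, indices):
    print(f"{r}-{c} -> subplot {idx}")
```

The 16 pairs fill positions 1 through 16 with no gaps or repeats, so every cell of the 4 x 4 grid is used exactly once.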
To use plt.scatter we need to understand its parameters:
matplotlib.pyplot.scatter(x, y, s=None, c=None, marker=None, cmap=None, norm=None, vmin=None,
vmax=None, alpha=None, linewidths=None, verts=None, edgecolors=None, hold=None, data=None, **kwargs)
This signature comes from an older matplotlib release. The verts and hold parameters have since been removed, so do not pass them with current versions.
- x, y → coordinates of the scatter points
- s → marker area
- c → marker color (the default is blue; other colors are specified as for plt.plot())
- marker → marker style (the default is a filled circle, 'o'; other styles are the same as for plt.plot())
- alpha → marker transparency (a number in [0, 1]; 0 is fully transparent, 1 is fully opaque)
- linewidths → line width of the marker edges
- edgecolors → color of the marker edges
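As an illustration of the parameters listed above, here is a small sketch on made-up points. The data values are invented for the example, and the Agg backend is selected only so that the code also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

# Made-up coordinates, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.5, 3.5, 3.0]

fig, ax = plt.subplots()
points = ax.scatter(
    x, y,
    s=80,                # marker area
    c="green",           # fill color
    marker="o",          # filled circle
    alpha=0.6,           # 0 is fully transparent, 1 is fully opaque
    linewidths=1.5,      # width of the marker edge
    edgecolors="black",  # color of the marker edge
    label="demo points",
)
ax.legend(loc="best")
```

In an interactive session, drop the matplotlib.use line and call plt.show() to view the result.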
if feature==feature_other:  # special case
When feature==feature_other, both loop indices are the same, so the x and y coordinates of every point would be identical, which is not very informative. Instead we use an increasing counter as the x coordinate, running from 0 to 49.
    plt.scatter([i for i in range(50)],X[0:50,feature],color='green',marker='o',label='setosa')
    ... ...
In the other cases the x and y coordinates differ, so the points can be drawn normally:
else:
    plt.scatter(X[0:50,feature],X[0:50,feature_other],color='green',marker='o',label='setosa')
    ... ...
Explanation of the code above:
X[0:50,feature],X[0:50,feature_other]
are the x and y coordinates of the scatter points. The dataset has 150 samples, 50 per class, stored consecutively, so the row ranges 0:50, 50:100, and 100: select the three classes, and the feature index selects the column to plot.
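The fixed row ranges work only because the rows are grouped by class. A sketch of a check, assuming the labels in iris.target are used as the reference, compares each range with a boolean mask built from the labels; the names slices, by_slice, and by_mask are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Fixed row ranges as used in the article, keyed by class index.
slices = {0: slice(0, 50), 1: slice(50, 100), 2: slice(100, 150)}

for class_index, class_name in enumerate(iris.target_names):
    by_slice = X[slices[class_index]]  # rows picked by position
    by_mask = X[y == class_index]      # rows picked by label
    same = np.array_equal(by_slice, by_mask)
    print(class_name, by_mask.shape, "matches slice:", same)
```

Selecting by label is the more robust choice if the data might ever be shuffled or filtered before plotting.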
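One piece of syntax needs explaining:
a[:,1] selects every row of column 1 (the second column) of a two-dimensional array a. Likewise, X[0:50,feature] selects rows 0 to 49 of the column with index feature.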
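A small worked example on a made-up 3 x 3 array may make the indexing concrete:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

second_column = a[:, 1]     # all rows, column 1
first_two_rows = a[0:2, 1]  # rows 0 and 1 only, column 1

print(second_column)   # [2 5 8]
print(first_two_rows)  # [2 5]
```

The part before the comma chooses rows and the part after it chooses columns; a bare colon means "all of them".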
Next we set the labels of the X and Y axes. The syntax is as follows:
xlabel(xlabel, fontdict=None, labelpad=None, *, loc=None, **kwargs)
- xlabel: string, the text of the label.
- fontdict: dict, controls the font style of the label.
- labelpad: float, default None; the spacing between the label and the axis.
- loc: one of {'left', 'center', 'right'}, default rcParams["xaxis.labellocation"] ('center'); the position of the label.
- **kwargs: Text properties that control the appearance of the text, such as font and color.
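The sketch below exercises these parameters on a throwaway plot. The label text and style values are made up for the example, and the loc argument assumes a matplotlib release recent enough to include it in the signature shown above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

plt.figure()
plt.plot([0, 1, 2], [0, 1, 4])

plt.xlabel(
    "petal length (cm)",
    fontdict={"fontsize": 12, "fontweight": "bold"},  # font style of the label
    labelpad=10,      # spacing between the label and the axis
    loc="right",      # align the label with the right end of the axis
    color="darkred",  # extra Text property passed through **kwargs
)
plt.ylabel("petal width (cm)")

ax = plt.gca()  # current axes, used here only to read the settings back
print(ax.get_xlabel(), ax.xaxis.labelpad)
```

In the article's own code only the label text is passed, so all of these options keep their defaults.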
plt.xlabel(iris.feature_names[feature])
plt.ylabel(iris.feature_names[feature_other])
Finally, set the legend position and display the figure.
plt.legend(loc='best')
plt.show()
The resulting figure is as follows:
2.4 Analyze the plots to find features that clearly separate the iris classes
In the figure, subplots 0-2 and 1-3 show the clearest separation.
That is, plotting sepal length against petal length, and sepal width against petal width, clearly distinguishes the iris species.
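One way to back up the visual impression with numbers is to print the range of the petal measurements for each class. This is a sketch of such a check, not part of the original experiment; the names petal_length, petal_width, setosa_max, and others_min are introduced here for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

petal_length = 2  # column index of petal length in iris.data
petal_width = 3   # column index of petal width in iris.data

for column in (petal_length, petal_width):
    print(iris.feature_names[column])
    for class_index, class_name in enumerate(iris.target_names):
        values = X[y == class_index, column]
        print(f"  {class_name}: min={values.min():.1f}, max={values.max():.1f}")

# Compare the largest setosa petal length with the smallest one among the other classes.
setosa_max = X[y == 0, petal_length].max()
others_min = X[y != 0, petal_length].min()
print("setosa separable on petal length:", setosa_max < others_min)
```

If the ranges of two classes do not overlap on a feature, a single threshold on that feature is enough to tell them apart, which is what a clean gap in the scatter plot suggests.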
III. Source code for this article
The complete source code for this article is below and can be run directly:
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data  # 150 x 4 feature matrix
print(X.shape, X)

plt.figure(figsize=(44,44))
feature_max_num=4
for feature in range(feature_max_num):
    for feature_other in range(feature_max_num):
        # Place the subplot in the 4 x 4 grid (positions are 1-based)
        plt.subplot(feature_max_num,feature_max_num,feature*feature_max_num+feature_other +1,frame_on= True)
        if feature==feature_other:
            # Same feature on both axes: use the sample index 0..49 as x
            plt.scatter([i for i in range(50)],X[0:50,feature],color='green',marker='o',label='setosa')
            plt.scatter([i for i in range(50)],X[50:100,feature],color='blue',marker='x',label='versicolor')
            plt.scatter([i for i in range(50)],X[100:,feature],color='red',marker='+',label='Virginica')
        else:
            # Different features: plot one against the other for each class
            plt.scatter(X[0:50,feature],X[0:50,feature_other],color='green',marker='o',label='setosa')
            plt.scatter(X[50:100,feature],X[50:100,feature_other],color='blue',marker='x',label='versicolor')
            plt.scatter(X[100:,feature],X[100:,feature_other],color="red",marker='+',label='Virginica')
        plt.xlabel(iris.feature_names[feature])
        plt.ylabel(iris.feature_names[feature_other])
        plt.legend(loc='best')
plt.show()