Fundamentals of machine learning (II) -- division of training set and test set
2022-07-02 13:16:00 【Bayesian grandson】
1. Division of test set and training set
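# load_boston was removed in scikit-learn 1.2, so it is not imported here;
# the Boston data is read from its original source further below.
from sklearn.datasets import load_iris, fetch_20newsgroups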
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd
li = load_iris()
# The returned object exposes, among others, the attributes 'data' and 'target'
print("Feature values:")
print(li.data[0:5])
print("Target values:")
print(li.target[0:5])
Feature values:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]]
Target values:
[0 0 0 0 0]
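Besides data and target, the object returned by load_iris carries a few more attributes that help when reading the arrays. The following is a minimal sketch that only inspects shapes and names; it assumes nothing beyond a standard scikit-learn installation.

```python
from sklearn.datasets import load_iris

li = load_iris()

# load_iris returns a dict-like Bunch object with several attributes.
print(li.data.shape)     # one row per flower, one column per feature
print(li.target.shape)   # one integer class label per flower
print(li.feature_names)  # column names, in the same order as li.data
print(li.target_names)   # class names; index i corresponds to label i
```

Reading feature_names next to the printed rows makes it clear which column is which measurement.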
Note the order of the values returned by train_test_split: x_train, x_test, y_train, y_test. The training set consists of x_train and y_train; the test set consists of x_test and y_test.
1.1 Divide the training set and the test set
# The call looks like this:
x_train, x_test, y_train, y_test = train_test_split(li.data, li.target, test_size=0.25)
# The dataset is large, so printing the first 5 samples is enough.
print("Training set features and targets:", x_train[:5], y_train[:5])
print("Test set features and targets:", x_test[:5], y_test[:5])
Training set features and targets: [[6.3 3.3 6. 2.5]
[6. 3. 4.8 1.8]
[7. 3.2 4.7 1.4]
[4.9 2.5 4.5 1.7]
[6.7 3.1 4.4 1.4]] [2 2 1 2 1]
Test set features and targets: [[7.7 3. 6.1 2.3]
[4.9 2.4 3.3 1. ]
[7.7 2.8 6.7 2. ]
[7.9 3.8 6.4 2. ]
[5.4 3.9 1.3 0.4]] [2 1 2 2 0]
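The split above is random, so the printed samples change from run to run. As a sketch of one way to make it repeatable, the example below passes random_state and stratify to train_test_split. Both are real parameters of that function, but the seed value 42 and the decision to stratify are assumptions made for illustration and do not come from the original article.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

li = load_iris()

# random_state fixes the shuffle so the split is reproducible;
# stratify keeps the class proportions roughly equal in both parts.
# The seed 42 is an arbitrary value chosen for illustration.
x_train, x_test, y_train, y_test = train_test_split(
    li.data, li.target, test_size=0.25, random_state=42, stratify=li.target
)

print(x_train.shape, x_test.shape)
print("train class counts:", np.bincount(y_train))
print("test class counts:", np.bincount(y_test))
```

With 150 samples and test_size=0.25, the test set receives 38 samples and the training set 112.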
print(li.DESCR) # Some descriptive information of the data set
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%[email protected])
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
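# Read the Boston housing data directly from its original source.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
# Raw string for the regex separator; the first 22 lines of the file are a text header.
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Each record spans two physical lines: 11 values on the first, 3 on the second.
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])  # 13 features
target = raw_df.values[1::2, 2]  # last value of each record: the house price
data[:5]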
array([[6.3200e-03, 1.8000e+01, 2.3100e+00, 0.0000e+00, 5.3800e-01,
6.5750e+00, 6.5200e+01, 4.0900e+00, 1.0000e+00, 2.9600e+02,
1.5300e+01, 3.9690e+02, 4.9800e+00],
[2.7310e-02, 0.0000e+00, 7.0700e+00, 0.0000e+00, 4.6900e-01,
6.4210e+00, 7.8900e+01, 4.9671e+00, 2.0000e+00, 2.4200e+02,
1.7800e+01, 3.9690e+02, 9.1400e+00],
[2.7290e-02, 0.0000e+00, 7.0700e+00, 0.0000e+00, 4.6900e-01,
7.1850e+00, 6.1100e+01, 4.9671e+00, 2.0000e+00, 2.4200e+02,
1.7800e+01, 3.9283e+02, 4.0300e+00],
[3.2370e-02, 0.0000e+00, 2.1800e+00, 0.0000e+00, 4.5800e-01,
6.9980e+00, 4.5800e+01, 6.0622e+00, 3.0000e+00, 2.2200e+02,
1.8700e+01, 3.9463e+02, 2.9400e+00],
[6.9050e-02, 0.0000e+00, 2.1800e+00, 0.0000e+00, 4.5800e-01,
7.1470e+00, 5.4200e+01, 6.0622e+00, 3.0000e+00, 2.2200e+02,
1.8700e+01, 3.9690e+02, 5.3300e+00]])
target[:5]
array([24. , 21.6, 34.7, 33.4, 36.2])
The two examples above show two different kinds of dataset: in one the target is discrete (a class label), in the other it is continuous (a price).
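The difference between the two kinds of target can also be checked programmatically. The sketch below uses type_of_target from sklearn.utils.multiclass. The price array holds the five target values printed above, copied by hand so that the example does not need to download anything.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.utils.multiclass import type_of_target

iris_target = load_iris().target
# The first five house prices printed above, copied by hand.
price_target = np.array([24.0, 21.6, 34.7, 33.4, 36.2])

print(type_of_target(iris_target))   # discrete class labels
print(type_of_target(price_target))  # real-valued target
```

A discrete target calls for a classifier, a continuous one for a regressor.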
2. fit and transform
fit(): computes the parameters (for StandardScaler, the mean μ and the standard deviation σ of each feature) and stores them on the object.
You can think of it as a summary pass over the data before any conversion takes place: it collects basic statistics such as the mean, variance, maximum and minimum, much like the overview pandas gives through DataFrame.info() or DataFrame.describe().
transform(): applies the transformation to a given dataset, using the parameters computed by fit().
Before calling transform, the object must first be fitted on the data; only then can operations such as standardization, dimensionality reduction or normalization be applied (for example with PCA or StandardScaler).
fit_transform(): combines fit() and transform() in a single call.
It is equivalent to calling fit and then transform on the same data, so it covers both the parameter estimation and the data conversion.
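To make the stored parameters concrete, here is a minimal sketch that looks inside a fitted StandardScaler. The toy matrix X is made up for illustration; mean_ and scale_ are the attributes in which scikit-learn keeps the per-column mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])  # toy data, made up

scaler = StandardScaler()
scaler.fit(X)  # only learns the per-column statistics; X is not changed

print(scaler.mean_)   # per-column mean (mu)
print(scaler.scale_)  # per-column standard deviation (sigma)

# transform applies (x - mu) / sigma using the stored statistics
manual = (X - scaler.mean_) / scaler.scale_
assert np.allclose(scaler.transform(X), manual)

# fit_transform gives the same result as fit followed by transform
assert np.allclose(StandardScaler().fit_transform(X), manual)
```

The two assertions pass silently, which confirms that transform is the formula (x - μ) / σ and that fit_transform is a shortcut for the two separate calls.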
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler
sample_1 = [[2,4,2,3],[6,4,3,2],[8,4,5,6]]  # 3 samples, 4 features
s = StandardScaler()
s.fit_transform(sample_1)  # fit and transform in one step
array([[-1.33630621, 0. , -1.06904497, -0.39223227],
[ 0.26726124, 0. , -0.26726124, -0.98058068],
[ 1.06904497, 0. , 1.33630621, 1.37281295]])
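ss = StandardScaler()
ss.fit(sample_1)  # step 1: compute the mean and standard deviation of each column
StandardScaler()
ss.transform(sample_1)  # step 2: apply them; the result matches fit_transform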
array([[-1.33630621, 0. , -1.06904497, -0.39223227],
[ 0.26726124, 0. , -0.26726124, -0.98058068],
[ 1.06904497, 0. , 1.33630621, 1.37281295]])
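The separation of fit and transform matters most once the data has been divided as in section 1. A common practice, sketched below and not taken from the original article, is to fit the scaler on the training set only and reuse the learned parameters on the test set. The seed 0 is an arbitrary assumption.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

li = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    li.data, li.target, test_size=0.25, random_state=0  # seed chosen arbitrarily
)

scaler = StandardScaler()
x_train_std = scaler.fit_transform(x_train)  # learn mu and sigma from the training data
x_test_std = scaler.transform(x_test)        # reuse them; do not refit on the test data

print(x_train_std.mean(axis=0).round(6))  # approximately zero for every column
print(x_test_std.mean(axis=0).round(6))   # near zero, but generally not exactly zero
```

Because the test set is scaled with statistics from the training set, its column means are only approximately zero. That is expected, and it keeps information from the test set out of the preprocessing step.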