Fundamentals of machine learning (II) -- division of training set and test set
2022-07-02 13:16:00 【Bayesian grandson】
1. Division of test set and training set
from sklearn.datasets import load_iris, fetch_20newsgroups  # load_boston was removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
li = load_iris()
# li exposes the attributes `data` (feature values) and `target` (labels)
print("Feature values:")
print(li.data[0:5])
print("Target values:")
print(li.target[0:5])
Feature values:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
Target values:
[0 0 0 0 0]
Note the return values of train_test_split: the training set is x_train, y_train and the test set is x_test, y_test. They are returned in the order x_train, x_test, y_train, y_test.
1.1 Divide the training set and the test set
# The call looks like this:
x_train, x_test, y_train, y_test = train_test_split(li.data, li.target, test_size=0.25)
# There are many rows, so printing the first 5 samples is enough.
print("Training set feature values and target values:", x_train[:5], y_train[:5])
print("Test set feature values and target values:", x_test[:5], y_test[:5])
Training set feature values and target values: [[6.3 3.3 6.  2.5]
 [6.  3.  4.8 1.8]
 [7.  3.2 4.7 1.4]
 [4.9 2.5 4.5 1.7]
 [6.7 3.1 4.4 1.4]] [2 2 1 2 1]
Test set feature values and target values: [[7.7 3.  6.1 2.3]
 [4.9 2.4 3.3 1. ]
 [7.7 2.8 6.7 2. ]
 [7.9 3.8 6.4 2. ]
 [5.4 3.9 1.3 0.4]] [2 1 2 2 0]
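The split above is random, so the printed rows change from run to run. A minimal sketch of a reproducible variant follows; the choice of `random_state=42` and of stratifying on `li.target` are assumptions for illustration, not part of the original example.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

li = load_iris()

# random_state fixes the shuffle, so the same rows land in each subset on every run.
# stratify=li.target keeps the class proportions roughly equal in both subsets.
x_train, x_test, y_train, y_test = train_test_split(
    li.data, li.target, test_size=0.25, random_state=42, stratify=li.target
)

print(x_train.shape, x_test.shape)
print("train classes:", Counter(y_train.tolist()))
print("test classes:", Counter(y_test.tolist()))
```

With 150 samples and test_size=0.25 this gives 112 training rows and 38 test rows. Stratifying is optional, but it avoids a test set that happens to contain very few samples of one class.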
print(li.DESCR) # Some descriptive information of the data set
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%[email protected])
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
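import numpy as np  # needed for np.hstack below
# load_boston was removed in scikit-learn 1.2, so read the Boston housing data from its original source.
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Each record spans two rows in the file; stitch them back together into 13 feature columns.
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]  # the target column (house price)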
data[:5]
array([[6.3200e-03, 1.8000e+01, 2.3100e+00, 0.0000e+00, 5.3800e-01,
6.5750e+00, 6.5200e+01, 4.0900e+00, 1.0000e+00, 2.9600e+02,
1.5300e+01, 3.9690e+02, 4.9800e+00],
[2.7310e-02, 0.0000e+00, 7.0700e+00, 0.0000e+00, 4.6900e-01,
6.4210e+00, 7.8900e+01, 4.9671e+00, 2.0000e+00, 2.4200e+02,
1.7800e+01, 3.9690e+02, 9.1400e+00],
[2.7290e-02, 0.0000e+00, 7.0700e+00, 0.0000e+00, 4.6900e-01,
7.1850e+00, 6.1100e+01, 4.9671e+00, 2.0000e+00, 2.4200e+02,
1.7800e+01, 3.9283e+02, 4.0300e+00],
[3.2370e-02, 0.0000e+00, 2.1800e+00, 0.0000e+00, 4.5800e-01,
6.9980e+00, 4.5800e+01, 6.0622e+00, 3.0000e+00, 2.2200e+02,
1.8700e+01, 3.9463e+02, 2.9400e+00],
[6.9050e-02, 0.0000e+00, 2.1800e+00, 0.0000e+00, 4.5800e-01,
7.1470e+00, 5.4200e+01, 6.0622e+00, 3.0000e+00, 2.2200e+02,
1.8700e+01, 3.9690e+02, 5.3300e+00]])
target[:5]
array([24. , 21.6, 34.7, 33.4, 36.2])
The two data sets above differ in the type of their target: for iris the target is discrete (a class label), while for the Boston data it is continuous (a price).
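One way to check which kind of target you are dealing with is scikit-learn's `type_of_target` helper. The sketch below is an illustration, not part of the original walkthrough; `price_target` is a hypothetical name holding only the first five price values printed above, so that the example runs without downloading the Boston file.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.utils.multiclass import type_of_target

iris_target = load_iris().target
# First five target values printed above (assumed here instead of re-downloading the data).
price_target = np.array([24.0, 21.6, 34.7, 33.4, 36.2])

iris_kind = type_of_target(iris_target)    # discrete labels with three classes
price_kind = type_of_target(price_target)  # real-valued target

print(iris_kind)   # multiclass
print(price_kind)  # continuous
```

A discrete target calls for a classifier, a continuous one for a regressor.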
2. fit and transform
fit(): computes the parameters the transformation needs (for StandardScaler, the mean μ and standard deviation σ of each feature) and stores them on the object.
You can think of it as a pass over the data before anything is converted, collecting basic statistics such as the mean, variance, maximum, and minimum, loosely like the summary that describe() gives for a pandas DataFrame.
transform(): applies the transformation to a particular data set using those stored parameters.
Before calling transform you must first call fit on the data; only then can standardization, dimensionality reduction, normalization, and similar operations be applied (for example with PCA or StandardScaler).
fit_transform(): combines fit() and transform() in a single call.
It is simply fit followed by transform on the same data, so it covers both the parameter estimation and the data conversion.
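The reason fit and transform are separate shows up as soon as there is a train/test split. A common pattern, sketched here under the assumption that the iris split from section 1.1 is reused (with `random_state=0` added only so the example is repeatable), is to fit on the training set and apply the same stored statistics to the test set.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

li = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    li.data, li.target, test_size=0.25, random_state=0
)

scaler = StandardScaler()
x_train_std = scaler.fit_transform(x_train)  # learn mean and std from the training set, then scale it
x_test_std = scaler.transform(x_test)        # reuse the training statistics; no second fit

print("stored means:", scaler.mean_)
print("stored stds: ", scaler.scale_)
print("train column means after scaling:", np.round(x_train_std.mean(axis=0), 6))
```

Calling fit_transform on the test set instead would recompute μ and σ from the test data, so the two sets would no longer be on the same scale.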
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler
sample_1 = [[2,4,2,3],[6,4,3,2],[8,4,5,6]]
s = StandardScaler()
s.fit_transform(sample_1)
array([[-1.33630621, 0. , -1.06904497, -0.39223227],
[ 0.26726124, 0. , -0.26726124, -0.98058068],
[ 1.06904497, 0. , 1.33630621, 1.37281295]])
ss = StandardScaler()
ss.fit(sample_1)
StandardScaler()
ss.transform(sample_1)
array([[-1.33630621, 0. , -1.06904497, -0.39223227],
[ 0.26726124, 0. , -0.26726124, -0.98058068],
[ 1.06904497, 0. , 1.33630621, 1.37281295]])
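The numbers above can be reproduced by hand with z = (x - μ) / σ, computed column by column. The sketch below is a check under two assumptions: σ is the population standard deviation (ddof=0), and a column with zero variance is divided by 1 instead of 0, which is why the second column comes out as all zeros. The names `mu`, `sigma`, `sigma_safe`, and `manual` are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

sample_1 = np.array([[2, 4, 2, 3], [6, 4, 3, 2], [8, 4, 5, 6]], dtype=float)

mu = sample_1.mean(axis=0)     # per-column mean
sigma = sample_1.std(axis=0)   # per-column population standard deviation (ddof=0)
# The second column is constant, so its sigma is 0; divide by 1 there to avoid 0/0.
sigma_safe = np.where(sigma == 0, 1.0, sigma)

manual = (sample_1 - mu) / sigma_safe
matches = np.allclose(manual, StandardScaler().fit_transform(sample_1))

print(manual)
print("matches StandardScaler:", matches)
```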