Fundamentals of machine learning (II): splitting data into a training set and a test set
2022-07-02 13:16:00 【Bayesian grandson】
1. Splitting data into a test set and a training set
from sklearn.datasets import load_iris, fetch_20newsgroups
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
li = load_iris()
# li exposes the attributes 'data' and 'target', among others
print("Feature values:")
print(li.data[0:5])
print("Target values:")
print(li.target[0:5])
Feature values:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
Target values:
[0 0 0 0 0]
Note the return values: train_test_split returns x_train, x_test, y_train, y_test, in that order. The training set is (x_train, y_train) and the test set is (x_test, y_test).
1.1 Splitting into a training set and a test set
# The call looks like this (no random_state is set, so the rows shown will differ between runs):
x_train, x_test, y_train, y_test = train_test_split(li.data, li.target, test_size=0.25)
# There are many rows, so printing the first 5 samples is enough.
print("Training set features and targets:", x_train[:5], y_train[:5])
print("Test set features and targets:", x_test[:5], y_test[:5])
Training set features and targets: [[6.3 3.3 6.  2.5]
 [6.  3.  4.8 1.8]
 [7.  3.2 4.7 1.4]
 [4.9 2.5 4.5 1.7]
 [6.7 3.1 4.4 1.4]] [2 2 1 2 1]
Test set features and targets: [[7.7 3.  6.1 2.3]
 [4.9 2.4 3.3 1. ]
 [7.7 2.8 6.7 2. ]
 [7.9 3.8 6.4 2. ]
 [5.4 3.9 1.3 0.4]] [2 1 2 2 0]
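The split above is not seeded, so every run yields different rows. A minimal sketch of a repeatable, class-balanced split follows, assuming the same iris data. The seed value 42 and the test_size of 0.2 are arbitrary choices made for this illustration (0.2 is used instead of 0.25 so that the 150 samples divide evenly across the three classes); random_state and stratify are standard train_test_split parameters.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# random_state fixes the shuffle so the split is repeatable;
# stratify keeps the class proportions the same in both subsets.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data,
    iris.target,
    test_size=0.2,         # illustrative value, not the 0.25 used in the article
    random_state=42,       # arbitrary seed chosen for this sketch
    stratify=iris.target,
)

train_counts = dict(Counter(y_train.tolist()))
test_counts = dict(Counter(y_test.tolist()))

print(x_train.shape, x_test.shape)
print(train_counts, test_counts)
```

With stratification, each class contributes the same share to the training set and to the test set, which matters more on small or imbalanced datasets than on iris.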
print(li.DESCR)  # Descriptive information about the dataset
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. topic:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
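# The Boston housing data is read from its original source (load_boston is no longer available from scikit-learn 1.2 onward)
import numpy as np
data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# Each record spans two physical lines: 11 values on the first line and 3 on the second
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]  # the last value of each record is the target (house price)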
data[:5]
array([[6.3200e-03, 1.8000e+01, 2.3100e+00, 0.0000e+00, 5.3800e-01,
6.5750e+00, 6.5200e+01, 4.0900e+00, 1.0000e+00, 2.9600e+02,
1.5300e+01, 3.9690e+02, 4.9800e+00],
[2.7310e-02, 0.0000e+00, 7.0700e+00, 0.0000e+00, 4.6900e-01,
6.4210e+00, 7.8900e+01, 4.9671e+00, 2.0000e+00, 2.4200e+02,
1.7800e+01, 3.9690e+02, 9.1400e+00],
[2.7290e-02, 0.0000e+00, 7.0700e+00, 0.0000e+00, 4.6900e-01,
7.1850e+00, 6.1100e+01, 4.9671e+00, 2.0000e+00, 2.4200e+02,
1.7800e+01, 3.9283e+02, 4.0300e+00],
[3.2370e-02, 0.0000e+00, 2.1800e+00, 0.0000e+00, 4.5800e-01,
6.9980e+00, 4.5800e+01, 6.0622e+00, 3.0000e+00, 2.2200e+02,
1.8700e+01, 3.9463e+02, 2.9400e+00],
[6.9050e-02, 0.0000e+00, 2.1800e+00, 0.0000e+00, 4.5800e-01,
7.1470e+00, 5.4200e+01, 6.0622e+00, 3.0000e+00, 2.2200e+02,
1.8700e+01, 3.9690e+02, 5.3300e+00]])
target[:5]
array([24. , 21.6, 34.7, 33.4, 36.2])
The above shows two different kinds of datasets: in one the target is discrete (a category), and in the other it is continuous (a price).
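The difference between the two kinds of target can also be checked programmatically. The sketch below uses scikit-learn's type_of_target helper on two small arrays: class_labels is a made-up sample of iris-style labels, and prices repeats the five target values printed above. Both variable names are introduced here only for illustration.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical sample of iris-style class labels (integers 0, 1, 2)
class_labels = np.array([0, 0, 1, 2, 1])

# The first five Boston targets shown above (prices, real-valued)
prices = np.array([24.0, 21.6, 34.7, 33.4, 36.2])

label_kind = type_of_target(class_labels)
price_kind = type_of_target(prices)

print(label_kind)  # 'multiclass'  -> a classification problem
print(price_kind)  # 'continuous'  -> a regression problem
```

A discrete target calls for a classifier, while a continuous target calls for a regressor.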
2. fit and transform
fit(): computes the parameters μ (mean) and σ (standard deviation) and stores them on the object.
It can be understood as collecting basic properties of the data, such as the mean, variance, maximum, and minimum, before the dataset is converted, similar to a pandas summary such as DataFrame.describe().
transform(): applies the transformation to a given dataset using the parameters computed by fit().
Before calling transform, the object must first be fitted to data; only then can operations such as standardization, dimensionality reduction, or normalization be carried out (for example with PCA or StandardScaler).
fit_transform(): combines the fit() and transform() methods to fit and transform a dataset in one call.
It is simply fit and transform together, covering both the fitting step and the data conversion.
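To make the stored parameters concrete, here is a minimal sketch that inspects what StandardScaler keeps after fit() and reproduces transform() by hand. It reuses the same numbers as the sample_1 list in this section; the variable names sample, scaler, and manual are introduced only for this illustration. mean_ and scale_ are the fitted attributes in which StandardScaler stores μ and σ.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Same values as sample_1 in this section, as a float array
sample = np.array([[2, 4, 2, 3], [6, 4, 3, 2], [8, 4, 5, 6]], dtype=float)

scaler = StandardScaler()
scaler.fit(sample)      # learns the per-column mean and standard deviation

print(scaler.mean_)     # per-column mean (mu)
print(scaler.scale_)    # per-column standard deviation (sigma)

# transform() applies (x - mu) / sigma using the stored parameters
manual = (sample - scaler.mean_) / scaler.scale_
matches = np.allclose(manual, scaler.transform(sample))
print(matches)
```

The second column is constant, so its standard deviation is zero; StandardScaler stores a scale of 1.0 for such a column to avoid dividing by zero, which is why that column comes out as all zeros.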
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler
sample_1 = [[2,4,2,3],[6,4,3,2],[8,4,5,6]]
s = StandardScaler()
s.fit_transform(sample_1)
array([[-1.33630621, 0. , -1.06904497, -0.39223227],
[ 0.26726124, 0. , -0.26726124, -0.98058068],
[ 1.06904497, 0. , 1.33630621, 1.37281295]])
ss = StandardScaler()
ss.fit(sample_1)
StandardScaler()
ss.transform(sample_1)
array([[-1.33630621, 0. , -1.06904497, -0.39223227],
[ 0.26726124, 0. , -0.26726124, -0.98058068],
[ 1.06904497, 0. , 1.33630621, 1.37281295]])
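Keeping fit and transform separate pays off once the data has been split: the scaler is fitted on the training set only, and the same μ and σ are then reused for the test set. The following is a sketch under that assumption, using the iris data from section 1; the seed value 0 is arbitrary and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0  # arbitrary seed
)

scaler = StandardScaler()
x_train_std = scaler.fit_transform(x_train)  # learn mu and sigma from the training data only
x_test_std = scaler.transform(x_test)        # reuse the training mu and sigma

print(x_train_std.mean(axis=0).round(6))  # close to 0 in every column
print(x_test_std.mean(axis=0).round(6))   # generally not exactly 0
```

Calling fit_transform on the test set instead would compute new parameters from the test data, so the two sets would no longer be on the same scale and information from the test set would influence preprocessing.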