Data classification: support vector machine
2022-07-03 10:24:00 【Why】
One. Job requirements
- Write an SVM program (the corresponding code may be found on the network); the platform is your own choice.
- Use the SVM algorithm with three different kernel functions to build classification models for the given sample data set. The "type" column in the data file is the class label to identify.
- Use 60% of the data as the training set and 40% as the test set, and evaluate the results with accuracy, sensitivity, and specificity.
- Complete the mining report.
Two. Dataset pre-analysis
The first five rows of the data set are displayed first (screenshot omitted), followed by its overall information.
The data set consists of 569 rows and 12 columns with no missing values, so no missing-value handling is needed. The 12 columns comprise an ID, the class label "type", and 10 feature fields.
Each feature's mean, variance, extrema, quartiles, and other summary statistics are then checked, and finally the distribution of the two classes is visualized.
Three. Data preprocessing
- Feature selection
On the diagonal of the correlation heat map, each variable's correlation with itself is 1; the lighter the color, the stronger the correlation. The heat map shows that radius_mean, perimeter_mean, and area_mean are highly correlated, and that compactness_mean, concavity_mean, and concave_points_mean are also strongly correlated. Highly correlated fields carry largely redundant information, so one representative is kept from each group (radius_mean and compactness_mean); together with texture_mean, smoothness_mean, symmetry_mean, and fractal_dimension_mean, six main features remain. An automated sketch of this pruning follows below.
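The report picks the representatives by reading the heat map; as an illustrative alternative, the same pruning can be automated. This is a minimal sketch with a hypothetical helper name and an arbitrarily chosen 0.9 threshold:

```python
import pandas as pd

def drop_correlated(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """Keep one representative from each group of highly correlated columns."""
    corr = df.corr().abs()
    keep = []
    for col in df.columns:
        # Keep this column only if it is not strongly correlated
        # with a column that has already been kept.
        if all(corr.loc[col, kept] < threshold for kept in keep):
            keep.append(col)
    return keep

# Hypothetical usage on the 10 feature columns:
# features_remain = drop_correlated(data[features_mean], threshold=0.9)
```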
Four. Related knowledge
- SVM kernel functions
① Linear kernel
The linear kernel (Linear Kernel) corresponds to the linearly separable SVM. In other words, the linearly separable SVM can be subsumed under the general kernelized SVM; the only difference is that the linearly separable SVM uses the linear kernel.
② Gaussian kernel
The Gaussian kernel (Gaussian Kernel), also called the radial basis function kernel (Radial Basis Function, RBF) in SVM, is the most popular kernel for nonlinear SVM classification and is the default kernel of libsvm.
③ Sigmoid kernel
The sigmoid kernel (Sigmoid Kernel) is also one of the kernels commonly used with nonlinear SVM.
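For reference, here is a minimal sketch of the three kernels as scikit-learn's SVC parameterizes them (gamma and coef0 correspond to SVC's parameters of the same names); this is an illustrative reimplementation, not the library's internal code:

```python
import numpy as np

def linear_kernel(x, z):
    # K(x, z) = <x, z>
    return np.dot(x, z)

def rbf_kernel(x, z, gamma=1.0):
    # K(x, z) = exp(-gamma * ||x - z||^2)
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))

def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    # K(x, z) = tanh(gamma * <x, z> + coef0)
    return np.tanh(gamma * np.dot(x, z) + coef0)
```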
- SVM performance metrics
① Confusion matrix
For a two-class problem, the class we care about is usually defined as the positive class and the other as the negative class. The confusion matrix is built from the following counts:
True Positive (TP): positive samples predicted as positive
True Negative (TN): negative samples predicted as negative
False Positive (FP): negative samples predicted as positive (false alarms)
False Negative (FN): positive samples predicted as negative (misses)
| Actual \ Predicted | Negative (B, 0) | Positive (M, 1) |
|---|---|---|
| Negative (B, 0) | TN | FP |
| Positive (M, 1) | FN | TP |
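As a quick illustration with toy labels invented here, scikit-learn's confusion_matrix returns exactly this layout (rows are actual classes, columns predicted), so ravel() yields TN, FP, FN, TP in order:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]  # toy ground truth (0 = B, 1 = M)
y_pred = [0, 1, 1, 0, 1, 0]  # toy predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```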
② Accuracy
Accuracy is the most common evaluation metric: the proportion of all samples that are predicted correctly, i.e. accuracy = (TP + TN) / (TP + TN + FP + FN). Generally speaking, the higher the accuracy, the better the classifier.
③ Sensitivity (recall)
Sensitivity is the proportion of all positive samples that are correctly identified, i.e. sensitivity = TP / (TP + FN); it measures the classifier's ability to recognize positive samples.
④ Specificity
Specificity is the proportion of all negative samples that are correctly identified, i.e. specificity = TN / (TN + FP); it measures the classifier's ability to recognize negative samples.
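These definitions translate directly into code. A minimal sketch (the helper name is ours; the counts are the toy values from the example above):

```python
def classification_rates(tn, fp, fn, tp):
    """Derive the report's metrics from the four confusion-matrix counts."""
    return {
        'accuracy': (tp + tn) / (tp + tn + fp + fn),
        'sensitivity': tp / (tp + fn),   # recall of the positive class
        'specificity': tn / (tn + fp),   # recall of the negative class
        'f1': 2 * tp / (2 * tp + fp + fn),
    }

print(classification_rates(tn=2, fp=1, fn=1, tp=2))
```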
Five. Outline design
Outline design flow chart (figure omitted).
Six. Detailed design and core code
- Read the data set and view its overview
```python
# Read the data set
data = pd.read_excel('Classification job data set.xlsx')
# Data set overview
print(data.info())       # column types and missing-value counts
print(data.columns)      # column names
print(data.head(5))      # first five rows
print(data.describe())   # summary statistics for each column
```
- Data cleaning and type mapping
```python
# Data cleaning: the "ID" column has no practical meaning, so drop it
data.drop('ID', axis=1, inplace=True)
# Map the class labels M (malignant) and B (benign) to 1 and 0
data['type'] = data['type'].map({'M': 1, 'B': 0})
```
- Data visualization and feature selection
```python
# Put the feature fields into features_mean (all columns after "type")
features_mean = list(data.columns[1:12])
# Visualize the distribution of the two classes
sns.countplot(x="type", data=data)
plt.show()
# Present the correlations between the features_mean fields as a heat map
corr = data[features_mean].corr()
plt.figure(figsize=(14, 14))
# annot=True prints the value in each cell
sns.heatmap(corr, annot=True)
plt.show()
# The 6 features kept after feature selection
features_remain = ['radius', 'texture', 'smoothness', 'compactness', 'symmetry', 'fractal dimension']
```
- Split the data set into training and test sets
```python
# Take 40% as the test set and the remaining 60% as the training set;
# random_state fixes the seed so the split is reproducible
train, test = train_test_split(data, test_size=0.4, random_state=42)
# Extract the selected features as training and test data
train_X = train[features_remain]
train_y = train['type']
test_X = test[features_remain]
test_y = test['type']
```
- Standardize the data
```python
# Z-score standardization: give every feature dimension mean 0 and variance 1.
# The scaler is fit on the training set only and then applied to the test set,
# so no information from the test set leaks into training.
ss = StandardScaler()
train_X = ss.fit_transform(train_X)
test_X = ss.transform(test_X)
```
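As a sanity check of what StandardScaler computes (an illustrative three-point example), it applies z = (x − mean) / std with the statistics taken from the data it was fit on:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0]])
z = StandardScaler().fit_transform(X)
print(z.ravel())  # approximately [-1.2247, 0.0, 1.2247]
```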
- Build classification models with the three SVM kernels and measure performance
```python
kernelList = ['linear', 'rbf', 'sigmoid']
for kernel in kernelList:
    svc = SVC(kernel=kernel).fit(train_X, train_y)
    y_pred = svc.predict(test_X)
    print(kernel + ":")
    # Accuracy
    print("accuracy:", metrics.accuracy_score(test_y, y_pred))
    # Recall (sensitivity)
    print("sensitivity:", recall_score(test_y, y_pred))
    # Confusion matrix: rows are the actual classes, columns the predicted ones
    C = confusion_matrix(test_y, y_pred)
    TN, FP = C[0][0], C[0][1]
    FN, TP = C[1][0], C[1][1]
    # Specificity
    print("specificity:", TN / (TN + FP))
    # F1 score
    print("F1_score:", f1_score(test_y, y_pred))
    # print(classification_report(test_y, y_pred))
```
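The commented-out classification_report call is a handy cross-check: it reports per-class precision, recall, and F1 in one line, and the recall of class 0 equals the specificity while the recall of class 1 equals the sensitivity. A sketch, reusing test_y and y_pred from the loop above:

```python
from sklearn.metrics import classification_report

# Class 0 = B (benign), class 1 = M (malignant)
print(classification_report(test_y, y_pred, target_names=['B', 'M']))
```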
Seven. Run screenshots
The accuracy, sensitivity, and specificity results of the three kernel-function models are shown below (screenshots omitted):
For a more intuitive view of the performance of the models built with the three SVM kernels, the results are collected into a table (values kept to five decimal places; table omitted with the screenshots).
As the table shows, for this data set the accuracy and sensitivity of the model built with the rbf kernel are higher than those of the other two kernels, and its specificity is only slightly lower than that of the linear-kernel model. Overall, the classification model built with the rbf kernel performs best on this data set.
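If you would rather generate that comparison table directly than read it off screenshots, a minimal sketch (reusing train_X, train_y, test_X, and test_y from the code above) collects the metrics per kernel into a pandas DataFrame:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score
from sklearn.svm import SVC

rows = []
for kernel in ['linear', 'rbf', 'sigmoid']:
    y_pred = SVC(kernel=kernel).fit(train_X, train_y).predict(test_X)
    tn, fp, fn, tp = confusion_matrix(test_y, y_pred).ravel()
    rows.append({
        'kernel': kernel,
        'accuracy': accuracy_score(test_y, y_pred),
        'sensitivity': recall_score(test_y, y_pred),
        'specificity': tn / (tn + fp),
    })
print(pd.DataFrame(rows).round(5))  # five decimal places, as in the report
```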
Eight. Complete code
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, f1_score, recall_score

# Render minus signs correctly on axis ticks
plt.rcParams['axes.unicode_minus'] = False
# Render Chinese characters correctly
plt.rcParams['font.sans-serif'] = ['SimHei']
# Show all columns of the data set when printing
pd.set_option('display.max_columns', None)

# Read the data set
data = pd.read_excel('Classification job data set.xlsx')
# Data set overview
print(data.info())
print(data.columns)
print(data.head(5))
print(data.describe())

# Data cleaning: the "ID" column has no practical meaning, so drop it
data.drop('ID', axis=1, inplace=True)
# Map the class labels M (malignant) and B (benign) to 1 and 0
data['type'] = data['type'].map({'M': 1, 'B': 0})

# Put the feature fields into features_mean (all columns after "type")
features_mean = list(data.columns[1:12])
# Visualize the distribution of the two classes
sns.countplot(x="type", data=data)
plt.show()
# Present the correlations between the features_mean fields as a heat map
corr = data[features_mean].corr()
plt.figure(figsize=(14, 14))
# annot=True prints the value in each cell
sns.heatmap(corr, annot=True)
plt.show()
# The 6 features kept after feature selection
features_remain = ['radius', 'texture', 'smoothness', 'compactness', 'symmetry', 'fractal dimension']

# Take 40% as the test set and the remaining 60% as the training set;
# random_state fixes the seed so the split is reproducible
train, test = train_test_split(data, test_size=0.4, random_state=42)
# Extract the selected features as training and test data
train_X = train[features_remain]
train_y = train['type']
test_X = test[features_remain]
test_y = test['type']

# Z-score standardization: give every feature dimension mean 0 and variance 1
ss = StandardScaler()
train_X = ss.fit_transform(train_X)
test_X = ss.transform(test_X)

# Build a model with each kernel and report the four metrics
kernelList = ['linear', 'rbf', 'sigmoid']
for kernel in kernelList:
    svc = SVC(kernel=kernel).fit(train_X, train_y)
    y_pred = svc.predict(test_X)
    print(kernel + ":")
    print("accuracy:", metrics.accuracy_score(test_y, y_pred))
    print("sensitivity:", recall_score(test_y, y_pred))
    # Confusion matrix: rows are the actual classes, columns the predicted ones
    C = confusion_matrix(test_y, y_pred)
    TN, FP = C[0][0], C[0][1]
    FN, TP = C[1][0], C[1][1]
    print("specificity:", TN / (TN + FP))
    print("F1_score:", f1_score(test_y, y_pred))
    # print(classification_report(test_y, y_pred))
```