
Fundamentals of machine learning (III) -- KNN / naive Bayes / cross validation / grid search

2022-07-05 19:06:00 Bayesian grandson

3. K-nearest neighbors algorithm (KNN)

(1) KNN concept: k-nearest neighbors, i.e. each sample can be represented by its k closest neighbors. (K Nearest Neighbors)

(2) Algorithm idea: find the k samples in the dataset that are most similar to the given sample; if most of these k samples belong to a certain category, the sample is assigned to that category.

(3) Distance metric: the Euclidean distance (L2 norm) is generally used.

(4) Choice of K: if a smaller K is chosen, prediction is made within a smaller neighborhood, which reduces the approximation error of learning; the drawback is that the estimation error increases. If the nearest neighbors happen to be noise points, the prediction will be wrong. A smaller K means a more complex overall model, which is prone to overfitting.

If a larger K is chosen, a larger neighborhood is used for prediction; the advantage is that the estimation error decreases, but the approximation error increases. A larger K means a simpler overall model.
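To make the idea concrete, here is a minimal KNN sketch written with NumPy (not the sklearn implementation used later in this post); the toy arrays are made up purely for illustration:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean (L2) distances from x_new to every training sample
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training samples
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy data
X_train = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 8.0], [8.2, 7.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 2.1]), k=3))  # -> 0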

General workflow of the example

1. Process the dataset

2. Split the dataset

3. Standardize the dataset

4. Use the estimator to make classification predictions

3.1 Read data information

import pandas as pd
#  Reading data 
data = pd.read_csv("./KNN_al/train.csv")
data.head(10)
   row_id       x       y  accuracy    time    place_id
0       0  0.7941  9.0809        54  470702  8523065625
1       1  5.9567  4.7968        13  186555  1757726713
2       2  8.3078  7.0407        74  322648  1137537235
3       3  7.3665  2.5165        65  704587  6567393236
4       4  4.0961  1.1307        31  472130  7440663949
5       5  3.8099  1.9586        75  178065  6289802927
6       6  6.3336  4.3720        13  666829  9931249544
7       7  5.7409  6.7697        85  369002  5662813655
8       8  4.3114  6.9410         3  166384  8471780938
9       9  6.3414  0.0758        65  400060  1253803156
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29118021 entries, 0 to 29118020
Data columns (total 6 columns):
 #   Column    Dtype  
---  ------    -----  
 0   row_id    int64  
 1   x         float64
 2   y         float64
 3   accuracy  int64  
 4   time      int64  
 5   place_id  int64  
dtypes: float64(2), int64(4)
memory usage: 1.3 GB

3.2 Processing data

This dataset is too large, nearly 30 million rows, so we need to filter the data.

3.2.1 Shrink the data with a query filter

data = data.query("x > 1.0 & x < 1.25 & y > 2.5 & y < 2.75")
data.head(10)
      row_id       x       y  accuracy    time    place_id
600      600  1.2214  2.7023        17   65380  6683426742
957      957  1.1832  2.6891        58  785470  6683426742
4345    4345  1.1935  2.6550        11  400082  6889790653
4735    4735  1.1452  2.6074        49  514983  6822359752
5580    5580  1.0089  2.7287        19  732410  1527921905
6090    6090  1.1140  2.6262        11  145507  4000153867
6234    6234  1.1449  2.5003        34  316377  3741484405
6350    6350  1.0844  2.7436        65   36816  5963693798
7468    7468  1.0058  2.5096        66  746766  9076695703
8478    8478  1.2015  2.5187        72  690722  3992589015

3.2.2 Processing time data

time_value = pd.to_datetime(data['time'], unit='s')
time_value.head()
600    1970-01-01 18:09:40
957    1970-01-10 02:11:10
4345   1970-01-05 15:08:02
4735   1970-01-06 23:03:03
5580   1970-01-09 11:26:50
Name: time, dtype: datetime64[ns]
#  Convert the datetime Series to a DatetimeIndex so that day/hour/weekday can be extracted
time_value = pd.DatetimeIndex(time_value)
time_value
DatetimeIndex(['1970-01-01 18:09:40', '1970-01-10 02:11:10',
               '1970-01-05 15:08:02', '1970-01-06 23:03:03',
               '1970-01-09 11:26:50', '1970-01-02 16:25:07',
               '1970-01-04 15:52:57', '1970-01-01 10:13:36',
               '1970-01-09 15:26:06', '1970-01-08 23:52:02',
               ...
               '1970-01-07 10:03:36', '1970-01-09 11:44:34',
               '1970-01-04 08:07:44', '1970-01-04 15:47:47',
               '1970-01-08 01:24:11', '1970-01-01 10:33:56',
               '1970-01-07 23:22:04', '1970-01-08 15:03:14',
               '1970-01-04 00:53:41', '1970-01-08 23:01:07'],
              dtype='datetime64[ns]', name='time', length=17710, freq=None)
#  Construct some features 
data['day'] = time_value.day
data['hour'] = time_value.hour
data['weekday'] = time_value.weekday
data.head()
      row_id       x       y  accuracy    time    place_id  day  hour  weekday
600      600  1.2214  2.7023        17   65380  6683426742    1    18        3
957      957  1.1832  2.6891        58  785470  6683426742   10     2        5
4345    4345  1.1935  2.6550        11  400082  6889790653    5    15        0
4735    4735  1.1452  2.6074        49  514983  6822359752    6    23        1
5580    5580  1.0089  2.7287        19  732410  1527921905    9    11        4
#  Delete the timestamp feature 
data = data.drop(['time'], axis=1)
data.head()
      row_id       x       y  accuracy    place_id  day  hour  weekday
600      600  1.2214  2.7023        17  6683426742    1    18        3
957      957  1.1832  2.6891        58  6683426742   10     2        5
4345    4345  1.1935  2.6550        11  6889790653    5    15        0
4735    4735  1.1452  2.6074        49  6822359752    6    23        1
5580    5580  1.0089  2.7287        19  1527921905    9    11        4
#  Remove target locations with fewer than n check-ins
place_count = data.groupby('place_id').count()
place_count
#  Grouping by a feature makes that feature the index
            row_id     x     y  accuracy   day  hour  weekday
place_id
1012023972       1     1     1         1     1     1        1
1057182134       1     1     1         1     1     1        1
1059958036       3     3     3         3     3     3        3
1085266789       1     1     1         1     1     1        1
1097200869    1044  1044  1044      1044  1044  1044     1044
...            ...   ...   ...       ...   ...   ...      ...
9904182060       1     1     1         1     1     1        1
9915093501       1     1     1         1     1     1        1
9946198589       1     1     1         1     1     1        1
9950190890       1     1     1         1     1     1        1
9980711012       5     5     5         5     5     5        5

805 rows × 7 columns

# tf keeps only the places whose check-in count (row_id) is > 3
tf = place_count[place_count.row_id > 3]
tf
            row_id     x     y  accuracy   day  hour  weekday
place_id
1097200869    1044  1044  1044      1044  1044  1044     1044
1228935308     120   120   120       120   120   120      120
1267801529      58    58    58        58    58    58       58
1278040507      15    15    15        15    15    15       15
1285051622      21    21    21        21    21    21       21
...            ...   ...   ...       ...   ...   ...      ...
9741307878       5     5     5         5     5     5        5
9753855529      21    21    21        21    21    21       21
9806043737       6     6     6         6     6     6        6
9809476069      23    23    23        23    23    23       23
9980711012       5     5     5         5     5     5        5

239 rows × 7 columns

#  Reset the index so that place_id becomes a regular data column again
tf = tf.reset_index()
tf
       place_id  row_id     x     y  accuracy   day  hour  weekday
0    1097200869    1044  1044  1044      1044  1044  1044     1044
1    1228935308     120   120   120       120   120   120      120
2    1267801529      58    58    58        58    58    58       58
3    1278040507      15    15    15        15    15    15       15
4    1285051622      21    21    21        21    21    21       21
..          ...     ...   ...   ...       ...   ...   ...      ...
234  9741307878       5     5     5         5     5     5        5
235  9753855529      21    21    21        21    21    21       21
236  9806043737       6     6     6         6     6     6        6
237  9809476069      23    23    23        23    23    23       23
238  9980711012       5     5     5         5     5     5        5

239 rows × 8 columns

#  Keep only the rows of data whose place_id appears in tf.place_id
data = data[data['place_id'].isin(tf.place_id)]
data
            row_id       x       y  accuracy    place_id  day  hour  weekday
600            600  1.2214  2.7023        17  6683426742    1    18        3
957            957  1.1832  2.6891        58  6683426742   10     2        5
4345          4345  1.1935  2.6550        11  6889790653    5    15        0
4735          4735  1.1452  2.6074        49  6822359752    6    23        1
5580          5580  1.0089  2.7287        19  1527921905    9    11        4
...            ...     ...     ...       ...         ...  ...   ...      ...
29100203  29100203  1.0129  2.6775        12  3312463746    1    10        3
29108443  29108443  1.1474  2.6840        36  3533177779    7    23        2
29109993  29109993  1.0240  2.7238        62  6424972551    8    15        3
29111539  29111539  1.2032  2.6796        87  3533177779    4     0        6
29112154  29112154  1.1070  2.5419       178  4932578245    8    23        3

16918 rows × 8 columns

3.2.3 Extract the target values and feature values

y = data["place_id"]
x = data.drop(["place_id"],axis = 1) #  Drop the target column along the column axis, keeping only the features

3.3 Split into training and test sets

from sklearn.datasets import load_iris, fetch_20newsgroups, load_boston
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.25)
data
            row_id       x       y  accuracy    place_id  day  hour  weekday
600            600  1.2214  2.7023        17  6683426742    1    18        3
957            957  1.1832  2.6891        58  6683426742   10     2        5
4345          4345  1.1935  2.6550        11  6889790653    5    15        0
4735          4735  1.1452  2.6074        49  6822359752    6    23        1
5580          5580  1.0089  2.7287        19  1527921905    9    11        4
...            ...     ...     ...       ...         ...  ...   ...      ...
29100203  29100203  1.0129  2.6775        12  3312463746    1    10        3
29108443  29108443  1.1474  2.6840        36  3533177779    7    23        2
29109993  29109993  1.0240  2.7238        62  6424972551    8    15        3
29111539  29111539  1.2032  2.6796        87  3533177779    4     0        6
29112154  29112154  1.1070  2.5419       178  4932578245    8    23        3

16918 rows × 8 columns

#  For now, do not standardize the data; call the KNN algorithm directly to see how well it predicts.
def knn_al():
    knn = KNeighborsClassifier(n_neighbors = 5)
    # fit,predict ,score
    knn.fit(x_train,y_train)
    # Make predictions
    y_predict = knn.predict(x_test)
    print(" The predicted target sign in location is :",y_predict)
    #  Get the accuracy 
    print(" The accuracy of the prediction :",knn.score(x_test,y_test))
if __name__ == "__main__":
    knn_al()
 The predicted target sign in location is : [1479000473 2584530303 2946102544 ... 5606572086 1602053545 1097200869]
 The accuracy of the prediction : 0.029787234042553193
#  To try to improve the accuracy, first drop the row_id feature from data.

data_del_row_id = data.drop(['row_id'],axis =1)
data_del_row_id
               x       y  accuracy    place_id  day  hour  weekday
600       1.2214  2.7023        17  6683426742    1    18        3
957       1.1832  2.6891        58  6683426742   10     2        5
4345      1.1935  2.6550        11  6889790653    5    15        0
4735      1.1452  2.6074        49  6822359752    6    23        1
5580      1.0089  2.7287        19  1527921905    9    11        4
...          ...     ...       ...         ...  ...   ...      ...
29100203  1.0129  2.6775        12  3312463746    1    10        3
29108443  1.1474  2.6840        36  3533177779    7    23        2
29109993  1.0240  2.7238        62  6424972551    8    15        3
29111539  1.2032  2.6796        87  3533177779    4     0        6
29112154  1.1070  2.5419       178  4932578245    8    23        3

16918 rows × 7 columns

y = data_del_row_id["place_id"]
x = data_del_row_id.drop(["place_id"],axis = 1) #  Delete the target value along the direction of the column 
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.25)
if __name__ == "__main__":
    knn_al()
 The predicted target sign in location is : [1097200869 3312463746 9632980559 ... 3533177779 4932578245 1913341282]
 The accuracy of the prediction : 0.0806146572104019

After deleting row_id, the prediction accuracy improves from roughly 0.03 to about 0.08.

#  Next we intended to drop the day feature and try again (but see the note below: y is set to day here)
y = data_del_row_id["day"]
x = data_del_row_id.drop(["place_id"],axis = 1) #  Delete the target value along the direction of the column 
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.25)
if __name__ == "__main__":
    knn_al()
 The predicted target sign in location is : [2 9 4 ... 6 5 9]
 The accuracy of the prediction : 0.810401891252955

Note: the code above actually set the target y to the day column, which is still present among the features in x, rather than dropping the day feature. The 0.8104 "accuracy" therefore measures how well the check-in day can be predicted (trivially, since day is itself a feature) and is not comparable with the place_id accuracies above.

3.4 Feature Engineering ( Standardization )

Let's go back to the processed data, namely data, and standardize the feature values.

3.5 Compute predictions and scores

#  Extract the feature values and target values from the data
y = data['place_id']

x = data.drop(['place_id'], axis=1)

#  Carry out data segmentation, training set, test set 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

#  Feature Engineering ( Standardization )
std = StandardScaler()
#  Standardize the feature values of the training and test sets (fit on train, transform test)
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)
if __name__ == "__main__":
    knn_al()
 The predicted target sign in location is : [6683426742 1435128522 2327054745 ... 2460093296 1435128522 1097200869]
 The accuracy of the prediction : 0.41631205673758864

After standardization, the prediction accuracy improves from about 0.08 to about 0.416.

Next, let's rebuild the feature set from data (row_id is still included) and try again; after that we will drop the row_id and day features.

#  Extract the feature values and target values from the data
x = data.drop("place_id",axis = 1)
x
            row_id       x       y  accuracy  day  hour  weekday
600            600  1.2214  2.7023        17    1    18        3
957            957  1.1832  2.6891        58   10     2        5
4345          4345  1.1935  2.6550        11    5    15        0
4735          4735  1.1452  2.6074        49    6    23        1
5580          5580  1.0089  2.7287        19    9    11        4
...            ...     ...     ...       ...  ...   ...      ...
29100203  29100203  1.0129  2.6775        12    1    10        3
29108443  29108443  1.1474  2.6840        36    7    23        2
29109993  29109993  1.0240  2.7238        62    8    15        3
29111539  29111539  1.2032  2.6796        87    4     0        6
29112154  29112154  1.1070  2.5419       178    8    23        3

16918 rows × 7 columns

#  Carry out data segmentation, training set, test set 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
#  Feature Engineering ( Standardization )
std = StandardScaler()
#  Standardize the feature values of the training and test sets (fit on train, transform test)
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)
if __name__ == "__main__":
    knn_al()
 The predicted target sign in location is : [5270522918 1097200869 3312463746 ... 1097200869 5606572086 1097200869]
 The accuracy of the prediction : 0.40803782505910163
#  Now drop the row_id and day features
x_no_row_id = x.drop(["row_id"],axis =1)
x_no_row_id_and_no_day = x_no_row_id.drop(["day"],axis =1)
x_no_row_id_and_no_day
               x       y  accuracy  hour  weekday
600       1.2214  2.7023        17    18        3
957       1.1832  2.6891        58     2        5
4345      1.1935  2.6550        11    15        0
4735      1.1452  2.6074        49    23        1
5580      1.0089  2.7287        19    11        4
...          ...     ...       ...   ...      ...
29100203  1.0129  2.6775        12    10        3
29108443  1.1474  2.6840        36    23        2
29109993  1.0240  2.7238        62    15        3
29111539  1.2032  2.6796        87     0        6
29112154  1.1070  2.5419       178    23        3

16918 rows × 5 columns

y
600         6683426742
957         6683426742
4345        6889790653
4735        6822359752
5580        1527921905
               ...    
29100203    3312463746
29108443    3533177779
29109993    6424972551
29111539    3533177779
29112154    4932578245
Name: place_id, Length: 16918, dtype: int64
#  Carry out data segmentation, training set, test set 
x_train, x_test, y_train, y_test = train_test_split(x_no_row_id_and_no_day, y, test_size=0.25)
#  Feature Engineering ( Standardization )
std = StandardScaler()
#  Standardize the feature values of the training and test sets (fit on train, transform test)
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)

knn = KNeighborsClassifier(n_neighbors = 5)
# fit, predict, score
knn.fit(x_train,y_train)
# Make predictions
y_predict = knn.predict(x_test)
print(" The predicted target sign in location is :",y_predict)
# Compute the accuracy
print(" The accuracy of the prediction :",knn.score(x_test,y_test))
 The predicted target sign in location is : [6399991653 3533177779 1097200869 ... 2327054745 3992589015 6683426742]
 The accuracy of the prediction : 0.48699763593380613

3.6 KNN algorithm summary

A very small k is easily affected by outliers.
A very large k is easily affected by the class proportions among the k neighbors (sample imbalance).

4. Classification model evaluation (accuracy, precision and recall)

estimator.score()

The most commonly used metric is the accuracy rate, i.e. the percentage of predictions that are correct:

$Accuracy = \frac{TP+TN}{TP+FP+FN+TN}$

Confusion matrix: in a classification task, the predicted condition and the true condition form four different combinations, which make up the confusion matrix (this also applies to multi-class problems):

                   Predicted positive    Predicted negative
Actual positive    TP (true positive)    FN (false negative)
Actual negative    FP (false positive)   TN (true negative)

Precision and Recall

Precision: of the samples predicted to be positive, the proportion that are truly positive (how accurate the positive predictions are).

$Precision = \frac{TP}{TP+FP}$

Recall: of the samples that are truly positive, the proportion that are predicted positive (how completely positive samples are found; the ability to identify positive samples).

$Recall = \frac{TP}{TP+FN}$

Another classification metric, the F1-score, reflects the robustness of the model:

$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} = \frac{2TP}{2TP + FP + FN}$
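As a quick check of these formulas, here is a minimal sketch using sklearn.metrics on made-up labels (not the check-in data used earlier):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true and predicted labels (1 = positive, 0 = negative)
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)                   # 3 1 1 3
print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))         # 2PR / (P + R) = 0.75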

Classification model evaluation API

sklearn.metrics.classification_report

sklearn.metrics.classification_report(y_true, y_pred, target_names=None)

y_true: true target values

y_pred: target values predicted by the estimator

target_names: names of the target categories

return: precision and recall for each category
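A minimal usage sketch with made-up labels and hypothetical category names (the real usage on the 20 newsgroups data appears in section 6.3):

from sklearn.metrics import classification_report

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 2, 2, 2, 1, 0]
print(classification_report(y_true, y_pred, target_names=["cat_a", "cat_b", "cat_c"]))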

5. Cross validation and grid search

Above, we divided the data into a training set and a test set. Now let's set the test set aside and further split the training set.

The training set is divided into a (smaller) training set and a validation set.

Usually there are many parameters that must be specified manually (such as the value of K in the k-nearest neighbors algorithm); these are called hyperparameters. Tuning them by hand is tedious, so we preset several combinations of hyperparameters for the model, evaluate each combination with cross-validation, and finally select the optimal combination to build the model.
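Before adding grid search, plain k-fold cross-validation can be run on its own. A minimal sketch (using the built-in iris dataset as a stand-in, since the check-in data is not reloaded here):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier(n_neighbors=5)

# 5-fold cross validation: the data is split into 5 folds and
# each fold serves once as the validation set
scores = cross_val_score(knn, iris.data, iris.target, cv=5)
print(scores)         # accuracy on each fold
print(scores.mean())  # average validation accuracy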

Super parameter search - The grid search API: sklearn.model_selection.GridSearchCV

sklearn.model_selection.GridSearchCV(estimator, param_grid=None,cv=None)

estimator: Estimator objects

param_grid: estimator parameters (dict), e.g. {"n_neighbors": [1, 3, 5]}

cv: number of folds for cross-validation

fit: fit on the training data

score: accuracy

Result analysis

best_score_: the best score obtained in cross-validation

best_estimator_: the estimator with the best parameters

cv_results_: the validation-fold scores (plus fit/score timings) for every parameter combination across the cross-validation splits

from sklearn.model_selection import train_test_split, GridSearchCV
#  Construct the values of some parameters to search 
param = {"n_neighbors": [1, 3, 5, 7, 10]}

#  Do a grid search 
gc = GridSearchCV(knn, param_grid=param, cv=2)

gc.fit(x_train, y_train)

#  Prediction accuracy 
print(" Accuracy on the test set :", gc.score(x_test, y_test))

print(" The best result in cross validation :", gc.best_score_)

print(" Choosing the best model is :", gc.best_estimator_)
print("*"*100)
print(" The result of each cross validation of each super parameter :", gc.cv_results_)
 Accuracy on the test set : 0.4955082742316785
 The best result in cross validation : 0.45917402269861285
 Choosing the best model is : KNeighborsClassifier(n_neighbors=10)
****************************************************************************************************
 The result of each cross validation of each super parameter : {'mean_fit_time': array([0.00385594, 0.00366092, 0.00310779, 0.00316703, 0.003443  ]), 'std_fit_time': array([4.26769257e-04, 5.06877899e-04, 7.70092010e-05, 4.99486923e-05,
       2.91109085e-04]), 'mean_score_time': array([0.19389665, 0.20236516, 0.21587265, 0.22173393, 0.23718596]), 'std_score_time': array([0.00897849, 0.00262308, 0.00137246, 0.00043309, 0.00201011]), 'param_n_neighbors': masked_array(data=[1, 3, 5, 7, 10],
             mask=[False, False, False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'n_neighbors': 1}, {'n_neighbors': 3}, {'n_neighbors': 5}, {'n_neighbors': 7}, {'n_neighbors': 10}], 'split0_test_score': array([0.41456494, 0.42307692, 0.44435687, 0.44656368, 0.45176545]), 'split1_test_score': array([0.4186633 , 0.43332282, 0.45412989, 0.4612232 , 0.4665826 ]), 'mean_test_score': array([0.41661412, 0.42819987, 0.44924338, 0.45389344, 0.45917402]), 'std_test_score': array([0.00204918, 0.00512295, 0.00488651, 0.00732976, 0.00740858]), 'rank_test_score': array([5, 4, 3, 2, 1], dtype=int32)}

6. Naive Bayes algorithm

$P(C|W)=\frac{P(W|C)\,P(C)}{P(W)}$

Note: W denotes the feature values of a given document (word-frequency statistics provided by the document to be classified), and C denotes a document category.

P(C): the prior probability of each document category (number of documents of that category / total number of documents)

P(W|C): the probability of the features (the words appearing in the document to be classified) given the category

Computed as: P(F1|C) = Ni / N (estimated from the training documents), where

Ni is the number of times the word F1 appears across all documents of category C, and

N is the total number of word occurrences in all documents of category C.
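A quick worked example with made-up counts: suppose the word F1 = "GPU" appears Ni = 20 times across all training documents of category C = "Tech", and all words in those documents occur N = 100 times in total; then P(F1|C) = 20/100 = 0.2. To classify a new document containing the words "GPU" and "driver", naive Bayes computes P(C|W) ∝ P("GPU"|C) · P("driver"|C) · P(C) for every category C and picks the category with the largest value.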

6.1 Laplace smoothing

α is a specified coefficient, usually 1; m is the number of distinct feature words in the training documents.

$P(F1|C)=\frac{N_i+\alpha}{N+\alpha m}$
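A minimal sketch of this estimate with and without Laplace smoothing (the word counts are made up for illustration; this is not the MultinomialNB implementation used in the next section):

# Hypothetical word counts within one category C of training documents
word_counts = {"gpu": 20, "driver": 15, "python": 0, "model": 65}
N = sum(word_counts.values())   # total word occurrences in category C (= 100)
m = len(word_counts)            # number of distinct feature words (= 4)
alpha = 1.0                     # Laplace smoothing coefficient

for word, Ni in word_counts.items():
    p_raw = Ni / N                             # P(F1|C) = Ni / N
    p_smooth = (Ni + alpha) / (N + alpha * m)  # P(F1|C) = (Ni + a) / (N + a*m)
    print(f"{word}: raw={p_raw:.3f}, smoothed={p_smooth:.3f}")

Without smoothing, a word that never appears in a category (here "python") gets probability 0 and would zero out the whole product; with alpha = 1 it receives a small non-zero probability instead.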

6.2 sklearn Naive Bayesian implementation API

sklearn.naive_bayes.MultinomialNB

sklearn.naive_bayes.MultinomialNB(alpha = 1.0)

Naive Bayes classification

alpha: Laplace smoothing coefficient

6.3 Naive Bayes example

Problem description:

(1) Classify the sklearn 20 newsgroups data;

(2) The 20 newsgroups dataset contains roughly 18,000 newsgroup posts on 20 topics.

Naive Bayes example workflow

1. Load the 20-class news data and split it

2. Generate the feature words of the articles

3. Use the naive Bayes estimator to make predictions

def naviebayes():
    """  Naive Bayes for text classification  :return: None """
    news = fetch_20newsgroups(subset='all')

    #  Data segmentation 
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target, test_size=0.25)

    #  Feature extraction of data set 
    tf = TfidfVectorizer()

    #  Compute the tf-idf importance of each word in every article, using the vocabulary learned from the training set (e.g. ['a','b','c','d'])
    x_train = tf.fit_transform(x_train)

    print(tf.get_feature_names_out())
    print("*"*50)
    x_test = tf.transform(x_test)

    #  The prediction of naive Bayesian algorithm 
    mlt = MultinomialNB(alpha=1.0)
    
    print(x_train.toarray())
    print("*"*50)
    mlt.fit(x_train, y_train)

    y_predict = mlt.predict(x_test)

    print(" The predicted article category is :", y_predict)
    print("*"*50)
    #  Get the accuracy 
    print(" Accuracy rate is :", mlt.score(x_test, y_test))
    print("*"*50)
    print(" Accuracy rate and recall rate of each category :", classification_report(y_test, y_predict, target_names=news.target_names))
    print("*"*50)
    return None
if __name__ =="__main__":
    naviebayes()
['00' '000' '0000' ... 'óáíïìåô' 'ýé' 'ÿhooked']
**************************************************
[[0.         0.02654538 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]
**************************************************
 The predicted article category is : [ 5  2 17 ...  1 13  7]
**************************************************
 Accuracy rate is : 0.8612054329371817
**************************************************
 Accuracy rate and recall rate of each category :                           precision    recall  f1-score   support

             alt.atheism       0.88      0.80      0.84       200
           comp.graphics       0.88      0.79      0.83       241
 comp.os.ms-windows.misc       0.89      0.78      0.83       254
comp.sys.ibm.pc.hardware       0.76      0.87      0.81       245
   comp.sys.mac.hardware       0.84      0.90      0.86       229
          comp.windows.x       0.90      0.85      0.88       245
            misc.forsale       0.93      0.67      0.78       241
               rec.autos       0.91      0.92      0.92       263
         rec.motorcycles       0.94      0.95      0.94       265
      rec.sport.baseball       0.94      0.95      0.95       237
        rec.sport.hockey       0.91      0.98      0.94       238
               sci.crypt       0.79      0.98      0.88       259
         sci.electronics       0.91      0.82      0.86       238
                 sci.med       0.98      0.90      0.94       239
               sci.space       0.87      0.97      0.92       249
  soc.religion.christian       0.62      0.98      0.76       260
      talk.politics.guns       0.80      0.95      0.87       230
   talk.politics.mideast       0.92      0.98      0.95       230
      talk.politics.misc       1.00      0.65      0.79       196
      talk.religion.misc       0.97      0.23      0.37       153

                accuracy                           0.86      4712
               macro avg       0.88      0.85      0.85      4712
            weighted avg       0.88      0.86      0.86      4712

**************************************************
Copyright notice: this article was written by [Bayesian grandson]. When reposting, please include a link to the original: https://yzsam.com/2022/186/202207051839178562.html