
Machine learning notes - bird species classification using machine learning

2022-07-07 03:43:00 Sit and watch the clouds rise

I. Problem summary

Scientists have determined that a known bird should be divided into three different and independent species. These species are endemic to specific areas of the country, and their populations must be tracked and estimated as accurately as possible.

Therefore, a non-profit conservation association has taken on this task. They need to be able to record the species they encounter based on characteristics observed by field officials.

Using some genetic characteristics and location data, can you predict the species of the birds that have been observed?

This is a beginner-level practice competition; your goal is to predict the species of a bird from its attributes and location.

Bird Species Classification Challenge | bitgrit (bitgrit hosts data science competitions for all levels): https://bitgrit.net/competition/16

II. Dataset

The data has already been split into training and test sets. In each set, you get data on birds from locations 1 to 3.

        Dataset download address

 link :https://pan.baidu.com/s/1aalzQNr0IQLQc3X4JTu9nQ 
 Extraction code :xvy0

Let's take a look at the first five rows of training_set.csv:

bill_depth   bill_length  wing_length  location  mass  sex  ID
14.3         48.2         210          loc_2     4600  0    284
14.4         48.4         203          loc_2     4625  0    101
18.4         NA           200          loc_3     3400  0    400
14.98211382  47.50487805  NA           NA        4800  0    98
18.98211382  38.25930705  217.1869919  loc_3     5200  0    103

training_set and training_target are joined on the 'ID' column.

The columns have the following meanings:

species     : bird species (A, B, C)
bill_length : bill length (mm)
bill_depth  : bill depth (mm)
wing_length : wing length (mm)
mass        : body mass (g)
location    : island (loc_1, loc_2, loc_3)
sex         : sex (0: male; 1: female; NA: unknown)

III. Code

1. Import libraries

import pandas as pd

# plotting
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
matplotlib.rcParams['figure.dpi'] = 100
sns.set(rc={'figure.figsize':(11.7,8.27)})
sns.set(style="whitegrid")
%matplotlib inline

# ml
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

2. Helper function for missing values

def missing_vals(df):
    """Print columns with their percentage of missing values."""
    missing = [
        (df.columns[idx], perc)
        for idx, perc in enumerate(df.isna().mean() * 100)
        if perc > 0
    ]

    if len(missing) == 0:
        return "no missing values"

    # sort descending by percentage
    missing.sort(key=lambda x: x[1], reverse=True)

    print(f"There are a total of {len(missing)} variables with missing values\n")

    for col, perc in missing:
        print(f"{col:<20} => {round(perc, 3)}%")
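A quick sanity check of the helper on a toy frame (hypothetical data, just to show the output format):

import numpy as np

toy = pd.DataFrame({"a": [1, np.nan, 3], "b": [np.nan, np.nan, 1]})
missing_vals(toy)
# There are a total of 2 variables with missing values
#
# b                    => 66.667%
# a                    => 33.333%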

3. Load data

First, we use the read_csv function to load the training and test data.

We also merge training_set.csv (which contains the features) with training_target.csv (which contains the target variable) to form the training data.

train = pd.read_csv("dataset/training_set/training_set.csv")
labels = pd.read_csv("dataset/training_set/training_target.csv")

# join target variable to training set
train = train.merge(labels, on="ID")

test = pd.read_csv("dataset/test_set/test_set.csv")

target_cols = "species"
num_cols = ["bill_depth", "bill_length", "wing_length", "mass"]
cat_cols = ["location", "sex"]
all_cols = num_cols + cat_cols + [target_cols]

train = train[all_cols]
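A quick sanity check that the merge kept one row per bird; the expected shape matches the info() output in the next section:

print(train.shape)   # (435, 7)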

4. Exploratory Data Analysis (EDA)

This is where we study trends and patterns in the data, both numerical and categorical.

train.info()

Using the info function, we can see the number of rows and the data types.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 435 entries, 0 to 434
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   bill_depth   434 non-null    float64
 1   bill_length  295 non-null    float64
 2   wing_length  298 non-null    float64
 3   mass         433 non-null    float64
 4   location     405 non-null    object 
 5   sex          379 non-null    float64
 6   species      435 non-null    object 
dtypes: float64(5), object(2)
memory usage: 27.2+ KB

        Numerical

Let's draw histograms of the numerical variables.

train[num_cols].hist(figsize=(20, 14));

bill_depth peaks around 15 and 19
bill_length peaks around 39 and 47
wing_length peaks around 190 and 216
mass is right-skewed (see the quick check below)
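To back the last observation, we can confirm the skew numerically (a quick check; pandas skips NaN by default):

print(train["mass"].skew())   # positive => right-skewed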

        Categorical

to_plot = cat_cols + [target_cols]
fig, axes = plt.subplots(1, 3, figsize=(20, 7), dpi=100)

for i, col_name in enumerate(train[to_plot].columns):
    sns.countplot(x = col_name, data = train, palette="Set1", ax=axes[i % 3])
    axes[i % 3].set_title(f"{col_name}", fontsize=13)
    plt.subplots_adjust(hspace=0.45)

We see that location and species appear to correspond (loc_2 with species C, loc_3 with species A). We also see slightly more female (1) birds than male.

train.species.value_counts()
C    182
A    160
B     93
Name: species, dtype: int64

Looking closely, we find that the target variable is imbalanced: class B has nearly 100 fewer samples than class C and about 70 fewer than class A.

Imbalanced classes are a problem because they bias the model toward the classes with more samples, i.e. C will be predicted more often than B.
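One common mitigation (a sketch of an alternative, not used in the rest of this post) is to give rarer classes like B proportionally more weight via the classifier's class_weight parameter:

from sklearn.tree import DecisionTreeClassifier

# weights inversely proportional to class frequencies
balanced_tree = DecisionTreeClassifier(max_depth=2, class_weight="balanced")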

5. Missing data

         Percentage of missing values

missing_vals(train)
There are a total of 6 variables with missing values

bill_length          => 32.184%
wing_length          => 31.494%
sex                  => 12.874%
location             => 6.897%
mass                 => 0.46%
bill_depth           => 0.23%

With our helper function, we find that bill_length and wing_length each have more than 30% missing values.

Heatmap

plt.figure(figsize=(10, 6))
sns.heatmap(train.isnull(), yticklabels=False, cmap='viridis', cbar=False);

We can also draw a heatmap to visualize the missing values and see whether there are any patterns.

Imputing the categorical columns

Let's first look at how many values are missing in our categorical variables.

train.sex.value_counts(dropna=False)
1.0    195
0.0    184
NaN     56
Name: sex, dtype: int64
train.location.value_counts(dropna=False)
loc_2    181
loc_3    141
loc_1     83
NaN       30
Name: location, dtype: int64

Let's handle them with SimpleImputer, replacing missing values with the most frequent value.

cat_imp = SimpleImputer(strategy="most_frequent")

train[cat_cols] = cat_imp.fit_transform(train[cat_cols])

To confirm, there are no missing values left. As you can see, with the "most_frequent" strategy, the missing values were imputed as 1.0, the most frequent value.

train.sex.value_counts(dropna=False)
1.0    251
0.0    184
Name: sex, dtype: int64
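We can also confirm this programmatically with a one-line check:

assert train[cat_cols].isna().sum().sum() == 0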

Imputing the numerical columns

Let's impute the numerical columns with the median.

num_imp = SimpleImputer(strategy="median")

train[num_cols] = num_imp.fit_transform(train[num_cols])
missing_vals(train)
'no missing values'

6. Feature engineering


Encoding categorical variables

With a label encoder, we can encode the categorical variables (and the target variable) as numbers. We do this because most ML models cannot work with string values.

le = LabelEncoder()
le.fit(train['species'])
le_name_map = dict(zip(le.classes_, le.transform(le.classes_)))
le_name_map
{'A': 0, 'B': 1, 'C': 2}

We first fit the encoder to the variable to see what the mapping looks like; later we can invert this mapping.

train['species'] = le.fit_transform(train['species'])
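The fitted encoder can also invert the mapping directly, which is what we will rely on at the end to turn predictions back into letters:

le.inverse_transform([0, 1, 2])   # ['A', 'B', 'C']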

For the other columns that hold string (non-numeric) values, we apply the same encoding.

for col in cat_cols:
    if train[col].dtype == "object":
        # note: reusing one LabelEncoder refits it for each column; that is fine
        # here because we only need to invert the species mapping saved above
        train[col] = le.fit_transform(train[col])
train.head()

# Convert cat_features to pd.Categorical dtype
for col in cat_cols:
    train[col] = pd.Categorical(train[col])

We also convert the categorical features to the pd.Categorical dtype.

train.dtypes
bill_depth      float64
bill_length     float64
wing_length     float64
mass            float64
location       category
sex            category
species           int64
dtype: object

7. Creating new features

train['b_depth_length_ratio'] = train['bill_depth'] / train['bill_length']
train['b_length_depth_ratio'] = train['bill_length'] / train['bill_depth']
train['w_length_mass_ratio'] = train['wing_length'] / train['mass']

Here, we create some ratio features by dividing one variable by another.

train.head()
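Since these features come from division, a quick guard (a sketch) confirms that no zero denominator produced an infinity:

import numpy as np

ratio_cols = ["b_depth_length_ratio", "b_length_depth_ratio", "w_length_mass_ratio"]
assert np.isfinite(train[ratio_cols]).all().all()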

8. Modeling

Train/test split

Now it's time to build the model. We first split the data into X (features) and y (target variable), then split those into a training set and an evaluation set.

The training set is where we train the model; the evaluation set is where we test it before predicting on the test set.

X, y = train.drop(["species"], axis=1), train[["species"]].values.flatten()

X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=0)
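Because the classes are imbalanced, a stratified split (an alternative to the call above, not what this post uses) would keep the class proportions equal across both sets:

X_train_s, X_eval_s, y_train_s, y_eval_s = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)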

         Simple decision tree classifier

Here, we fit a baseline model with a single simple hyperparameter: max_depth = 2.

dtree_model = DecisionTreeClassifier(max_depth = 2).fit(X_train, y_train)

After fitting the data, we can use the model to make predictions.

dtree_pred = dtree_model.predict(X_eval)

9. Model performance

print(classification_report(y_eval, dtree_pred))
              precision    recall  f1-score   support

           0       0.70      1.00      0.82        40
           1       0.92      0.71      0.80        17
           2       1.00      0.75      0.86        52

    accuracy                           0.83       109
   macro avg       0.87      0.82      0.83       109
weighted avg       0.88      0.83      0.84       109

The classification report shows useful metrics for evaluating the classifier.

For example, our model's macro-averaged f1-score is 0.83.
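As a worked example of where these numbers come from, the f1-score is the harmonic mean of precision and recall; for class 1 (B) above:

p, r = 0.92, 0.71
f1 = 2 * p * r / (p + r)   # ≈ 0.80, matching the report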

10. Confusion matrix

We can also build a confusion matrix to visualize what our classifier gets right and where it goes wrong.

# save the target variable classes
class_names = le_name_map.keys()

titles_options = [
    ("Confusion matrix, without normalization", None),
    ("Normalized confusion matrix", "true"),
]
for title, normalize in titles_options:
    fig, ax = plt.subplots(figsize=(8, 8))

    disp = ConfusionMatrixDisplay.from_estimator(
        dtree_model,
        X_eval,
        y_eval,
        display_labels=class_names,
        cmap=plt.cm.Blues,
        normalize=normalize,
        ax = ax
    )
    disp.ax_.set_title(title)
    disp.ax_.grid(False)

    print(title)
    print(disp.confusion_matrix)
Confusion matrix, without normalization
[[40  0  0]
 [ 5 12  0]
 [12  1 39]]
Normalized confusion matrix
[[1.         0.         0.        ]
 [0.29411765 0.70588235 0.        ]
 [0.23076923 0.01923077 0.75      ]]

The confusion matrix shows that the model predicts classes A and C more often, which is not surprising since we have more samples of them.

It also shows that the model often predicts A when the true class is B or C.

11. Feature importance

feature_imp = pd.DataFrame(
    sorted(zip(dtree_model.feature_importances_, X.columns)),
    columns=['Value', 'Feature'],
)

plt.figure(figsize=(20, 15))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
plt.title('Decision Tree Feature Importances')
plt.tight_layout()
# plt.savefig('dtree_fimp.png')
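The raw numbers behind the plot can also be printed directly:

print(dict(zip(X.columns, dtree_model.feature_importances_)))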

In terms of feature importance, mass seems to be the strongest predictor of species, followed by bill_length. The importance of the other variables in this classifier appears to be zero.

fig = plt.figure(figsize=(25,20))
_ = tree.plot_tree(dtree_model, 
                   feature_names=X.columns,  
                   class_names=list(class_names),
                   filled=True)

We can see where these feature importances come from by visualizing the decision tree classifier.

At the root node, if mass is below roughly 4600, the tree checks bill_length; otherwise it checks bill_depth, and the class is predicted at the leaf.
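The same decision rules can be read as plain text with scikit-learn's export_text, a convenient companion to the plot:

print(tree.export_text(dtree_model, feature_names=list(X.columns)))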

IV. Predicting on the test data

Before fitting the model to the test data, we apply the same preprocessing and feature engineering that we applied to the training data.
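Note that the code below re-fits the imputers and the encoder on the test set, mirroring the original post. A more leakage-safe variant (a sketch, reusing the objects already fitted on the training data) would call transform instead of fit_transform:

test[cat_cols] = cat_imp.transform(test[cat_cols])
test[num_cols] = num_imp.transform(test[num_cols])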

le = LabelEncoder()

cat_imp = SimpleImputer(strategy="most_frequent")
num_imp = SimpleImputer(strategy="median")

test[cat_cols] = cat_imp.fit_transform(test[cat_cols])
test[num_cols] = num_imp.fit_transform(test[num_cols])

for col in cat_cols:
    if test[col].dtype == "object":
        test[col] = le.fit_transform(test[col])

# Convert cat_features to pd.Categorical dtype
for col in cat_cols:
    test[col] = pd.Categorical(test[col])

# save ID column
test_id = test["ID"]

all_cols.remove('species')
test = test[all_cols]

test['b_depth_length_ratio'] = test['bill_depth'] / test['bill_length']
test['b_length_depth_ratio'] = test['bill_length'] / test['bill_depth']
test['w_length_mass_ratio'] = test['wing_length'] / test['mass']
test_preds = dtree_model.predict(test)
submission_df = pd.concat([test_id, pd.DataFrame(test_preds, columns=['species'])], axis=1)
submission_df.head()
   ID  species
0   2        2
1   5        0
2   7        0
3   8        0
4   9        0

Note that the species values are numbers; we have to convert them back to string values. We can do that with the fitted label encoder.

le_name_map
{'A': 0, 'B': 1, 'C': 2}
inv_map = {v: k for k, v in le_name_map.items()}
inv_map
{0: 'A', 1: 'B', 2: 'C'}
submission_df['species'] = submission_df['species'].map(inv_map)  
submission_df.head()
   ID  species
0   2        C
1   5        A
2   7        A
3   8        A
4   9        A
submission_df.to_csv('solution.csv', index=False)

Finally, we write the data frame to a CSV file.


Copyright notice: this article was written by [Sit and watch the clouds rise]; please include a link to the original when reposting: https://yzsam.com/2022/188/202207062046063119.html