
iDataCourse Experiment | Part 9: Intelligent Health Diagnosis with Machine Learning

2022-06-27 20:59:00 | Data Science & Artificial Intelligence

iDataCourse: idatacourse.cn

Field: Medical care

Brief introduction: The burden of chronic liver disease in India has been very high in recent years; in 2017, liver cirrhosis caused nearly 220,000 deaths. Chronic liver disease can also lead to serious superimposed infections and acute-on-chronic liver failure, and it increases the rates of fulminant hepatic failure and mortality. In this case study, we perform an exploratory analysis of the indicators that affect the incidence of liver disease in India and build machine learning classification models for automatic, intelligent diagnosis of liver disease.

Data:

./dataset/Indian Liver Patient Dataset (ILPD).csv


1. Data reading and preprocessing

1.1 Data set profile

The Indian Liver Patient Dataset contains 416 records of patients with liver disease and 167 records of patients without liver disease, collected in the northeast of Andhra Pradesh, India. The label column is the class label used to divide the records into two groups (with or without liver disease). The dataset contains 441 records of male patients and 142 records of female patients. Dataset link: http://www.idatascience.cn/dataset-detail?table_id=88

The specific meanings of the dataset's columns are as follows:

Name        Data type   Definition
Age         Integer     The patient's age
Gender      String      The patient's gender
TB          Float       Total bilirubin
DB          Float       Direct bilirubin
Alkphos     Integer     Alkaline phosphatase
Sgpt        Integer     Alanine aminotransferase
Sgot        Integer     Aspartate aminotransferase
TP          Float       Total protein
ALB         Float       Albumin
A/G Ratio   Float       Albumin to globulin ratio
label       Integer     Disease status (1 = liver disease, 2 = no liver disease)

1.2 Import data

First, import the required Python packages.

import pandas as pd
import numpy as np

#  Do not display warnings 
import warnings
warnings.filterwarnings('ignore')

Read the dataset and view the attribute information for each column:

data = pd.read_csv("./dataset/Indian Liver Patient Dataset (ILPD).csv")
data.head()
#  View data information 
data.info()

Here we can see that A/G Ratio is a floating-point field with 4 missing values, which we will need to fill in. Gender is the only object-type feature, so we need to convert it to a numeric type beforehand to make modeling easier.

Use .describe() to compute summary statistics over the whole dataset and view the common statistical characteristics of the data.

data.describe(include='all')

From this we can see that Gender is a character variable while the remaining columns are numeric. The maximum values of Sgpt, Sgot, TB and DB are tens of times larger than their means, which shows that these features have very spread-out (highly skewed) distributions.

1.3 Numeric encoding

The Gender column is a string; to make it easier to feed into and process in the subsequent models, we encode Gender numerically.

from sklearn.preprocessing import LabelEncoder
# Label encoder
le = LabelEncoder()

# Encode Gender numerically
data["Gender"] = le.fit_transform(data["Gender"])

data

After encoding, female (Female) is encoded as 0 and male (Male) as 1.

1.4 Missing value filling

After importing the data and viewing its information, we found that A/G Ratio has missing values that need to be filled in. Here we fill them with the column mean.

from sklearn.impute import SimpleImputer

#  Use the mean filling method 
imp = SimpleImputer(strategy="mean")
data["A/G Ratio"] = imp.fit_transform(data["A/G Ratio"].to_frame())

data.describe()

After imputation, the mean of the A/G Ratio column is unchanged at 0.947064, the same as before the missing values were filled, but the count has increased from 579 to 583.

1.5 Converting the label values

Because the positive class used in the subsequent processing with LogisticRegression should be labeled 1 and the negative class 0, while the label column of this dataset takes the values 1 and 2, we first need to convert it.

# Define the label column and the conversion constant
calculate_col = "label"
calculate_value = 2

# Map label values from {1, 2} to {1, 0} and store them in a new column label_cal
data[calculate_col + '_cal'] = calculate_value - data[calculate_col]
# Drop the original label column
data = data.drop(labels='label', axis=1)

# View the converted dataset
data

After the conversion, "diseased" has label 1 (the positive class) and "not diseased" has label 0 (the negative class), which is convenient for the subsequent modeling.
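Before exploring the data, it is worth a quick look at the class balance of the new label (a small sketch reusing the data frame from the steps above); the imbalance it reveals also motivates the class weights used in the models later.

# Check the class balance of the converted label
print(data['label_cal'].value_counts())
print(data['label_cal'].value_counts(normalize=True).round(3))  # class proportions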

2. Exploring correlations in the data

2.1 Age distribution histogram and gender count plot

First, we look at Age and Gender, drawing a histogram of the age distribution and a count plot of the number of each gender.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
plt.rcParams['font.sans-serif'] = ['SimHei']  #  Used to display Chinese labels normally 
plt.rcParams['axes.unicode_minus'] = False  #  Used to display negative sign normally 

# Define the gender tick labels
genders = ['Female', 'Male']
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(12, 5))

# Histogram of the age distribution
sns.histplot(data['Age'], kde=False, ax=ax1)
# Count plot of the gender distribution
sns.countplot(x='Gender', data=data, palette="Set2", ax=ax2)

ax1.set(xlabel='Age', ylabel='Count')
ax2.set(xlabel='Gender', ylabel='Count')
ax2.set_xticklabels(genders)

As the figures above show, middle-aged people and men account for the larger shares of the samples in our dataset.

2.2 Age, gender and disease distribution

We give a general view of how Age and Gender relate to liver disease status, using a box plot and a count plot.

# Create the two subplots
fig, [ax1, ax2] = plt.subplots(1, 2, figsize=(12, 5))

# Box plot of age by disease status
sns.boxplot(x='label_cal', y='Age', hue='label_cal', data=data, palette="Set2", ax=ax1)
# Count plot of gender by disease status
sns.countplot(x='Gender', hue='label_cal', data=data, palette="Set2", ax=ax2)

# Titles
ax1.set_title('Age vs. disease status', fontsize=13)
ax2.set_title('Gender vs. disease status', fontsize=13)

# Axis labels and tick labels
labels = ['No disease', 'Disease']
ax1.set(xlabel='Disease status', ylabel='Age')
ax1.set_xticklabels(labels)
ax1.legend_.remove()
ax2.set(xlabel='Gender', ylabel='Count')
ax2.set_xticklabels(genders)
ax2.legend(['No disease', 'Disease'])

plt.show()

These two plots show that, within the sample, liver disease skews toward older people, and that the proportion of patients is higher among men.

2.3 Correlation among indicators

Next, we compute the linear (Pearson) correlation coefficients between the continuous variables and visualize them with a heat map (heatmap).

def correlation_heatmap(df):
    fig, ax = plt.subplots(figsize=(12, 8))

    # Draw the heat map
    hm = sns.heatmap(
        df.corr(),
        cmap='Blues',
        square=True,
        cbar_kws={'shrink': .9},
        ax=ax,
        annot=True,
        linewidths=0.1, vmax=1.0, linecolor='white',
        annot_kws={'fontsize': 12}
    )

    plt.title('Pearson correlation coefficients between continuous variables', y=1.05, size=15)

# Drop the Gender and label_cal columns
df = data.drop(['Gender', 'label_cal'], axis=1)
correlation_heatmap(df)

In the correlation plot, the darker the color, the stronger the positive correlation between two features; the lighter the color, the stronger the negative correlation. Total bilirubin (TB) and direct bilirubin (DB) have a correlation coefficient of 0.87; both measure bilirubin in the blood. Alanine aminotransferase (Sgpt) and aspartate aminotransferase (Sgot) have a correlation coefficient of 0.79; both measure serum enzyme levels in the blood. Albumin (ALB) vs. total protein (TP) and albumin (ALB) vs. the albumin/globulin ratio (A/G Ratio) both have correlation coefficients above 0.6; all of them measure serum protein levels. Overall there are strong correlations between features, so feature selection deserves particular attention when modeling later.
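As a hedged illustration of one possible feature-selection step (this case study later relies on PCA instead), the short sketch below lists the feature pairs whose absolute Pearson correlation exceeds 0.75; one member of each such pair could be dropped if a simpler model were desired.

# Sketch: list feature pairs with |corr| > 0.75, reusing the df built for the heat map
corr = df.corr().abs()
high_pairs = [(a, b, round(corr.loc[a, b], 2))
              for i, a in enumerate(corr.columns)
              for b in corr.columns[i + 1:]
              if corr.loc[a, b] > 0.75]
print(high_pairs)  # expected to include pairs such as (TB, DB) and (Sgpt, Sgot)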

For the two discrete variables Gender and label_cal, we use a chi-square test to examine their association, via the chi2 method in sklearn.feature_selection:

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Define two new data frames
gender = pd.DataFrame(data['Gender'])
label_cal = pd.DataFrame(data['label_cal'])

# Run the chi-square test
model1 = SelectKBest(chi2, k=1)  # select the k best features
model1.fit_transform(gender, label_cal)

# Print the score and p-value
print('The chi-square score between gender and disease status is: %.4f' % model1.scores_[0])
print('The p-value between gender and disease status is: %.4f' % model1.pvalues_[0])

We can see that the p-value is 0.3261, far greater than 0.05, which indicates no statistically significant association between gender and disease status; that is, gender is not significantly related to whether a patient is ill.

We then compute the association between label_cal and the other (continuous) variables, using the f_classif method in sklearn.feature_selection:

from sklearn.feature_selection import f_classif


# Drop the Gender and label_cal columns, keeping the continuous variables
fdata = pd.DataFrame(data.drop(['Gender', 'label_cal'], axis=1))
label_cal = pd.DataFrame(data['label_cal'])

# Use f_classif to measure the association between label_cal and each continuous variable
F, p_val = f_classif(fdata, label_cal)

print('Names of the continuous variables:')
print(fdata.columns.tolist())
print('F values between the continuous variables and disease status:')
print(F)
print('p-values between the continuous variables and disease status:')
print(p_val)

The computed p-values show that, except for TP whose p-value of 0.398 is greater than 0.05, every continuous variable has a significant association with label_cal.

3. Building a classification model

Whether a patient has liver disease is a binary classification problem, so we will model the data with logistic regression, decision tree and random forest methods.

3.1 Splitting into training and test sets

We split the dataset with stratification on label_cal, using a test : train ratio of 20% : 80%.

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
x = data.drop(['label_cal'], axis=1)  # x: the features, with label_cal dropped
y = data['label_cal']                 # y: the label_cal column
# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=2, stratify=data['label_cal'])

3.2 Logistic regression

Logistic regression is a generalized linear model: on top of linear regression it applies the logistic (sigmoid) function to map the continuous output into the interval (0, 1), thereby turning regression into a classifier.
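As a quick illustration (a sketch, not part of the original analysis), the logistic (sigmoid) function 1 / (1 + e^(-z)) squashes any real-valued score z into (0, 1), which can then be read as the probability of the positive class:

# Sketch: the logistic (sigmoid) function maps any real score into (0, 1)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # approximately [0.018, 0.5, 0.982]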

In Python, we use LogisticRegression from sklearn.linear_model for the classification model. The main parameters used are:

  • penalty: can be set to l1 or l2, representing L1 or L2 regularization; the default is l2.
  • class_weight: specifies the weight of each class of samples, mainly to prevent the trained model from being biased toward classes that are over-represented in the training set.
  • random_state: the random seed; setting it to a constant ensures the same result on every run.

We build the model with these parameters, where class_weight is a weight mapping that we set ourselves:

from sklearn.linear_model import LogisticRegression

# Assign the class weights
weights = {0: 1, 1: 1.3}

# Fit the logistic regression
lr = LogisticRegression(penalty='l2', random_state=8, class_weight=weights)
lr.fit(x_train, y_train)

# Predict on the test set
y_predprb = lr.predict_proba(x_test)[:, 1]
y_pred = lr.predict(x_test)
from sklearn import metrics
from sklearn.metrics import auc

# Compute fpr, tpr and the thresholds
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_predprb)
# Compute the G-mean for each threshold
gmean = np.sqrt(tpr * (1 - fpr))

# Find the threshold corresponding to the largest G-mean
dictionary = dict(zip(thresholds, gmean))
max_thresholds = max(dictionary, key=dictionary.get)

print("The maximal G-mean is: %.4f" % (max(gmean)))
print("The threshold at the maximal G-mean is: %.4f" % (max_thresholds))
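The code above finds the threshold that maximizes the G-mean but does not apply it; as a small optional sketch, the probability predictions could be re-thresholded at that value to trade precision against recall:

# Sketch: apply the G-mean-optimal threshold to the predicted probabilities
y_pred_gmean = (y_predprb >= max_thresholds).astype(int)
print(classification_report(y_test, y_pred_gmean))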

Compute the AUC, print the classification report and plot the confusion matrix:

from sklearn.metrics import roc_auc_score
# Compute the AUC
test_roc_auc = roc_auc_score(y_test, y_predprb)
print(test_roc_auc)
# Print the model's classification report
print(classification_report(y_test, y_pred))
# Plot the confusion matrix heat map
cm1 = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm1, annot=True, linewidths=.5, square=True, cmap='Blues')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
all_sample_title = 'ROC AUC Score: {0}'.format(round(test_roc_auc,2))
plt.title(all_sample_title, size=15)

We can see that, on the test set, the trained logistic regression model reaches a recall of 0.93 and a precision of 0.71 for the disease class (label_cal=1), with an overall average F1 score of 0.45, making it a model of only average classification quality.

3.3 Decision tree

We use the DecisionTreeClassifier from sklearn to train the decision tree model. The main parameters used are:

  • max_depth: limits the maximum depth of the decision tree; a good value is usually chosen after several trials.
  • class_weight: specifies the weight of each class of samples, mainly to prevent the trained tree from being biased toward classes that are over-represented in the training set. When set to 'balanced', classes with fewer samples automatically receive higher weights.
  • random_state: the random seed; setting it to a constant ensures the same result on every run.
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

#  Build a decision tree model 
model = DecisionTreeClassifier(random_state=5, class_weight=weights)
model = model.fit(x_train, y_train)

# Predict on the test set
y_predict = model.predict(x_test)
y_predprb = model.predict_proba(x_test)[:, 1]

We evaluate the decision tree with the same binary classification metrics, printing the classification report and plotting the confusion matrix:

# Compute the AUC
test_roc_auc = roc_auc_score(y_test, y_predprb)
print(test_roc_auc)
# Print the model's classification report
print(classification_report(y_test, y_predict))
# Plot the confusion matrix heat map
cm2 = confusion_matrix(y_test, y_predict)
plt.figure(figsize=(9, 9))
sns.heatmap(cm2, annot=True, linewidths=.5, square=True, cmap='Blues')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
all_sample_title = 'ROC AUC score: {0}'.format(round(test_roc_auc,2))
plt.title(all_sample_title, size=15)

We can see that, similar to the logistic regression model, the trained decision tree reaches a recall of 0.71 and a precision of 0.77 for the disease class (label_cal=1) on the test set, with an overall average F1 score of 0.58, again an average level of classification.

3.4 Random forest

Random forest is an ensemble model: samples and features are drawn from the data at random to train many different decision trees, which together form the "forest". Each tree gives its own classification opinion, and the trees "vote" on the final prediction.
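As a toy illustration of "voting" (a sketch only; sklearn's RandomForestClassifier actually averages the trees' predicted probabilities rather than counting hard votes), imagine five hypothetical trees each predicting a class for one patient:

# Toy sketch: majority vote over five hypothetical trees for a single sample
tree_votes = np.array([1, 0, 1, 1, 0])            # each tree's predicted class
majority = int(np.bincount(tree_votes).argmax())  # the class with the most votes
print(majority)                                   # -> 1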

In Python, we use RandomForestClassifier from sklearn.ensemble for the classification model. The main parameters used are:

  • n_estimators: the number of trees to train; the default is 100.
  • max_depth: the maximum depth of each tree; by default the depth is not limited.
  • random_state: the random seed; setting it to a constant ensures the same result on every run.
  • class_weight: specifies the weight of each class of samples, mainly to prevent the trained model from being biased toward classes that are over-represented in the training set.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

#  Build a random forest model 
ran_for = RandomForestClassifier(n_estimators=80, random_state=0, class_weight=weights)
ran_for.fit(x_train, y_train)

# Predict on the test set
y_pred_ran = ran_for.predict(x_test)
y_predprb = ran_for.predict_proba(x_test)[:, 1]
# Compute the AUC
test_roc_auc = roc_auc_score(y_test, y_predprb)
print(test_roc_auc)
# Print the model's classification report
print(classification_report(y_test, y_pred_ran, digits=2))
# Plot the confusion matrix heat map
cm3 = confusion_matrix(y_test, y_pred_ran)
plt.figure(figsize=(9, 9))
sns.heatmap(cm3, annot=True, linewidths=.5, square=True, cmap='Blues')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
all_sample_title = 'ROC AUC score: {0}'.format(round(test_roc_auc,2))
plt.title(all_sample_title, size=15)

The ensemble model improves our classification performance, although the gain is limited: on the test set the trained random forest reaches a recall of 0.80 and a precision of 0.77 for the disease class (label_cal=1), with an overall average F1 score of 0.61.

We can conclude that the random forest model classifies better than logistic regression and the decision tree.
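This conclusion rests on a single train/test split; as a hedged sketch (reusing the x, y and weights defined above), a 5-fold cross-validated comparison of the three models would make it more robust:

# Sketch: 5-fold cross-validated ROC AUC for the three models
from sklearn.model_selection import cross_val_score

models = {
    'LogisticRegression': LogisticRegression(penalty='l2', random_state=8, class_weight=weights),
    'DecisionTree': DecisionTreeClassifier(random_state=5, class_weight=weights),
    'RandomForest': RandomForestClassifier(n_estimators=80, random_state=0, class_weight=weights),
}
for name, m in models.items():
    scores = cross_val_score(m, x, y, cv=5, scoring='roc_auc')
    print('%s: mean ROC AUC = %.3f' % (name, scores.mean()))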

3.5 Principal component analysis

PCA is a common dimensionality reduction method whose aim is to convert high-dimensional data into low-dimensional data while losing as little "information" as possible, thereby reducing the amount of computation. PCA is usually used for the exploration and visualization of high-dimensional datasets, and can also be used for data compression, data preprocessing, and so on.

In addition to the models built above, and because the data has many features, we use PCA (principal component analysis) to reduce its dimensionality.

Principal component analysis must start from a table of variables on comparable scales. Because the total variance of the variables is distributed over the eigenvalues, the variables must share the same physical units for their variances to be meaningful (the unit of a variance is the square of the variable's unit). PCA can also be applied to dimensionless data, such as standardized or log-transformed data. So before building the model we need to standardize the data. Common standardization methods include min-max standardization and z-score standardization; in this case we directly adopt z-score standardization.
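As a quick illustration of z-score standardization (a sketch with a hypothetical column x_demo, equivalent to what preprocessing.scale does below): each value is shifted by the column mean and divided by the column standard deviation, so every standardized column ends up with mean 0 and standard deviation 1.

# Sketch: manual z-score standardization of a hypothetical column x_demo
x_demo = np.array([10.0, 20.0, 30.0, 40.0])
z = (x_demo - x_demo.mean()) / x_demo.std()  # population std, matching preprocessing.scale
print(z.mean(), z.std())                     # -> 0.0 1.0 (up to floating-point error)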

Standardize the data:

from sklearn import preprocessing
X = data.iloc[:, :-1]  # all columns except label_cal
y = data['label_cal']

np.random.seed(123)
perm = np.random.permutation(len(X))  # random permutation of the row order
X = X.loc[perm]
y = y[perm]
X = preprocessing.scale(X)  # z-score standardization

We then apply PCA to the standardized data, choosing to keep 6 dimensions:

from sklearn.decomposition import PCA
# Reduce the dimensionality with PCA
pca = PCA(copy=True, n_components=6, whiten=False, random_state=1)
X_new = pca.fit_transform(X)

print('Variance contribution ratios of the 6 retained principal components:')
print(pca.explained_variance_ratio_)
print('Eigenvector of the first principal component:')
print(pca.components_[0:1])
print('Cumulative variance contribution ratio:')
print(sum(pca.explained_variance_ratio_))

The cumulative variance contribution of the first six principal components reaches 89.52%, so the dimensionality reduction preserves most of the information.
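To sanity-check the choice of 6 components, the cumulative explained variance can also be plotted (a small sketch reusing the fitted pca object from above):

# Sketch: cumulative explained variance of the fitted PCA
cum_var = np.cumsum(pca.explained_variance_ratio_)
plt.figure(figsize=(6, 4))
plt.plot(range(1, len(cum_var) + 1), cum_var, marker='o')
plt.axhline(0.9, linestyle='--', color='grey')  # reference line near 90%
plt.xlabel('Number of principal components')
plt.ylabel('Cumulative explained variance ratio')
plt.show()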

After dimensionality reduction, the dataset needs to be split again:

# Split the dataset (stratify by the permuted labels y, which are aligned with X_new)
x_train, x_test, y_train, y_test = train_test_split(X_new, y, test_size=0.2, random_state=2, stratify=y)

After the dimensionality reduction, we again build a random forest classification model and evaluate its classification results.

#  Build a random forest model 
ran_for = RandomForestClassifier(n_estimators=80, random_state=0, class_weight=weights)
#  Training models 
ran_for.fit(x_train, y_train)

# Predict on the test set
y_pred_ran = ran_for.predict(x_test)
y_predprb = ran_for.predict_proba(x_test)[:, 1]

Compute the AUC, print the classification report and plot the confusion matrix:

# Compute the AUC
test_roc_auc = roc_auc_score(y_test, y_predprb)
print(test_roc_auc)
# Print the model's classification report
print(classification_report(y_test, y_pred_ran, digits=2))
# Plot the confusion matrix heat map
cm4 = confusion_matrix(y_test, y_pred_ran)
plt.figure(figsize=(9, 9))
sns.heatmap(cm4, annot=True, linewidths=.5, square=True, cmap='Blues')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
all_sample_title = 'ROC AUC score: {0}'.format(round(test_roc_auc,2))
plt.title(all_sample_title, size=15)

The random forest model on the reduced-dimension data achieves a clear improvement: on the test set it reaches a recall of 0.85 and a precision of 0.83 for the disease class (label_cal=1), with an overall average F1 score of 0.65.

4. Summary

In this case study we used machine learning models to analyze and predict the diagnoses of Indian liver disease patients, working through three stages: data preprocessing, exploratory analysis and classification modeling. In preprocessing, we inspected the data description, found missing values and filled them in. In the exploratory analysis, we compared the prevalence of disease across different age and gender groups. In the classification modeling, we built three different models (logistic regression, decision tree and random forest) and evaluated them by comparing Recall, Precision and F1 scores; the random forest gave the best predictions. To further improve the model's accuracy and efficiency, we reduced the dimensionality of the data with principal component analysis and classified the reduced data with the random forest again, which improved the model's performance to some extent.

iDataCourse is a big data and artificial intelligence course and resource platform for colleges and universities. The platform provides authoritative course resources, data resources and case-study experiment resources, helping universities build big data and AI programs, develop curricula and strengthen teacher capabilities.
