Competition: Diabetes Genetic Risk Detection Challenge (iFLYTEK)

As of 2022, about 130 million people in China have diabetes. Contributing causes include lifestyle, aging, urbanization, and family heredity. At the same time, diabetes is appearing at younger ages.

Diabetes can lead to cardiovascular, renal, and cerebrovascular complications, so accurately identifying individuals with diabetes is of great clinical significance. Early prediction of genetic risk helps to prevent the onset of diabetes.

According to the Guidelines for the Prevention and Treatment of Type 2 Diabetes in China (2017 Edition), diabetes is diagnosed when a person has typical symptoms of diabetes (excessive thirst, frequent urination, increased appetite, unexplained weight loss) together with a random venous plasma glucose ≥ 11.1 mmol/L, or a fasting venous plasma glucose ≥ 7.0 mmol/L, or a 2-hour plasma glucose ≥ 11.1 mmol/L after the oral glucose tolerance test (OGTT) load.
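The rule above can be written down as a small predicate. This is only an illustrative sketch: the function name and parameter names are hypothetical, and it assumes the criteria read as "typical symptoms AND at least one of the three glucose thresholds". It is not part of the competition materials.

```python
def meets_diagnostic_criteria(has_typical_symptoms,
                              random_glucose=None,
                              fasting_glucose=None,
                              ogtt_2h_glucose=None):
    """Return True if the quoted criteria are met (glucose values in mmol/L).

    Assumption: symptoms are required, plus any one glucose threshold.
    A value of None means that measurement was not taken.
    """
    glucose_criterion = (
        (random_glucose is not None and random_glucose >= 11.1)
        or (fasting_glucose is not None and fasting_glucose >= 7.0)
        or (ogtt_2h_glucose is not None and ogtt_2h_glucose >= 11.1)
    )
    return bool(has_typical_symptoms and glucose_criterion)


print(meets_diagnostic_criteria(True, fasting_glucose=7.2))
print(meets_diagnostic_criteria(True, random_glucose=9.0))
```

The competition data only carries the 120-minute OGTT value, so a model trained on it sees one of the three thresholds at most.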

In this competition, you need to build a genetic risk prediction model for diabetes from the training data set and then predict whether the individuals in the test data set have diabetes. Join us in helping people with diabetes deal with this "sweet trouble".

2. Competition task

2.1 Data set field description

Number: a number that identifies an individual;

Gender: 1 for male, 0 for female;

Year of birth: the year the individual was born;

Body mass index: weight divided by the square of height, in kg/m²;

Family history of diabetes: records which family members have diabetes, as an indicator of hereditary risk. There are three values: one parent has diabetes, one uncle or aunt has diabetes, no record;

diastolic pressure: the pressure produced by the elastic recoil of the arteries while the heart relaxes, in mmHg;

Oral glucose tolerance test: a laboratory test used to diagnose diabetes. The competition data uses the blood glucose value 120 minutes into the test, in mmol/L;

Insulin release test: a fixed oral dose of glucose is taken on an empty stomach to stimulate the islet β cells to release insulin. The competition data uses the plasma insulin level 120 minutes after taking the glucose, in pmol/L;

Triceps brachii skinfold thickness: measured at the midpoint of the line between the acromion and the olecranon on the back of the right upper arm, by pinching a skin fold parallel to the long axis of the arm and measuring it longitudinally, in cm;

Signs of diabetes: the data label; 1 means the individual has diabetes, 0 means they do not.

2.2 Training set description

The training set (Game training set .csv) contains 5,070 records and is used to build your prediction model (you may need to analyze the data first). Its fields are Number, Gender, Year of birth, Body mass index, Family history of diabetes, diastolic pressure, Oral glucose tolerance test, Insulin release test, Triceps brachii skinfold thickness, and Signs of diabetes (the last column). You can also use feature engineering to build new features.

2.3 Test set description

The test set (Competition test set .csv) contains 1,000 records and is used to verify the performance of the prediction model. Its fields are Number, Gender, Year of birth, Body mass index, Family history of diabetes, diastolic pressure, Oral glucose tolerance test, Insulin release test, and Triceps brachii skinfold thickness.

3. Submission instructions

For each individual in the test data set, you must predict whether they have diabetes (has diabetes: 1, no diabetes: 0). The predicted value must be the integer 1 or 0. The submitted data should have the following format:

uuid,label
1,0
2,1
3,1
...

In this competition, the result file produced by the prediction model must be named Predicted results .csv before it is submitted. Please make sure the file you submit follows the format above.
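As a minimal sketch of the format described above, the snippet below builds a two-column table with the uuid,label header and checks that every label is 0 or 1 before writing it out. The helper name make_submission and the toy IDs and predictions are hypothetical and only serve as illustration.

```python
import io

import pandas as pd


def make_submission(ids, labels):
    """Build a submission table with the columns uuid and label."""
    sub = pd.DataFrame({"uuid": list(ids), "label": list(labels)})
    # The rules require every prediction to be the integer 0 or 1.
    if not sub["label"].isin([0, 1]).all():
        raise ValueError("labels must be 0 or 1")
    sub["label"] = sub["label"].astype(int)
    return sub


# Hypothetical IDs and predictions, for illustration only.
submission = make_submission([1, 2, 3], [0, 1, 1])

# Write to an in-memory buffer here; pass a file name to to_csv for a real file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing index=False keeps pandas from writing its row index as an extra first column, which would break the two-column layout.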

4. Evaluation metric

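The system evaluates submissions using the F1-score; the higher the F1-score, the better the prediction model. The F1-score is defined as follows:

F1 = 2 × precision × recall / (precision + recall)

where:

precision = TP / (TP + FP), recall = TP / (TP + FN)

and TP, FP, and FN are the numbers of true positives, false positives, and false negatives.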
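The F1-score is the harmonic mean of precision and recall. A small sketch of the calculation, using only NumPy and made-up labels, might look like this; the function name f1_score_binary is hypothetical, and in practice sklearn.metrics.f1_score computes the same quantity.

```python
import numpy as np


def f1_score_binary(y_true, y_pred):
    """F1-score for binary labels, treating 1 as the positive class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = f1_score_binary(y_true, y_pred)
print(round(score, 4))
```

Here precision and recall are both 2/3, so the F1-score is also 2/3. Unlike accuracy, the score ignores true negatives, which matters when positive cases are the minority.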

5. Data analysis

5.1 Import data

Unzip the competition data and read it with pandas:

import pandas as pd

train_df = pd.read_csv('./ Open data of diabetes genetic risk prediction challenge / Game training set .csv', encoding='gbk')
test_df = pd.read_csv('./ Open data of diabetes genetic risk prediction challenge / Competition test set .csv', encoding='gbk')

print(train_df.shape, test_df.shape)
print(train_df.dtypes, test_df.dtypes)

5.2 View training set and test set field types

5.3 Count missing values per field

train_df.isnull().sum()

 Number             0
 Gender             0
 Year of birth           0
 Body mass index           0
 Family history of diabetes         0
 diastolic pressure          247
 Oral glucose tolerance test        0
 Insulin release test        0
 Triceps brachii skinfold thickness       0
 Signs of diabetes        0
dtype: int64

test_df.isnull().sum()

 Number            0
 Gender            0
 Year of birth          0
 Body mass index          0
 Family history of diabetes        0
 diastolic pressure          49
 Oral glucose tolerance test       0
 Insulin release test       0
 Triceps brachii skinfold thickness      0
dtype: int64

Proportion of missing values in each column of the training set and test set

The only column that contains missing values is diastolic pressure, and the proportion is small: 247 of 5,070 rows (about 4.9%) in the training set and 49 of 1,000 rows (4.9%) in the test set.
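The per-column missing proportion can be obtained by taking the mean of the isnull() mask. The sketch below uses a made-up toy frame with hypothetical column names rather than the competition files.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the competition data (values are made up).
toy_df = pd.DataFrame({
    "bmi": [22.1, 30.4, 27.8, 19.5],
    "diastolic_pressure": [80.0, np.nan, 76.0, np.nan],
})

# isnull() gives a boolean mask; its column-wise mean is the missing ratio.
missing_ratio = toy_df.isnull().mean()
print(missing_ratio)
```

Applied to train_df and test_df, the same expression would give the proportions quoted above.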

However, the training set clearly contains placeholder values as well:

An oral glucose tolerance test value of -1, an insulin release test value of 0, and a triceps brachii skinfold thickness of 0 also denote missing values. These are dealt with later.
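One way to make such placeholders visible to pandas is to replace them with NaN so that isnull() counts them. This is a sketch on made-up data with hypothetical column names; whether to treat all three placeholders this way is a modelling choice, not something the competition prescribes.

```python
import numpy as np
import pandas as pd

# Toy frame with the three placeholder conventions described above.
toy_df = pd.DataFrame({
    "ogtt": [5.9, -1.0, 7.2],
    "insulin": [0.0, 85.0, 120.0],
    "skinfold": [0.0, 2.5, 0.0],
})

# Placeholder value that stands for "missing" in each column.
sentinels = {"ogtt": -1, "insulin": 0, "skinfold": 0}

cleaned = toy_df.copy()
for column, sentinel in sentinels.items():
    cleaned[column] = cleaned[column].replace(sentinel, np.nan)

missing_counts = cleaned.isnull().sum()
print(missing_counts)
```

Tree models such as LightGBM accept NaN directly, while scikit-learn's LogisticRegression needs the gaps filled first.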

5.4 Analyze field types

Screenshot from ashui .

Description of training set and test set

5.5 Field correlation

import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['FangSong']  #  Render Chinese labels correctly 
plt.rcParams['axes.unicode_minus'] = False  #  Render the minus sign correctly 

#  Correlation heat map for the training set (numeric columns only; the family history column is text) 
plt.subplots(figsize=(10,10))
sns.heatmap(train_df.select_dtypes('number').corr(method='pearson'), annot=True, vmax=1, square=True, cmap='YlGnBu')
plt.savefig('train_pearson.jpg', dpi=800)

How to draw a heat map: http://t.csdn.cn/FQIro

#  Correlation heat map for the test set (numeric columns only) 
plt.subplots(figsize=(10,10))
sns.heatmap(test_df.select_dtypes('number').corr(method='pearson'), annot=True, vmax=1, square=True, cmap='YlGnBu')
plt.savefig('test_pearson.jpg', dpi=800)

The heat map shows that, in the training set, Body mass index and Triceps brachii skinfold thickness correlate relatively strongly with the label, with Triceps brachii skinfold thickness the highest. Correlations among the fields themselves are generally low.

6. Trying logistic regression

6.1 Import logistic regression from sklearn

#  Imports 
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline

#  Build a logistic regression model: scale the features to [0, 1], then fit 
#  Assumption: train_dataset holds the numeric training features (no ID or label column, 
#  no missing values) and train_data is the full training frame; their preparation is not shown here 
model = make_pipeline(
    MinMaxScaler(),
    LogisticRegression()
)
model.fit(train_dataset,train_data[" Signs of diabetes "])

6.2 Train logistic regression on the training set and predict on the test set

test_dataset["label"] = model.predict(test_dataset.drop([" Number "],axis=1))

test_dataset.rename({" Number ":'uuid'},axis=1)[['uuid','label']].to_csv("submit_lr.csv",index=None)

6.3 Submit results

6.4 Try the decision tree model

#  Build a decision tree model with the same pipeline 
model = make_pipeline(
    MinMaxScaler(),
    DecisionTreeClassifier()
)
model.fit(train_dataset,train_data[" Signs of diabetes "])

test_dataset["label"] = model.predict(test_dataset.drop([" Number ",'label'],axis=1))
test_dataset.rename({" Number ":'uuid'},axis=1)[['uuid','label']].to_csv("submit_dt.csv",index=None)


7. Feature engineering

7.1 Compute the mean [Body mass index] and [diastolic pressure] for each gender

train_dataset.groupby(" Gender ")[[" Body mass index ", " diastolic pressure "]].mean()

7.2 Compute the difference between each patient's value and the mean for their gender
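A common way to build this kind of feature is groupby(...).transform('mean'), which returns the group mean aligned to every row so it can be subtracted directly. The sketch below uses made-up values and hypothetical column names; it is not the author's code.

```python
import pandas as pd

# Toy data: two individuals per gender (values are made up).
toy_df = pd.DataFrame({
    "gender": [0, 0, 1, 1],
    "bmi": [20.0, 24.0, 26.0, 30.0],
})

# transform('mean') broadcasts each gender's mean back onto its rows.
group_mean = toy_df.groupby("gender")["bmi"].transform("mean")

# Difference between each individual's BMI and the mean for their gender.
toy_df["bmi_diff"] = toy_df["bmi"] - group_mean
print(toy_df)
```

The same pattern applies to diastolic pressure or any other numeric column; if the means are computed on the training set only, they can be mapped onto the test set to keep the two consistent.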

"""
 For adults, a normal body mass index lies between 18.5 and 24 
 Below 18.5 is underweight 
 Between 24 and 27 is overweight 
 Above 27 is considered obese 
 Above 32 is severely obese 
"""
def BMI(a):
    if a<18.5:
        return 0
    elif 18.5<=a<=24:
        return 1
    elif 24<a<=27:
        return 2
    elif 27<a<=32:
        return 3
    else:
        return 4
    
data['BMI']=data[' Body mass index '].apply(BMI)
data[' Year of birth ']=2022-data[' Year of birth ']  # Convert year of birth to age (the column keeps its original name) 
# Family history of diabetes, encoded by how close the affected relative is 
"""
 No record -> 0 
 One uncle or aunt has diabetes -> 1 
 One parent has diabetes -> 2 
"""
def FHOD(a):
    if a==' No record ':
        return 0
    elif a==' One uncle or aunt has diabetes ' or a==' One uncle or aunt has diabetes ':
        return 1
    else:
        return 2
    
data[' Family history of diabetes ']=data[' Family history of diabetes '].apply(FHOD)
data[' diastolic pressure ']=data[' diastolic pressure '].fillna(-1)  # -1 marks a missing measurement 
"""
 The normal diastolic pressure range is 60-90 mmHg 
"""
def DBP(a):
    if a<0:
        return -1  # Keep missing values in their own bin instead of counting them as low 
    elif a<60:
        return 0
    elif 60<=a<=90:
        return 1
    else:
        return 2
data['DBP']=data[' diastolic pressure '].apply(DBP)
data

8. Advanced tree models

8.1 install lightgbm

import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

8.2 Hold out 20% of the training set for validation and train with LightGBM

train_data = pd.read_csv(" Game training set .csv",encoding='gbk')
test_data = pd.read_csv(" Competition test set .csv",encoding='gbk')

train_data = pd.get_dummies(train_data)
test_data = pd.get_dummies(test_data)
#  Divide the data set 
train_x,valid_x = train_test_split(train_data,test_size=0.2)

#  verbosity=-1 alone silences LightGBM; passing verbose=-1 as well only triggers a warning 
clf_lgb = lgb.LGBMClassifier(
    max_depth=3, 
    n_estimators=4000, 
    n_jobs=-1, 
    verbosity=-1,
    learning_rate=0.1,
)
clf_lgb.fit(train_x.drop([" Signs of diabetes "],axis=1),train_x[" Signs of diabetes "])
predicts = clf_lgb.predict(valid_x.drop([" Signs of diabetes "],axis=1))
print(accuracy_score(valid_x[" Signs of diabetes "], predicts))

0.9546351084812623

#  Grid search over hyperparameters with stratified 5-fold cross validation 
from sklearn.model_selection import StratifiedKFold, GridSearchCV

kfold = StratifiedKFold(n_splits=5,shuffle=True,random_state=2022)
classifier = lgb.LGBMClassifier()
params = {
    "max_depth":[4,5,6],
    "n_estimators":[3000,4000,5000],
    "learning_rate":[0.15,0.2,0.25]
}
clf  = GridSearchCV(estimator=classifier,param_grid=params,verbose=True,cv=kfold)
clf.fit(train_x.drop([" Signs of diabetes "],axis=1),train_x[" Signs of diabetes "])
#  Evaluate the best parameter combination on the held-out validation split 
predicts1 = clf.best_estimator_.predict(valid_x.drop([" Signs of diabetes "],axis=1))
print(accuracy_score(valid_x[" Signs of diabetes "], predicts1))

9. Multi-fold training and ensembling

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import KFold
import lightgbm as lgb

#  Reading data 
train_df = pd.read_csv('./ Open data of diabetes genetic risk prediction challenge / Game training set .csv', encoding='gbk')
test_df = pd.read_csv('./ Open data of diabetes genetic risk prediction challenge / Competition test set .csv', encoding='gbk')


#  Basic feature engineering 
#  Each test-set feature must be derived from test_df, not train_df 
train_df[' Body mass index _round'] = train_df[' Body mass index '] // 10
test_df[' Body mass index _round'] = test_df[' Body mass index '] // 10

#  -1 marks a missing oral glucose tolerance test value 
train_df[' Oral glucose tolerance test '] = train_df[' Oral glucose tolerance test '].replace(-1, np.nan)
test_df[' Oral glucose tolerance test '] = test_df[' Oral glucose tolerance test '].replace(-1, np.nan)

#  Ordinal encoding of family history; the uncle/aunt value is listed twice 
#  because the raw data appears to spell it in two ways 
dict_family_history = {
    ' No record ': 0,
    ' One uncle or aunt has diabetes ': 1,
    ' One uncle or aunt has diabetes ': 1,
    ' One parent has diabetes ': 2
}

train_df[' Family history of diabetes '] = train_df[' Family history of diabetes '].map(dict_family_history)
test_df[' Family history of diabetes '] = test_df[' Family history of diabetes '].map(dict_family_history)

train_df[' Family history of diabetes '] = train_df[' Family history of diabetes '].astype('category')
test_df[' Family history of diabetes '] = test_df[' Family history of diabetes '].astype('category')

train_df[' Gender '] = train_df[' Gender '].astype('category')
test_df[' Gender '] = test_df[' Gender '].astype('category')

train_df[' Age '] = 2022 - train_df[' Year of birth ']
test_df[' Age '] = 2022 - test_df[' Year of birth ']

#  Difference between each OGTT value and the mean of its family-history group 
train_df[' Oral glucose tolerance test _diff'] = train_df[' Oral glucose tolerance test '] - train_df.groupby(' Family history of diabetes ')[' Oral glucose tolerance test '].transform('mean')
test_df[' Oral glucose tolerance test _diff'] = test_df[' Oral glucose tolerance test '] - test_df.groupby(' Family history of diabetes ')[' Oral glucose tolerance test '].transform('mean')


#  Model cross validation 
from sklearn.base import clone

def run_model_cv(model, kf, X_tr, y, X_te, cate_col=None):
    #  Out-of-fold probabilities for the training set, averaged probabilities for the test set 
    train_pred = np.zeros( (len(X_tr), len(np.unique(y))) )
    test_pred = np.zeros( (len(X_te), len(np.unique(y))) )

    cv_clf = []
    for tr_idx, val_idx in kf.split(X_tr, y):
        x_tr = X_tr.iloc[tr_idx]; y_tr = y.iloc[tr_idx]

        x_val = X_tr.iloc[val_idx]; y_val = y.iloc[val_idx]

        call_back = [
            lgb.early_stopping(50),
        ]
        eval_set = [(x_val, y_val)]
        #  Fit a fresh copy per fold so that cv_clf holds one distinct model per fold 
        fold_model = clone(model)
        fold_model.fit(x_tr, y_tr, eval_set=eval_set, callbacks=call_back)

        cv_clf.append(fold_model)

        train_pred[val_idx] = fold_model.predict_proba(x_val)
        test_pred += fold_model.predict_proba(X_te)

    test_pred /= kf.n_splits
    return train_pred, test_pred, cv_clf

clf = lgb.LGBMClassifier(
    max_depth=3, 
    n_estimators=4000, 
    n_jobs=-1, 
    verbosity=-1,
    learning_rate=0.1,
)

train_pred, test_pred, cv_clf = run_model_cv(
    clf, KFold(n_splits=5),
    train_df.drop([' Number ', ' Signs of diabetes '], axis=1),
    train_df[' Signs of diabetes '],
    test_df.drop([' Number '], axis=1),
)

print((train_pred.argmax(1) == train_df[' Signs of diabetes ']).mean())
test_df['label'] = test_pred.argmax(1)
test_df.rename({' Number ': 'uuid'}, axis=1)[['uuid', 'label']].to_csv('submit.csv', index=None)