Competition: diabetes genetic risk detection challenge (iFLYTEK)
2022-07-28 08:42:00 【Ling Xianwen】

2022 iFLYTEK A.I. Developer competition - IFLYTEK open platform
Contents
2022 iFLYTEK A.I. Developer competition - IFLYTEK open platform
Two 、 Competition task
2.1 Data set field description
Three 、 Submission instructions
5.2 View training set and test set field types
5.3 Count missing values per field
Six 、 Logistic regression attempt
6.1 Import logistic regression from sklearn
6.2 Train logistic regression on the training set and predict on the test set
6.4 Try the decision tree model
Seven 、 Feature engineering
7.1 Compute the average [Body mass index] and [diastolic pressure] for each gender
7.2 Compute the difference between each patient's value and the average for their gender
Eight 、 Advanced tree models
8.2 Hold out 20% of the training set for validation and train with LightGBM
Nine 、 Multi-fold training and ensembling
One 、 Background of the event
By 2022, about 130 million people in China had diabetes. The causes of diabetes in China include lifestyle, aging, urbanization, and family heredity. At the same time, diabetes is increasingly affecting younger people.
Diabetes can lead to cardiovascular, renal, and cerebrovascular complications. Accurately diagnosing individuals with diabetes is therefore of great clinical significance, and early prediction of genetic risk helps prevent the disease.
According to the Guidelines for the Prevention and Treatment of Type 2 Diabetes in China (2017 Edition), diabetes is diagnosed when typical symptoms (excessive thirst and drinking, frequent urination, excessive eating, unexplained weight loss) are present together with a random venous plasma glucose ≥ 11.1 mmol/L, or a fasting venous plasma glucose ≥ 7.0 mmol/L, or a 2-hour plasma glucose ≥ 11.1 mmol/L after the glucose load in an oral glucose tolerance test (OGTT).
In this competition, you build a genetic risk prediction model for diabetes from the training data set and then predict whether the individuals in the test data set have diabetes, helping diabetes patients with this "sweet trouble".
Two 、 Competition task
2.1 Data set field description
Number: a unique identifier for each individual;
Gender: 1 for male, 0 for female;
Year of birth: the year the individual was born;
Body mass index: weight divided by the square of height, unit kg/m²;
Family history of diabetes: records which family members have diabetes, as an indicator of hereditary risk. There are three values: one parent has diabetes, one uncle or aunt has diabetes, no record;
diastolic pressure: the pressure produced when the heart relaxes and the arteries elastically recoil, unit mmHg;
Oral glucose tolerance test: a laboratory test used to diagnose diabetes. The competition data uses the blood glucose value 120 minutes into the test, unit mmol/L;
Insulin release test: a fixed oral dose of glucose is taken while fasting to stimulate the islet β cells to release insulin. The competition data uses the plasma insulin level 120 minutes after taking the glucose, unit pmol/L;
Triceps brachii skinfold thickness: measured at the midpoint of the line between the acromion and the olecranon on the back of the right upper arm, pinching the skin fold parallel to the long axis of the upper limb and measuring longitudinally, unit cm;
Signs of diabetes: the data label; 1 means the individual has diabetes, 0 means they do not.
2.2 Training set description
The training set (Game training set.csv) contains 5,070 records and is used to build your prediction model (you may need to do some data analysis first). Its fields are Number, Gender, Year of birth, Body mass index, Family history of diabetes, diastolic pressure, Oral glucose tolerance test, Insulin release test, Triceps brachii skinfold thickness, and Signs of diabetes (the last column). You can also use feature engineering to build new features.
2.3 Test set description
The test set (Competition test set.csv) contains 1,000 records and is used to verify the performance of the prediction model. Its fields are Number, Gender, Year of birth, Body mass index, Family history of diabetes, diastolic pressure, Oral glucose tolerance test, Insulin release test, and Triceps brachii skinfold thickness.
Three 、 Submission instructions
For each individual in the test data set, you must predict whether they have diabetes (has diabetes: 1, no diabetes: 0). The predicted value must be the integer 1 or 0. The submission should have the following format:
uuid,label
1,0
2,1
3,1
...
In this competition, the prediction result file must be named Predicted results.csv before it is submitted. Please make sure the submitted file follows the required format.
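The format rules above can be sanity-checked locally before uploading. The following is a minimal sketch, not part of the competition toolkit: the function name `check_submission` is hypothetical and the sample rows are made up for illustration.

```python
import csv
import io

def check_submission(csv_text):
    """Return a list of problems found in a submission; an empty list means OK."""
    problems = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    # The first row must be the exact header required by the competition.
    if not rows or rows[0] != ["uuid", "label"]:
        problems.append("header must be exactly: uuid,label")
        return problems
    # Every data row needs two columns and a label of 0 or 1.
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 columns")
        elif row[1] not in ("0", "1"):
            problems.append(f"line {line_no}: label must be 0 or 1")
    return problems

sample = "uuid,label\n1,0\n2,1\n3,1\n"
print(check_submission(sample))  # an empty list: no problems found
```

In practice the text would come from reading the saved result file; the check only covers the header and label values, not whether every test-set uuid is present.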
Four 、 Evaluation metric
Submitted results are evaluated with the F1-score; the larger the F1-score, the better the prediction model performs. The F1-score is defined as follows:
$$F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$$
where:
$$precision = \frac{TP}{TP + FP}$$
$$recall = \frac{TP}{TP + FN}$$
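To make the F1-score definition concrete, here is a small sketch that computes it from true positives (TP), false positives (FP), and false negatives (FN). The helper name `f1_score_binary` and the toy labels are assumptions for illustration; the competition platform's own scoring code is not shown in this post.

```python
def f1_score_binary(y_true, y_pred):
    """F1-score for 0/1 labels, computed from TP, FP and FN counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No correct positive prediction: precision and recall are both 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: TP = 2, FP = 1, FN = 1, so precision = recall = 2/3.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(round(f1_score_binary(y_true, y_pred), 4))  # 0.6667
```

Because F1 ignores true negatives, a model that predicts 0 for everyone scores 0 here even if its accuracy looks acceptable.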
Five 、 Data analysis
5.1 Import data
- Unzip the competition data and read it with pandas:
import pandas as pd
# File and column names appear here in English translation; the released data uses Chinese names (hence encoding='gbk')
train_df = pd.read_csv('./Open data of diabetes genetic risk prediction challenge/Game training set.csv', encoding='gbk')
test_df = pd.read_csv('./Open data of diabetes genetic risk prediction challenge/Competition test set.csv', encoding='gbk')
print(train_df.shape, test_df.shape)
5.2 View training set and test set field types
print(train_df.dtypes, test_df.dtypes)


5.3 Count missing values per field
train_df.isnull().sum()
Number                                  0
Gender                                  0
Year of birth                           0
Body mass index                         0
Family history of diabetes              0
diastolic pressure                    247
Oral glucose tolerance test             0
Insulin release test                    0
Triceps brachii skinfold thickness      0
Signs of diabetes                       0
dtype: int64
test_df.isnull().sum()
Number                                  0
Gender                                  0
Year of birth                           0
Body mass index                         0
Family history of diabetes              0
diastolic pressure                     49
Oral glucose tolerance test             0
Insulin release test                    0
Triceps brachii skinfold thickness      0
dtype: int64
Missing proportion of each column in the training set and test set


The only column that contains missing values is the diastolic pressure column, and the missing share is small.
However, the training set clearly contains additional placeholder values:
An oral glucose tolerance test value of -1, an insulin release test value of 0, and a triceps brachii skinfold thickness of 0 are also missing values. These are handled later.
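One way to make these placeholder values visible to `isnull()` is to convert them to NaN before counting. The sketch below uses a made-up four-row frame; the column names follow the English field names used in this post, and the `sentinels` mapping is an illustrative assumption based on the observation above.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values, one sentinel per affected column.
df = pd.DataFrame({
    "diastolic pressure": [80.0, np.nan, 75.0, 90.0],
    "Oral glucose tolerance test": [5.2, -1.0, 7.9, 6.1],
    "Insulin release test": [0.0, 12.5, 0.0, 9.3],
    "Triceps brachii skinfold thickness": [1.5, 0.0, 2.1, 3.0],
})

# Placeholder value that stands for "missing" in each column.
sentinels = {
    "Oral glucose tolerance test": -1,
    "Insulin release test": 0,
    "Triceps brachii skinfold thickness": 0,
}

clean = df.copy()
for col, marker in sentinels.items():
    clean[col] = clean[col].replace(marker, np.nan)

# Share of missing values per column after the conversion.
missing_ratio = clean.isnull().mean()
print(missing_ratio)
```

Tree models such as LightGBM accept NaN directly, whereas scalers and logistic regression in sklearn need the gaps imputed first.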
5.4 Analyze field types
Screenshot from ashui.


Descriptive statistics of the training set and test set


5.5 Field correlation
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['FangSong']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly
# Correlation heat map for the training set
# numeric_only=True skips the text column (family history), which pandas >= 2.0 would otherwise reject
plt.subplots(figsize=(10, 10))
sns.heatmap(train_df.corr(method='pearson', numeric_only=True), annot=True, vmax=1, square=True, cmap='YlGnBu')
plt.savefig('train_pearson.jpg', dpi=800)
How to draw a heat map: http://t.csdn.cn/FQIro
# Correlation heat map for the test set
plt.subplots(figsize=(10, 10))
sns.heatmap(test_df.corr(method='pearson', numeric_only=True), annot=True, vmax=1, square=True, cmap='YlGnBu')
plt.savefig('test_pearson.jpg', dpi=800)
As the heat map shows, in the training set body mass index and triceps brachii skinfold thickness correlate relatively strongly with the label, and triceps brachii skinfold thickness has the highest correlation. Correlations among the fields themselves are generally low.
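Reading the label column off a heat map can be replaced by sorting the correlations directly. This is a minimal sketch on made-up data; `rank_by_label_correlation` and the `feature_*` columns are hypothetical names, not fields of the competition data.

```python
import pandas as pd

# Toy data: three numeric features, one text column, and a 0/1 label.
df = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5, 6],
    "feature_b": [6, 3, 5, 2, 4, 1],
    "feature_c": [1, 5, 2, 4, 3, 6],
    "family": ["x", "y", "x", "y", "x", "y"],
    "label": [0, 0, 0, 1, 1, 1],
})

def rank_by_label_correlation(frame, label_col):
    """Absolute Pearson correlation of each numeric column with the label, strongest first."""
    numeric = frame.select_dtypes(include="number")  # text columns are skipped
    corr = numeric.corr(method="pearson")[label_col]
    return corr.drop(label_col).abs().sort_values(ascending=False)

ranking = rank_by_label_correlation(df, "label")
print(ranking)
```

Pearson correlation only captures linear association, so a low value here does not rule out a feature being useful to a tree model.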
Six 、 Logistic regression attempt
6.1 Import logistic regression from sklearn
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
# Build the logistic regression model: scale features to [0, 1], then fit
# train_dataset holds the prepared training features and train_data the raw training frame (their construction is not shown here)
model = make_pipeline(
    MinMaxScaler(),
    LogisticRegression()
)
model.fit(train_dataset, train_data["Signs of diabetes"])
6.2 Train logistic regression on the training set and predict on the test set
test_dataset["label"] = model.predict(test_dataset.drop(["Number"], axis=1))
test_dataset.rename({"Number": 'uuid'}, axis=1)[['uuid', 'label']].to_csv("submit_lr.csv", index=False)
6.3 Submit results
6.4 Try the decision tree model
# Try a decision tree model
model = make_pipeline(
    MinMaxScaler(),
    DecisionTreeClassifier()
)
model.fit(train_dataset, train_data["Signs of diabetes"])
# The 'label' column added in 6.2 must be dropped before predicting again
test_dataset["label"] = model.predict(test_dataset.drop(["Number", 'label'], axis=1))
test_dataset.rename({"Number": 'uuid'}, axis=1)[['uuid', 'label']].to_csv("submit_dt.csv", index=False)
Result:
Seven 、 Feature engineering
7.1 Compute the average [Body mass index] and [diastolic pressure] for each gender
train_dataset.groupby("Gender")[["Body mass index", "diastolic pressure"]].mean()
7.2 Compute the difference between each patient's value and the average for their gender
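The post does not show code for this step, so the following is only a sketch of one common way to do it with `groupby(...).transform('mean')`. The data is made up, and the `_diff` column name is an assumption chosen for illustration.

```python
import pandas as pd

# Toy data: two patients per gender with made-up BMI values.
df = pd.DataFrame({
    "Gender": [1, 1, 0, 0],
    "Body mass index": [22.0, 26.0, 20.0, 30.0],
})

# transform('mean') returns the group mean aligned to every original row,
# so it can be subtracted from the raw column directly.
group_mean = df.groupby("Gender")["Body mass index"].transform("mean")
df["Body mass index_diff"] = df["Body mass index"] - group_mean
print(df)
```

The same pattern applies to diastolic pressure; missing values stay NaN in the difference column because the mean skips them.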
"""
The normal adult body mass index is between 18.5 and 24
Below 18.5 is underweight
24-27 is overweight
Above 27 is considered obese
Above 32 is severely obese
"""
def BMI(a):
    if a < 18.5:
        return 0
    elif 18.5 <= a <= 24:
        return 1
    elif 24 < a <= 27:
        return 2
    elif 27 < a <= 32:
        return 3
    else:
        return 4
data['BMI'] = data['Body mass index'].apply(BMI)
data['Year of birth'] = 2022 - data['Year of birth']  # convert to age
# Family history of diabetes
"""
No record
One uncle or aunt has diabetes (the raw data contains two spellings of this value)
One parent has diabetes
"""
def FHOD(a):
    if a == 'No record':
        return 0
    elif a == 'One uncle or aunt has diabetes':
        return 1
    else:
        return 2
data['Family history of diabetes'] = data['Family history of diabetes'].apply(FHOD)
data['diastolic pressure'] = data['diastolic pressure'].fillna(-1)
"""
The normal diastolic pressure range is 60-90
"""
def DBP(a):
    if a < 0:
        return -1  # missing value (filled with -1 above) keeps its own category
    elif a < 60:
        return 0
    elif a <= 90:
        return 1
    else:
        return 2
data['DBP'] = data['diastolic pressure'].apply(DBP)
data
Eight 、 Advanced tree models
8.1 Install and import lightgbm
import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
8.2 Hold out 20% of the training set for validation and train with LightGBM
train_data = pd.read_csv("Game training set.csv", encoding='gbk')
test_data = pd.read_csv("Competition test set.csv", encoding='gbk')
# One-hot encode the text column (family history)
train_data = pd.get_dummies(train_data)
test_data = pd.get_dummies(test_data)
# Split the data set
train_x, valid_x = train_test_split(train_data, test_size=0.2)
clf_lgb = lgb.LGBMClassifier(
    max_depth=3,
    n_estimators=4000,
    n_jobs=-1,
    verbose=-1,
    verbosity=-1,
    learning_rate=0.1,
)
clf_lgb.fit(train_x.drop(["Signs of diabetes"], axis=1), train_x["Signs of diabetes"])
predicts = clf_lgb.predict(valid_x.drop(["Signs of diabetes"], axis=1))
print(accuracy_score(valid_x["Signs of diabetes"], predicts))
[LightGBM] [Warning] verbosity is set=-1, verbose=-1 will be ignored. Current value: verbosity=-1
0.9546351084812623
# Search parameters
from sklearn.model_selection import StratifiedKFold, GridSearchCV
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=2022)
classifier = lgb.LGBMClassifier()
params = {
    "max_depth": [4, 5, 6],
    "n_estimators": [3000, 4000, 5000],
    "learning_rate": [0.15, 0.2, 0.25]
}
clf = GridSearchCV(estimator=classifier, param_grid=params, verbose=True, cv=kfold)
clf.fit(train_x.drop(["Signs of diabetes"], axis=1), train_x["Signs of diabetes"])
# Evaluate the best parameter combination on the held-out validation set
predicts1 = clf.best_estimator_.predict(valid_x.drop(["Signs of diabetes"], axis=1))
print(accuracy_score(valid_x["Signs of diabetes"], predicts1))
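Before launching a grid search it helps to know how many models it will train. The sketch below only counts the work for the grid and fold count used above; it does not call LightGBM or sklearn, and the variable names are illustrative.

```python
from itertools import product

# Same grid and fold count as the search above.
params = {
    "max_depth": [4, 5, 6],
    "n_estimators": [3000, 4000, 5000],
    "learning_rate": [0.15, 0.2, 0.25],
}
n_splits = 5

# Every combination of one value per parameter is one candidate.
combos = [dict(zip(params, values)) for values in product(*params.values())]
total_fits = len(combos) * n_splits
print(len(combos), total_fits)  # 27 135
```

With the default `refit=True`, `GridSearchCV` trains one more model on all the data it was given using the best combination, so this search performs 136 fits, each with thousands of trees.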
Nine 、 Multi-fold training and ensembling
import pandas as pd
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold
import lightgbm as lgb
# Read the data
train_df = pd.read_csv('./Open data of diabetes genetic risk prediction challenge/Game training set.csv', encoding='gbk')
test_df = pd.read_csv('./Open data of diabetes genetic risk prediction challenge/Competition test set.csv', encoding='gbk')
# Basic feature engineering
train_df['Body mass index_round'] = train_df['Body mass index'] // 10
test_df['Body mass index_round'] = test_df['Body mass index'] // 10
# -1 marks a missing oral glucose tolerance test value
train_df['Oral glucose tolerance test'] = train_df['Oral glucose tolerance test'].replace(-1, np.nan)
test_df['Oral glucose tolerance test'] = test_df['Oral glucose tolerance test'].replace(-1, np.nan)
family_history_map = {
    'No record': 0,
    'One uncle or aunt has diabetes': 1,  # the raw data has two spellings of this value; both map to 1
    'One parent has diabetes': 2
}
train_df['Family history of diabetes'] = train_df['Family history of diabetes'].map(family_history_map)
test_df['Family history of diabetes'] = test_df['Family history of diabetes'].map(family_history_map)
train_df['Family history of diabetes'] = train_df['Family history of diabetes'].astype('category')
test_df['Family history of diabetes'] = test_df['Family history of diabetes'].astype('category')
train_df['Gender'] = train_df['Gender'].astype('category')
test_df['Gender'] = test_df['Gender'].astype('category')
train_df['Age'] = 2022 - train_df['Year of birth']
test_df['Age'] = 2022 - test_df['Year of birth']
# Difference from the mean of the same family-history group
train_df['Oral glucose tolerance test_diff'] = train_df['Oral glucose tolerance test'] - train_df.groupby('Family history of diabetes')['Oral glucose tolerance test'].transform('mean')
test_df['Oral glucose tolerance test_diff'] = test_df['Oral glucose tolerance test'] - test_df.groupby('Family history of diabetes')['Oral glucose tolerance test'].transform('mean')
# Cross-validated training: out-of-fold predictions for train, fold-averaged predictions for test
def run_model_cv(model, kf, X_tr, y, X_te, cate_col=None):
    train_pred = np.zeros((len(X_tr), len(np.unique(y))))
    test_pred = np.zeros((len(X_te), len(np.unique(y))))
    cv_clf = []
    for tr_idx, val_idx in kf.split(X_tr, y):
        x_tr = X_tr.iloc[tr_idx]; y_tr = y.iloc[tr_idx]
        x_val = X_tr.iloc[val_idx]; y_val = y.iloc[val_idx]
        call_back = [
            lgb.early_stopping(50, verbose=False),
        ]
        eval_set = [(x_val, y_val)]
        fold_model = clone(model)  # fresh copy per fold so cv_clf holds distinct models
        fold_model.fit(x_tr, y_tr, eval_set=eval_set, callbacks=call_back)
        cv_clf.append(fold_model)
        train_pred[val_idx] = fold_model.predict_proba(x_val)
        test_pred += fold_model.predict_proba(X_te)
    test_pred /= kf.n_splits
    return train_pred, test_pred, cv_clf
clf = lgb.LGBMClassifier(
    max_depth=3,
    n_estimators=4000,
    n_jobs=-1,
    verbose=-1,
    verbosity=-1,
    learning_rate=0.1,
)
train_pred, test_pred, cv_clf = run_model_cv(
    clf, KFold(n_splits=5),
    train_df.drop(['Number', 'Signs of diabetes'], axis=1),
    train_df['Signs of diabetes'],
    test_df.drop(['Number'], axis=1),
)
# Out-of-fold accuracy on the training set
print((train_pred.argmax(1) == train_df['Signs of diabetes']).mean())
test_df['label'] = test_pred.argmax(1)
test_df.rename({'Number': 'uuid'}, axis=1)[['uuid', 'label']].to_csv('submit.csv', index=False)
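The script above takes `argmax`, which for two classes is the same as a 0.5 cutoff on the positive-class probability. Since the competition is scored by F1-score, one possible refinement (an assumption on my part, not something done in the original solution) is to pick the cutoff that maximizes F1 on the out-of-fold predictions. The helper names and toy probabilities below are made up for illustration.

```python
import numpy as np

def f1_at_threshold(y_true, prob_pos, threshold):
    """F1-score when probabilities >= threshold are predicted as class 1."""
    pred = (prob_pos >= threshold).astype(int)
    tp = int(((pred == 1) & (y_true == 1)).sum())
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    if tp == 0:
        return 0.0
    # Equivalent to 2 * precision * recall / (precision + recall).
    return 2 * tp / (2 * tp + fp + fn)

def best_threshold(y_true, prob_pos, candidates):
    """Return the candidate threshold with the highest F1 and that F1."""
    scores = [f1_at_threshold(y_true, prob_pos, t) for t in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Toy out-of-fold labels and positive-class probabilities.
y_true = np.array([0, 0, 1, 1, 1, 0])
prob_pos = np.array([0.10, 0.45, 0.40, 0.80, 0.55, 0.30])
threshold, score = best_threshold(y_true, prob_pos, [0.3, 0.4, 0.5, 0.6])
print(threshold, round(score, 3))  # 0.4 0.857
```

With the variables from the script, `prob_pos` would be `train_pred[:, 1]` and the chosen threshold would then be applied to `test_pred[:, 1]`; a threshold tuned this way can overfit on small data, so it is worth checking against the leaderboard.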

This was my first data mining competition. I borrowed a lot from more experienced participants and learned a great deal. I will keep at it next time.