Competition: diabetes genetic risk detection challenge (iFLYTEK)
2022-07-28 08:42:00 【Ling Xianwen】

2022 iFLYTEK A.I. Developer competition - IFLYTEK open platform
Catalog
2022 iFLYTEK A.I. Developer competition - IFLYTEK open platform
Two 、 The mission of the event
2.1 Data set field description
Three 、 Submission instructions
5.2 View training set and test set field types
5.3 Count missing values per field
Six 、 Trying logistic regression
6.1 Import logistic regression from sklearn
6.2 Train logistic regression on the training set and predict on the test set
6.4 Try the decision tree model
Seven 、 Feature Engineering
7.1 Compute the average [ Body mass index ] and [ diastolic pressure ] for each gender
7.2 Compute the difference between each patient's value and the average for their gender
Eight 、 High order tree model
8.2 Hold out 20% of the training set as a validation set and train with LightGBM
Nine 、 Multi fold training and integration
One 、 Background of the event
As of 2022, about 130 million people in China have diabetes. The disease is driven by lifestyle, aging, urbanization, family heredity and other factors, and patients are tending to be younger.
Diabetes can lead to cardiovascular, kidney and cerebrovascular complications, so accurately identifying individuals with diabetes is of great clinical significance. Early prediction of genetic risk helps prevent the onset of diabetes.
According to the Guidelines for the Prevention and Treatment of Type 2 Diabetes in China (2017 Edition), diabetes is diagnosed when a person has the typical symptoms (excessive thirst and drinking, frequent urination, increased appetite, unexplained weight loss) and a random venous plasma glucose ≥ 11.1 mmol/L, or a fasting venous plasma glucose ≥ 7.0 mmol/L, or a plasma glucose ≥ 11.1 mmol/L two hours after the load in an oral glucose tolerance test (OGTT).
In this competition you build a genetic risk prediction model for diabetes from the training data set, then predict whether the individuals in the test data set have diabetes, and so help patients deal with this "sweet trouble".
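The diagnostic standard quoted above is a simple rule: typical symptoms plus any one of three glucose thresholds. As an illustration only, it can be written as a small function. The function and parameter names are made up for this sketch and do not come from the competition material.

```python
# Thresholds in mmol/L, as quoted from the 2017 guideline in the text above.
RANDOM_GLUCOSE_LIMIT = 11.1
FASTING_GLUCOSE_LIMIT = 7.0
OGTT_2H_GLUCOSE_LIMIT = 11.1


def meets_glucose_criterion(random_glucose=None, fasting_glucose=None, ogtt_2h_glucose=None):
    """Return True if any supplied plasma glucose value reaches its threshold."""
    checks = [
        (random_glucose, RANDOM_GLUCOSE_LIMIT),
        (fasting_glucose, FASTING_GLUCOSE_LIMIT),
        (ogtt_2h_glucose, OGTT_2H_GLUCOSE_LIMIT),
    ]
    # Measurements that were not taken are passed as None and skipped.
    return any(value is not None and value >= limit for value, limit in checks)


def meets_diagnostic_standard(has_typical_symptoms, **glucose_values):
    """Typical symptoms AND at least one glucose value at or above its threshold."""
    return bool(has_typical_symptoms) and meets_glucose_criterion(**glucose_values)


print(meets_diagnostic_standard(True, fasting_glucose=7.2))
print(meets_diagnostic_standard(True, ogtt_2h_glucose=9.0))
```

The competition label is given directly in the data, so this rule is not needed for modelling; it only shows why the 120-minute OGTT field is likely to be an informative feature.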
Two 、 The mission of the event
2.1 Data set field description
Number : a number that identifies an individual;
Gender : 1 for men, 0 for women;
Year of birth : the individual's year of birth;
Body mass index : weight divided by the square of height, unit kg/m2;
Family history of diabetes : the hereditary background of diabetes, recording which family members have diabetes. There are three values: one parent has diabetes, one uncle or aunt has diabetes, no record;
diastolic pressure : the pressure produced when the heart relaxes and the arteries recoil elastically, unit mmHg;
Oral glucose tolerance test : a laboratory test for diagnosing diabetes. The competition data uses the blood glucose value 120 minutes into the glucose tolerance test, unit mmol/L;
Insulin release test : a fixed oral dose of glucose taken while fasting stimulates the islet β cells to release insulin. The competition data uses the plasma insulin level 120 minutes after taking the glucose, unit pmol/L;
Triceps brachii skinfold thickness : measured at the midpoint of the line between the acromion and the olecranon on the back of the right upper arm; the skinfold is pinched parallel to the long axis of the upper limb and measured longitudinally, unit cm;
Signs of diabetes : the data label, 1 means the individual has diabetes, 0 means the individual does not.
2.2 Training set description
The training set ( Game training set .csv) contains 5070 records and is used to build your prediction model (you may want to do some data analysis first). Its fields are Number, Gender, Year of birth, Body mass index, Family history of diabetes, diastolic pressure, Oral glucose tolerance test, Insulin release test, Triceps brachii skinfold thickness and Signs of diabetes (the last column, which is the label). You can also use feature engineering to build new features.
2.3 Test set description
The test set ( Competition test set .csv) contains 1000 records and is used to verify the performance of the prediction model. Its fields are Number, Gender, Year of birth, Body mass index, Family history of diabetes, diastolic pressure, Oral glucose tolerance test, Insulin release test and Triceps brachii skinfold thickness.
Three 、 Submission instructions
For each individual in the test data set you must predict whether they have diabetes (has diabetes: 1, no diabetes: 0). The predicted value must be the integer 1 or 0. The submitted data must have the following format:
uuid,label
1,0
2,1
3,1
...
The result file of the prediction model must be named Predicted results .csv before it is submitted. Make sure the file you submit follows this format exactly.
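The submission format above is a two-column CSV with a fixed header and integer labels. A minimal sketch of a writer that enforces these rules follows; `write_submission` is a hypothetical helper, and it writes to an in-memory buffer here so that the result can be inspected without touching the disk.

```python
import csv
import io


def write_submission(ids, labels, out):
    """Write `uuid,label` rows to a file-like object; labels must be 0 or 1."""
    if len(ids) != len(labels):
        raise ValueError("ids and labels must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["uuid", "label"])
    for uid, label in zip(ids, labels):
        if label not in (0, 1):
            raise ValueError(f"label for uuid {uid} must be 0 or 1, got {label!r}")
        # int() turns values such as 1.0 or True into a plain 1 in the file.
        writer.writerow([uid, int(label)])


buffer = io.StringIO()
write_submission([1, 2, 3], [0, 1, 1], buffer)
print(buffer.getvalue())
```

To produce a real file, the same function can be given a handle opened with `open(path, "w", newline="")`.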
Four 、 Evaluation indicators
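The submitted results are evaluated with the F1-score; the higher the F1-score, the better the prediction model performs. The F1-score is defined as follows:
F1 = 2 × precision × recall / (precision + recall)
where:
precision = TP / (TP + FP)
recall = TP / (TP + FN)
and TP, FP and FN are the numbers of true positives, false positives and false negatives.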
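To make the metric concrete, here is a small pure-Python sketch that computes the F1-score from label lists by counting true positives, false positives and false negatives. The function name and the toy labels are illustrative only.

```python
def f1_from_labels(y_true, y_pred):
    """F1-score for binary labels, treating 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
score = f1_from_labels(y_true, y_pred)
print(round(score, 4))  # precision = recall = 2/3, so F1 = 2/3
```

Unlike accuracy, this score ignores true negatives, so a model that rarely predicts diabetes can have high accuracy and still a low F1-score.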
Five 、 Data analysis
5.1 Import data
- Unzip the competition data and read it with pandas:
import pandas as pd
train_df = pd.read_csv('./ Open data of diabetes genetic risk prediction challenge / Game training set .csv', encoding='gbk')
test_df = pd.read_csv('./ Open data of diabetes genetic risk prediction challenge / Competition test set .csv', encoding='gbk')
print(train_df.shape, test_df.shape)
5.2 View training set and test set field types
print(train_df.dtypes, test_df.dtypes)
5.3 Count missing values per field
train_df.isnull().sum()
Number                                  0
Gender                                  0
Year of birth                           0
Body mass index                         0
Family history of diabetes              0
diastolic pressure                    247
Oral glucose tolerance test             0
Insulin release test                    0
Triceps brachii skinfold thickness      0
Signs of diabetes                       0
dtype: int64
test_df.isnull().sum()
Number                                  0
Gender                                  0
Year of birth                           0
Body mass index                         0
Family history of diabetes              0
diastolic pressure                     49
Oral glucose tolerance test             0
Insulin release test                    0
Triceps brachii skinfold thickness      0
dtype: int64
Calculation of the missing proportion of each column in the training set and test set
The only column with explicit missing values is diastolic pressure, and the proportion is small (247 of 5070 training rows and 49 of 1000 test rows, about 4.9% in both).
The training set, however, clearly also encodes missing values with placeholders:
an Oral glucose tolerance test value of -1, an Insulin release test value of 0 and a Triceps brachii skinfold thickness of 0 are missing values too, and are dealt with later.
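One way to make the placeholder values visible to `isnull()` is to replace them with NaN before computing the missing proportion. The sketch below uses a small invented DataFrame with shortened, hypothetical column names; in the real data the same mapping would use the competition's column names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the competition data (values and column names are made up).
toy = pd.DataFrame({
    "diastolic": [80.0, np.nan, 75.0, 90.0],
    "ogtt": [5.2, -1.0, 7.9, 6.1],
    "insulin": [0.0, 30.5, 0.0, 12.0],
    "skinfold": [1.2, 0.0, 2.4, 3.1],
})

# Placeholder value that stands for "missing" in each column, as described above.
sentinels = {"ogtt": -1, "insulin": 0, "skinfold": 0}

cleaned = toy.copy()
for col, marker in sentinels.items():
    cleaned[col] = cleaned[col].replace(marker, np.nan)

# Share of missing values per column after the placeholders are converted.
missing_ratio = cleaned.isnull().mean()
print(missing_ratio)
```

Tree models such as LightGBM can handle NaN directly, whereas the scaler and logistic regression used later need the gaps filled first.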
5.4 Analyze field types
Screenshot from ashui .


Description of training set and test set


5.5 Field correlation
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['FangSong']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly
# Correlation heat map for the training set
# select_dtypes keeps only numeric columns; the text column (family history) cannot be correlated
plt.subplots(figsize=(10,10))
sns.heatmap(train_df.select_dtypes('number').corr(method='pearson'), annot=True, vmax=1, square=True, cmap='YlGnBu')
plt.savefig('train_pearson.jpg', dpi=800)
How to draw a heat map: http://t.csdn.cn/FQIro
# Correlation heat map for the test set
plt.subplots(figsize=(10,10))
sns.heatmap(test_df.select_dtypes('number').corr(method='pearson'), annot=True, vmax=1, square=True, cmap='YlGnBu')
plt.savefig('test_pearson.jpg', dpi=800)
As the heat maps show, in the training set Body mass index and Triceps brachii skinfold thickness correlate relatively strongly with the label, and Triceps brachii skinfold thickness has the highest correlation. Correlations among the features themselves are generally low.
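The same ranking can be read off without a plot by taking the label column of the correlation matrix and sorting it by absolute value. This is a sketch on invented data with hypothetical column names, not the competition file.

```python
import pandas as pd

# Toy data: bmi and skinfold follow the label closely, birth_year does not.
toy = pd.DataFrame({
    "bmi": [20.0, 31.0, 24.0, 35.0, 22.0, 29.0],
    "skinfold": [1.0, 3.2, 1.5, 3.9, 1.1, 2.8],
    "birth_year": [1990, 1975, 1988, 1992, 1969, 1983],
    "family_history": ["none", "parent", "none", "parent", "uncle", "none"],
    "label": [0, 1, 0, 1, 0, 1],
})

corr_with_label = (
    toy.select_dtypes("number")      # drop text columns before correlating
       .corr(method="pearson")["label"]
       .drop("label")                # the label's correlation with itself is always 1
       .sort_values(key=abs, ascending=False)
)
print(corr_with_label)
```

Pearson correlation only captures linear relationships, so a low value here does not rule out a feature being useful to a tree model.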
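Six 、 Trying logistic regression
6.1 Import logistic regression from sklearn
# Build a logistic regression model
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
# Assumption: the post never defines train_data, train_dataset or test_dataset.
# train_data is taken to be the training DataFrame, train_dataset its numeric,
# NaN-free feature columns (without " Number " and the label), and test_dataset
# the test DataFrame prepared the same way but still holding " Number ".
model = make_pipeline(
    MinMaxScaler(),
    LogisticRegression()
)
model.fit(train_dataset,train_data[" Signs of diabetes "])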
6.2 Train logistic regression on the training set and predict on the test set
test_dataset["label"] = model.predict(test_dataset.drop([" Number "],axis=1))
test_dataset.rename({" Number ":'uuid'},axis=1)[['uuid','label']].to_csv("submit_lr.csv",index=None)
6.3 Submit results
6.4 Try the decision tree model
# Build a decision tree model
model = make_pipeline(
    MinMaxScaler(),
    DecisionTreeClassifier()
)
model.fit(train_dataset,train_data[" Signs of diabetes "])
# 'label' was added to test_dataset in 6.2, so it has to be dropped before predicting again
test_dataset["label"] = model.predict(test_dataset.drop([" Number ",'label'],axis=1))
test_dataset.rename({" Number ":'uuid'},axis=1)[['uuid','label']].to_csv("submit_dt.csv",index=None)
Seven 、 Feature Engineering
7.1 Compute the average [ Body mass index ] and [ diastolic pressure ] for each gender
train_dataset.groupby(" Gender ")[[" Body mass index ", " diastolic pressure "]].mean()
7.2 Compute the difference between each patient's value and the average for their gender
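The post gives no code for this step, so the following is only a sketch of one way to do it: `groupby(...).transform("mean")` returns the group average aligned to every row, and subtracting it gives each patient's deviation from the average for their gender. The DataFrame and its column names are invented for the example.

```python
import pandas as pd

# Toy data; "gender", "bmi" and "diastolic" are hypothetical stand-in column names.
toy = pd.DataFrame({
    "gender": [1, 1, 0, 0, 1],
    "bmi": [22.0, 28.0, 20.0, 30.0, 25.0],
    "diastolic": [80.0, 90.0, 70.0, 85.0, 76.0],
})

for col in ["bmi", "diastolic"]:
    # transform("mean") broadcasts each gender's mean back to the original rows.
    group_mean = toy.groupby("gender")[col].transform("mean")
    toy[col + "_diff"] = toy[col] - group_mean

print(toy)
```

If the group averages are computed on the training set, reusing those same averages for the test set keeps the feature consistent between the two.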

# Assumption: `data` is the training (or test) DataFrame loaded above; the post does not define it.
"""
The normal adult body mass index is between 18.5 and 24
below 18.5 is underweight
between 24 and 27 is overweight
above 27 is considered obese
above 32 is severely obese
"""
def BMI(a):
    if a<18.5:
        return 0
    elif 18.5<=a<=24:
        return 1
    elif 24<a<=27:
        return 2
    elif 27<a<=32:
        return 3
    else:
        return 4
data['BMI']=data[' Body mass index '].apply(BMI)
data[' Year of birth ']=2022-data[' Year of birth ']  # convert to age
# Family history of diabetes
"""
No record
One uncle or aunt has diabetes / One uncle or aunt has diabetes
One parent has diabetes
"""
def FHOD(a):
    if a==' No record ':
        return 0
    # The two identical strings are presumably two spelling variants in the raw data
    # that the translation rendered the same way; both are kept as in the post.
    elif a==' One uncle or aunt has diabetes ' or a==' One uncle or aunt has diabetes ':
        return 1
    else:
        return 2
data[' Family history of diabetes ']=data[' Family history of diabetes '].apply(FHOD)
data[' diastolic pressure ']=data[' diastolic pressure '].fillna(-1)
"""
The normal diastolic pressure range is 60-90
"""
def DBP(a):
    if a<0:
        return -1  # missing value (filled with -1 above) keeps its own bin instead of counting as low
    elif a<60:
        return 0
    elif 60<=a<=90:
        return 1
    else:
        return 2
data['DBP']=data[' diastolic pressure '].apply(DBP)
data
Eight 、 High order tree model
8.1 Install lightgbm
# Install once from the command line: pip install lightgbm
import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
8.2 Hold out 20% of the training set as a validation set and train with LightGBM
train_data = pd.read_csv(" Game training set .csv",encoding='gbk')
test_data = pd.read_csv(" Competition test set .csv",encoding='gbk')
train_data = pd.get_dummies(train_data)
test_data = pd.get_dummies(test_data)
# Split the data set
train_x,valid_x = train_test_split(train_data,test_size=0.2)
clf_lgb = lgb.LGBMClassifier(
    max_depth=3,
    n_estimators=4000,
    n_jobs=-1,
    verbose=-1,
    verbosity=-1,
    learning_rate=0.1,
)
clf_lgb.fit(train_x.drop([" Signs of diabetes "],axis=1),train_x[" Signs of diabetes "])
predicts = clf_lgb.predict(valid_x.drop([" Signs of diabetes "],axis=1))
print(accuracy_score(valid_x[" Signs of diabetes "], predicts))
[LightGBM] [Warning] verbosity is set=-1, verbose=-1 will be ignored. Current value: verbosity=-1
0.9546351084812623
# Search parameters
from sklearn.model_selection import StratifiedKFold, GridSearchCV
kfold = StratifiedKFold(n_splits=5,shuffle=True,random_state=2022)
classifier = lgb.LGBMClassifier()
params = {
    "max_depth":[4,5,6],  # the key must not contain a leading space
    "n_estimators":[3000,4000,5000],
    "learning_rate":[0.15,0.2,0.25]
}
clf = GridSearchCV(estimator=classifier,param_grid=params,verbose=True,cv=kfold)
clf.fit(train_x.drop([" Signs of diabetes "],axis=1),train_x[" Signs of diabetes "])
predicts1 = clf.best_estimator_.predict(valid_x.drop([" Signs of diabetes "],axis=1))
print(accuracy_score(valid_x[" Signs of diabetes "], predicts1))
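The search above ranks parameter combinations by the classifier's default score, which is accuracy, while the competition is judged on F1-score. A possible variation is to pass `scoring="f1"` to `GridSearchCV`. The sketch below uses synthetic data from `make_classification` and a `DecisionTreeClassifier` as a lightweight stand-in for `LGBMClassifier`; both choices are assumptions made so that the example runs quickly without the competition files.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced binary data (not the competition data).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4,
    weights=[0.7, 0.3], random_state=2022,
)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=2022)
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=2022),
    param_grid={"max_depth": [2, 3, 4]},
    scoring="f1",  # rank candidates by F1-score instead of accuracy
    cv=kfold,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The same `scoring` and `cv` arguments apply unchanged when the estimator is an `LGBMClassifier`.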
Nine 、 Multi fold training and integration
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import KFold
import lightgbm as lgb
# Read the data
train_df = pd.read_csv('./ Open data of diabetes genetic risk prediction challenge / Game training set .csv', encoding='gbk')
test_df = pd.read_csv('./ Open data of diabetes genetic risk prediction challenge / Competition test set .csv', encoding='gbk')
# Basic feature engineering
train_df[' Body mass index _round'] = train_df[' Body mass index '] // 10
test_df[' Body mass index _round'] = test_df[' Body mass index '] // 10  # was computed from train_df
train_df[' Oral glucose tolerance test '] = train_df[' Oral glucose tolerance test '].replace(-1, np.nan)
test_df[' Oral glucose tolerance test '] = test_df[' Oral glucose tolerance test '].replace(-1, np.nan)
# The post listed the uncle/aunt entry twice, presumably two spelling variants in the raw data
# that the translation rendered identically; a Python name also cannot contain spaces.
dict_family_history = {
    ' No record ': 0,
    ' One uncle or aunt has diabetes ': 1,
    ' One parent has diabetes ': 2
}
train_df[' Family history of diabetes '] = train_df[' Family history of diabetes '].map(dict_family_history)
test_df[' Family history of diabetes '] = test_df[' Family history of diabetes '].map(dict_family_history)
train_df[' Family history of diabetes '] = train_df[' Family history of diabetes '].astype('category')
test_df[' Family history of diabetes '] = test_df[' Family history of diabetes '].astype('category')  # was taken from train_df
train_df[' Gender '] = train_df[' Gender '].astype('category')
test_df[' Gender '] = test_df[' Gender '].astype('category')  # was taken from train_df
train_df[' Age '] = 2022 - train_df[' Year of birth ']
test_df[' Age '] = 2022 - test_df[' Year of birth ']
# Select the column before transform so that the category columns are not averaged
train_df[' Oral glucose tolerance test _diff'] = train_df[' Oral glucose tolerance test '] - train_df.groupby(' Family history of diabetes ')[' Oral glucose tolerance test '].transform('mean')
test_df[' Oral glucose tolerance test _diff'] = test_df[' Oral glucose tolerance test '] - test_df.groupby(' Family history of diabetes ')[' Oral glucose tolerance test '].transform('mean')
# Model cross validation
from sklearn.base import clone
def run_model_cv(model, kf, X_tr, y, X_te, cate_col=None):
    train_pred = np.zeros( (len(X_tr), len(np.unique(y))) )
    test_pred = np.zeros( (len(X_te), len(np.unique(y))) )
    cv_clf = []
    for tr_idx, val_idx in kf.split(X_tr, y):
        x_tr = X_tr.iloc[tr_idx]; y_tr = y.iloc[tr_idx]
        x_val = X_tr.iloc[val_idx]; y_val = y.iloc[val_idx]
        call_back = [
            lgb.early_stopping(50),
        ]
        eval_set = [(x_val, y_val)]
        # Fit a fresh copy per fold; appending `model` itself would store the same object five times
        fold_model = clone(model)
        # verbose is no longer passed to fit(): recent LightGBM releases dropped that argument
        fold_model.fit(x_tr, y_tr, eval_set=eval_set, callbacks=call_back)
        cv_clf.append(fold_model)
        train_pred[val_idx] = fold_model.predict_proba(x_val)  # out-of-fold predictions
        test_pred += fold_model.predict_proba(X_te)
    test_pred /= kf.n_splits  # average the test predictions over the folds
    return train_pred, test_pred, cv_clf
clf = lgb.LGBMClassifier(
    max_depth=3,
    n_estimators=4000,
    n_jobs=-1,
    verbose=-1,
    verbosity=-1,
    learning_rate=0.1,
)
train_pred, test_pred, cv_clf = run_model_cv(
    clf, KFold(n_splits=5),
    train_df.drop([' Number ', ' Signs of diabetes '], axis=1),
    train_df[' Signs of diabetes '],
    test_df.drop([' Number '], axis=1),
)
print((train_pred.argmax(1) == train_df[' Signs of diabetes ']).mean())
test_df['label'] = test_pred.argmax(1)
test_df.rename({' Number ': 'uuid'}, axis=1)[['uuid', 'label']].to_csv('submit.csv', index=None)
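Taking `argmax(1)` of a two-column probability array amounts to a fixed 0.5 cut-off. Because the competition is scored on F1-score, one optional refinement, not part of the original post, is to pick the cut-off that maximizes F1 on the out-of-fold probabilities (the second column of `train_pred`) and reuse it on the averaged test probabilities. The sketch below shows the idea on invented numbers; `f1_binary` and `best_threshold` are hypothetical helpers.

```python
import numpy as np


def f1_binary(y_true, y_pred):
    """F1-score for 0/1 arrays, written as 2*TP / (2*TP + FP + FN)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, pos_proba, grid=None):
    """Return the cut-off from `grid` with the highest F1, and that F1."""
    if grid is None:
        grid = np.linspace(0.05, 0.95, 19)
    scores = [f1_binary(y_true, (pos_proba >= t).astype(int)) for t in grid]
    best = int(np.argmax(scores))
    return float(grid[best]), float(scores[best])


# Toy out-of-fold probabilities for the positive class (made-up values).
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])
oof_pos = np.array([0.12, 0.31, 0.47, 0.42, 0.81, 0.22, 0.56, 0.36])

threshold, score = best_threshold(y_true, oof_pos)
print(threshold, round(score, 3))
```

With the arrays from the post this would be called as `best_threshold(train_df[' Signs of diabetes '].to_numpy(), train_pred[:, 1])`, and the test labels would become `(test_pred[:, 1] >= threshold).astype(int)`. A threshold tuned on a small validation sample can overfit, so the gain should be checked on the leaderboard.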

This was my first data mining competition. I borrowed a lot from more experienced participants and learned a great deal, and I will try again next time.


