Competition: diabetes genetic risk detection challenge (iFLYTEK)
2022-07-28 08:42:00 【Ling Xianwen】

2022 iFLYTEK A.I. Developer Competition - iFLYTEK Open Platform
Catalog
2022 iFLYTEK A.I. Developer Competition - iFLYTEK Open Platform
2. Competition task
2.1 Data set field description
3. Submission instructions
5.2 View training set and test set field types
5.3 Count missing values per field
6. Logistic regression attempt
6.1 Import logistic regression from sklearn
6.2 Train with logistic regression on the training set and predict on the test set
6.4 Try a decision tree model
7. Feature engineering
7.1 Compute the mean [Body mass index] and [diastolic pressure] for each gender
7.2 Compute the difference between each patient's value and the mean for their gender
8. Higher-order tree models
8.2 Hold out 20% of the training set for validation and train with LightGBM
9. Multi-fold training and ensembling
1. Background of the event
As of 2022, China had about 130 million people with diabetes. The causes of diabetes in China include lifestyle, aging, urbanization, family heredity and other factors. At the same time, diabetes patients are trending younger.
Diabetes can lead to cardiovascular, kidney and cerebrovascular complications. Accurate diagnosis of individuals with diabetes is therefore of great clinical significance, and early prediction of genetic risk helps to prevent the disease.
According to the Guidelines for the Prevention and Treatment of Type 2 Diabetes in China (2017 Edition), diabetes is diagnosed when typical symptoms (excessive thirst and drinking, frequent urination, increased appetite, unexplained weight loss) are accompanied by a random venous plasma glucose ≥ 11.1 mmol/L, or a fasting venous plasma glucose ≥ 7.0 mmol/L, or a 2-hour plasma glucose ≥ 11.1 mmol/L after an oral glucose tolerance test (OGTT) load.
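The diagnostic sentence above combines one required condition (typical symptoms) with three alternative glucose thresholds. A minimal sketch of that logic follows; the function name and argument names are hypothetical, and this is only a literal reading of the quoted thresholds, not clinical guidance.

```python
from typing import Optional


def meets_diagnostic_criteria(
    has_typical_symptoms: bool,
    random_glucose: Optional[float] = None,
    fasting_glucose: Optional[float] = None,
    ogtt_2h_glucose: Optional[float] = None,
) -> bool:
    """Return True if the thresholds quoted above are met (all values in mmol/L)."""
    if not has_typical_symptoms:
        return False
    checks = [
        (random_glucose, 11.1),
        (fasting_glucose, 7.0),
        (ogtt_2h_glucose, 11.1),
    ]
    # Any single available measurement at or above its threshold is enough.
    return any(value is not None and value >= limit for value, limit in checks)


print(meets_diagnostic_criteria(True, fasting_glucose=7.2))
print(meets_diagnostic_criteria(True, ogtt_2h_glucose=9.0))
print(meets_diagnostic_criteria(False, random_glucose=12.0))
```

Only the OGTT value is available as a field in the competition data, so a rule like this cannot replace the learned model; it just makes the thresholds explicit.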
In this competition, you build a genetic risk prediction model for diabetes from the training data set and then predict whether the individuals in the test data set have diabetes, helping diabetes patients with this "sweet trouble".
2. Competition task
2.1 Data set field description
Number: a number that identifies an individual;
Gender: 1 for male, 0 for female;
Year of birth: the year in which the individual was born;
Body mass index: weight divided by the square of height; unit kg/m2;
Family history of diabetes: the genetic background for diabetes, recording which relatives have the disease. There are three values: one parent has diabetes, one uncle or aunt has diabetes, no record;
diastolic pressure: the pressure produced when the heart relaxes and the arteries recoil elastically; unit mmHg;
Oral glucose tolerance test: a laboratory test for diagnosing diabetes. The competition data uses the blood glucose value 120 minutes into the glucose tolerance test; unit mmol/L;
Insulin release test: a fixed oral dose of glucose taken while fasting stimulates the islet β cells to release insulin. The competition data uses the plasma insulin level 120 minutes after taking the glucose; unit pmol/L;
Triceps brachii skinfold thickness: measured at the midpoint of the line between the acromion and the olecranon on the back of the right upper arm, pinching the skinfold parallel to the long axis of the upper limb and measuring longitudinally; unit cm;
Signs of diabetes: the data label; 1 means the individual has diabetes, 0 means the individual does not.
2.2 Training set description
The training set (Game training set.csv) contains 5,070 records and is used to build your prediction model (you may need to do some data analysis first). Its fields are Number, Gender, Year of birth, Body mass index, Family history of diabetes, diastolic pressure, Oral glucose tolerance test, Insulin release test, Triceps brachii skinfold thickness and Signs of diabetes (the last column). You can also use feature engineering techniques to build new features.
2.3 Test set description
The test set (Competition test set.csv) contains 1,000 records and is used to verify the performance of the prediction model. Its fields are Number, Gender, Year of birth, Body mass index, Family history of diabetes, diastolic pressure, Oral glucose tolerance test, Insulin release test and Triceps brachii skinfold thickness.
3. Submission instructions
For each individual in the test data set, you must predict whether they have diabetes (has diabetes: 1, no diabetes: 0). The predicted value can only be the integer 1 or 0. The submitted data should have the following format:
uuid,label
1,0
2,1
3,1
...
In this competition, the result file of the prediction model must be named Predicted results.csv before it is submitted. Please make sure the file you submit follows the required format.
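Before uploading, it can help to check the result table against the format described above. The sketch below is one way to do that with pandas; the `check_submission` helper is hypothetical, the three predictions are made-up placeholders, and the uniqueness check on `uuid` is an assumption rather than a stated rule of the competition.

```python
import io

import pandas as pd

# Hypothetical predictions for three test rows; real values come from a trained model.
submission = pd.DataFrame({"uuid": [1, 2, 3], "label": [0, 1, 1]})


def check_submission(df: pd.DataFrame) -> None:
    """Raise AssertionError if the table does not match the required format."""
    assert list(df.columns) == ["uuid", "label"], "columns must be uuid,label"
    assert df["label"].isin([0, 1]).all(), "labels must be the integers 0 or 1"
    assert df["uuid"].is_unique, "each uuid should appear once (assumption)"


check_submission(submission)

# Write to an in-memory buffer here; pass a file name to to_csv for the real file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps pandas from writing its row index as an extra first column, which would break the `uuid,label` header.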
4. Evaluation metric
Submitted results are scored with the F1-score; the larger the F1-score, the better the prediction model performs. The F1-score is defined as:
$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
where:
$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$$
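The F1-score is the harmonic mean of precision and recall for the positive class (label 1). A minimal sketch in plain Python, assuming binary 0/1 labels; the function name and the toy labels are illustrative only, and returning 0.0 when there are no true positives is a convention chosen here.

```python
def f1_score_binary(y_true, y_pred):
    """F1 for the positive class (label 1), computed from TP, FP and FN."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # Precision and recall are zero or undefined; report 0.0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(f1_score_binary(y_true, y_pred))
```

In this toy case TP = 2, FP = 1 and FN = 1, so precision and recall are both 2/3 and the F1-score is 2/3 as well.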


5. Data analysis
5.1 Import data
- Unzip the competition data and read it with pandas:
import pandas as pd
train_df = pd.read_csv('./ Open data of diabetes genetic risk prediction challenge / Game training set .csv', encoding='gbk')
test_df = pd.read_csv('./ Open data of diabetes genetic risk prediction challenge / Competition test set .csv', encoding='gbk')
print(train_df.shape, test_df.shape)
5.2 View training set and test set field types
print(train_df.dtypes, test_df.dtypes)


5.3 Count missing values per field
train_df.isnull().sum()
Number 0
Gender 0
Year of birth 0
Body mass index 0
Family history of diabetes 0
diastolic pressure 247
Oral glucose tolerance test 0
Insulin release test 0
Triceps brachii skinfold thickness 0
Signs of diabetes 0
dtype: int64
test_df.isnull().sum()
Number 0
Gender 0
Year of birth 0
Body mass index 0
Family history of diabetes 0
diastolic pressure 49
Oral glucose tolerance test 0
Insulin release test 0
Triceps brachii skinfold thickness 0
dtype: int64
Missing-value proportion of each column in the training set and test set
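The missing proportion per column can be computed by averaging the boolean table returned by `isnull()`. A minimal sketch on a toy table (the `toy_df` values are made up and stand in for `train_df`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df; the real table is read from the competition CSV.
toy_df = pd.DataFrame({
    "Body mass index": [24.1, 30.2, 19.5, 27.8],
    "diastolic pressure": [80.0, np.nan, 76.0, np.nan],
})

# isnull() gives a boolean table; the column mean is the fraction of missing rows.
missing_ratio = toy_df.isnull().mean()
print(missing_ratio)
```

Applied to the counts listed above, this gives 247 / 5070 ≈ 4.9% for the training set and 49 / 1000 = 4.9% for the test set.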


The only column with explicit missing values is diastolic pressure, and the proportion of missing values is small.
However, the training set clearly contains other values that stand in for missing data:
An Oral glucose tolerance test value of -1, an Insulin release test value of 0 and a Triceps brachii skinfold thickness of 0 are also missing values; they are dealt with later.
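One way to handle these placeholder values is to turn them into real NaNs so that they are counted and treated like any other missing value. A minimal sketch on a toy table; the column names follow the English field names used in this post (the actual CSV headers may differ) and the numbers are made up.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "Oral glucose tolerance test": [5.9, -1.0, 7.3],
    "Insulin release test": [0.0, 12.5, 30.1],
    "Triceps brachii skinfold thickness": [2.1, 0.0, 3.4],
})

# Placeholder value that stands for "missing" in each column, as described above.
sentinels = {
    "Oral glucose tolerance test": -1,
    "Insulin release test": 0,
    "Triceps brachii skinfold thickness": 0,
}

cleaned = toy_df.copy()
for column, sentinel in sentinels.items():
    cleaned[column] = cleaned[column].replace(sentinel, np.nan)

print(cleaned.isnull().sum())
```

Tree models such as LightGBM accept NaN directly, whereas scikit-learn's logistic regression needs the gaps imputed first.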
5.4 Analyze field types
Screenshot from ashui.


Description of training set and test set


5.5 Field correlation
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['FangSong']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly
# Correlation heat map for the training set (numeric columns only; the family history column is text)
plt.subplots(figsize=(10, 10))
sns.heatmap(train_df.select_dtypes('number').corr(method='pearson'), annot=True, vmax=1, square=True, cmap='YlGnBu')
plt.savefig('train_pearson.jpg', dpi=800)
How to draw a heat map: http://t.csdn.cn/FQIro
# Correlation heat map for the test set (numeric columns only)
plt.subplots(figsize=(10, 10))
sns.heatmap(test_df.select_dtypes('number').corr(method='pearson'), annot=True, vmax=1, square=True, cmap='YlGnBu')
plt.savefig('test_pearson.jpg', dpi=800)
The heat map shows that, in the training set, Body mass index and Triceps brachii skinfold thickness correlate relatively strongly with the label, with Triceps brachii skinfold thickness the strongest. The correlations between the fields themselves are generally low.
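The same reading can be taken from numbers instead of a plot by ranking the features by the absolute value of their Pearson correlation with the label. A minimal sketch on made-up data; the column names follow the English field names used in this post, and the values are not from the competition.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Body mass index": [22.0, 31.0, 24.0, 35.0, 20.0, 29.0],
    "Year of birth": [1990, 1975, 1988, 1992, 1980, 1985],
    "Family history of diabetes": ["No record"] * 6,  # text column, excluded below
    "Signs of diabetes": [0, 1, 0, 1, 0, 1],
})

# Keep numeric columns only, then read off the label's column of the matrix.
numeric = toy_df.select_dtypes("number")
label_corr = (
    numeric.corr(method="pearson")["Signs of diabetes"]
    .drop("Signs of diabetes")
    .abs()
    .sort_values(ascending=False)
)
print(label_corr)
```

Pearson correlation only captures linear relationships, so a low value does not by itself mean a field is useless to a tree model.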
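6. Logistic regression attempt
6.1 Import logistic regression from sklearn
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
# Assumption: the post never defines train_data / train_dataset / test_dataset.
# Here they are taken to be the one-hot encoded tables with missing values filled with 0.
train_data = pd.get_dummies(train_df).fillna(0)
test_dataset = pd.get_dummies(test_df).fillna(0)
train_dataset = train_data.drop([" Number ", " Signs of diabetes "], axis=1)
# Build a logistic regression model: scale features to [0, 1], then fit
model = make_pipeline(
    MinMaxScaler(),
    LogisticRegression()
)
model.fit(train_dataset, train_data[" Signs of diabetes "])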
6.2 Train with logistic regression on the training set and predict on the test set
test_dataset["label"] = model.predict(test_dataset.drop([" Number "], axis=1))
test_dataset.rename({" Number ": 'uuid'}, axis=1)[['uuid', 'label']].to_csv("submit_lr.csv", index=False)
6.3 Submit results
6.4 Try a decision tree model
# Build a decision tree model
model = make_pipeline(
    MinMaxScaler(),
    DecisionTreeClassifier()
)
model.fit(train_dataset, train_data[" Signs of diabetes "])
# Drop the 'label' column added by the logistic regression step before predicting again
test_dataset["label"] = model.predict(test_dataset.drop([" Number ", 'label'], axis=1))
test_dataset.rename({" Number ": 'uuid'}, axis=1)[['uuid', 'label']].to_csv("submit_dt.csv", index=False)
Result:
7. Feature engineering
7.1 Compute the mean [Body mass index] and [diastolic pressure] for each gender
train_dataset.groupby(" Gender ")[[" Body mass index ", " diastolic pressure "]].mean()
7.2 Compute the difference between each patient's value and the mean for their gender
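The post does not show code for this step. One way to sketch it is with `groupby(...).transform("mean")`, which returns the group mean aligned to every row so it can be subtracted directly. The toy table and the `_diff` column name below are assumptions for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Gender": [1, 1, 0, 0],
    "Body mass index": [26.0, 30.0, 20.0, 24.0],
})

# transform("mean") broadcasts each gender's mean back onto its own rows.
gender_mean = toy_df.groupby("Gender")["Body mass index"].transform("mean")
toy_df["Body mass index_diff"] = toy_df["Body mass index"] - gender_mean
print(toy_df)
```

Here the mean is 28 for gender 1 and 22 for gender 0, so every row ends up 2 above or 2 below its group mean. The same pattern applies to diastolic pressure.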

"""
BMI ranges for adults:
  18.5-24 is normal
  below 18.5 is underweight
  24-27 is overweight
  above 27 is considered obese
  above 32 is severely obese
"""
# Assumption: `data` is not defined in the post; here it is the training and test tables stacked together
data = pd.concat([train_df, test_df], ignore_index=True)
def BMI(a):
    if a < 18.5:
        return 0
    elif 18.5 <= a <= 24:
        return 1
    elif 24 < a <= 27:
        return 2
    elif 27 < a <= 32:
        return 3
    else:
        return 4
data['BMI'] = data[' Body mass index '].apply(BMI)
data[' Year of birth '] = 2022 - data[' Year of birth ']  # convert to age
# Family history of diabetes
"""
No record
One uncle or aunt has diabetes / One uncle or aunt has diabetes
One parent has diabetes
"""
def FHOD(a):
    if a == ' No record ':
        return 0
    # two variants of this label are listed above; they read the same in translation
    elif a == ' One uncle or aunt has diabetes ' or a == ' One uncle or aunt has diabetes ':
        return 1
    else:
        return 2
data[' Family history of diabetes '] = data[' Family history of diabetes '].apply(FHOD)
data[' diastolic pressure '] = data[' diastolic pressure '].fillna(-1)
"""
The normal diastolic pressure range is 60-90
"""
def DBP(a):
    if a < 0:
        return -1  # missing value (filled with -1 above) keeps its own bucket
    elif a < 60:
        return 0
    elif a <= 90:
        return 1
    else:
        return 2
data['DBP'] = data[' diastolic pressure '].apply(DBP)
data
8. Higher-order tree models
8.1 Install lightgbm
# pip install lightgbm
import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
8.2 Hold out 20% of the training set for validation and train with LightGBM
train_data = pd.read_csv(" Game training set .csv", encoding='gbk')
test_data = pd.read_csv(" Competition test set .csv", encoding='gbk')
train_data = pd.get_dummies(train_data)
test_data = pd.get_dummies(test_data)
# Split the data set
train_x, valid_x = train_test_split(train_data, test_size=0.2)
clf_lgb = lgb.LGBMClassifier(
    max_depth=3,
    n_estimators=4000,
    n_jobs=-1,
    verbose=-1,
    verbosity=-1,
    learning_rate=0.1,
)
clf_lgb.fit(train_x.drop([" Signs of diabetes "], axis=1), train_x[" Signs of diabetes "])
predicts = clf_lgb.predict(valid_x.drop([" Signs of diabetes "], axis=1))
print(accuracy_score(valid_x[" Signs of diabetes "], predicts))
[LightGBM] [Warning] verbosity is set=-1, verbose=-1 will be ignored. Current value: verbosity=-1
0.9546351084812623
# Parameter search
from sklearn.model_selection import StratifiedKFold, GridSearchCV
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=2022)
classifier = lgb.LGBMClassifier()
params = {
    "max_depth": [4, 5, 6],
    "n_estimators": [3000, 4000, 5000],
    "learning_rate": [0.15, 0.2, 0.25]
}
clf = GridSearchCV(estimator=classifier, param_grid=params, verbose=True, cv=kfold)
clf.fit(train_x.drop([" Signs of diabetes "], axis=1), train_x[" Signs of diabetes "])
predicts1 = clf.best_estimator_.predict(valid_x.drop([" Signs of diabetes "], axis=1))
print(accuracy_score(valid_x[" Signs of diabetes "], predicts1))
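The code above selects and reports models by accuracy, while the competition is scored by F1-score. The sketch below shows how the same kind of grid search can be pointed at F1 through the `scoring` argument. To stay small and fast it uses synthetic data and scikit-learn's `DecisionTreeClassifier` in place of LightGBM; the data, the grid and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the competition features.
rng = np.random.default_rng(2022)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2022
)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=2022),
    param_grid={"max_depth": [2, 3, 4]},
    scoring="f1",  # rank parameter settings by F1 instead of the default accuracy
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=2022),
)
search.fit(X_tr, y_tr)

# Report the held-out score with the competition metric as well.
val_f1 = f1_score(y_val, search.best_estimator_.predict(X_val))
print(search.best_params_, round(val_f1, 3))
```

With an `LGBMClassifier` as the estimator, the `scoring="f1"` argument and the `f1_score` call would be used in the same way.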
9. Multi-fold training and ensembling
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.base import clone
from sklearn.model_selection import KFold
import lightgbm as lgb
# Read the data
train_df = pd.read_csv('./ Open data of diabetes genetic risk prediction challenge / Game training set .csv', encoding='gbk')
test_df = pd.read_csv('./ Open data of diabetes genetic risk prediction challenge / Competition test set .csv', encoding='gbk')
# Basic feature engineering
train_df[' Body mass index _round'] = train_df[' Body mass index '] // 10
test_df[' Body mass index _round'] = test_df[' Body mass index '] // 10
train_df[' Oral glucose tolerance test '] = train_df[' Oral glucose tolerance test '].replace(-1, np.nan)
test_df[' Oral glucose tolerance test '] = test_df[' Oral glucose tolerance test '].replace(-1, np.nan)
# Map the family history labels to ordinal codes
# (the uncle/aunt label has two variants in the data that read the same in translation)
dict_family_history = {
    ' No record ': 0,
    ' One uncle or aunt has diabetes ': 1,
    ' One uncle or aunt has diabetes ': 1,
    ' One parent has diabetes ': 2
}
train_df[' Family history of diabetes '] = train_df[' Family history of diabetes '].map(dict_family_history)
test_df[' Family history of diabetes '] = test_df[' Family history of diabetes '].map(dict_family_history)
train_df[' Family history of diabetes '] = train_df[' Family history of diabetes '].astype('category')
test_df[' Family history of diabetes '] = test_df[' Family history of diabetes '].astype('category')
train_df[' Gender '] = train_df[' Gender '].astype('category')
test_df[' Gender '] = test_df[' Gender '].astype('category')
train_df[' Age '] = 2022 - train_df[' Year of birth ']
test_df[' Age '] = 2022 - test_df[' Year of birth ']
# Difference between each OGTT value and the mean of its family-history group
train_df[' Oral glucose tolerance test _diff'] = train_df[' Oral glucose tolerance test '] - train_df.groupby(' Family history of diabetes ')[' Oral glucose tolerance test '].transform('mean')
test_df[' Oral glucose tolerance test _diff'] = test_df[' Oral glucose tolerance test '] - test_df.groupby(' Family history of diabetes ')[' Oral glucose tolerance test '].transform('mean')
# Model cross validation
def run_model_cv(model, kf, X_tr, y, X_te, cate_col=None):
    # Out-of-fold probabilities for the training set, averaged probabilities for the test set
    train_pred = np.zeros((len(X_tr), len(np.unique(y))))
    test_pred = np.zeros((len(X_te), len(np.unique(y))))
    cv_clf = []
    for tr_idx, val_idx in kf.split(X_tr, y):
        x_tr = X_tr.iloc[tr_idx]; y_tr = y.iloc[tr_idx]
        x_val = X_tr.iloc[val_idx]; y_val = y.iloc[val_idx]
        call_back = [
            lgb.early_stopping(50),
        ]
        eval_set = [(x_val, y_val)]
        fold_model = clone(model)  # fresh copy per fold so cv_clf holds distinct models
        fold_model.fit(x_tr, y_tr, eval_set=eval_set, callbacks=call_back)
        cv_clf.append(fold_model)
        train_pred[val_idx] = fold_model.predict_proba(x_val)
        test_pred += fold_model.predict_proba(X_te)
    # Average the test predictions over all folds
    test_pred /= kf.n_splits
    return train_pred, test_pred, cv_clf
clf = lgb.LGBMClassifier(
    max_depth=3,
    n_estimators=4000,
    n_jobs=-1,
    verbose=-1,
    verbosity=-1,
    learning_rate=0.1,
)
train_pred, test_pred, cv_clf = run_model_cv(
    clf, KFold(n_splits=5),
    train_df.drop([' Number ', ' Signs of diabetes '], axis=1),
    train_df[' Signs of diabetes '],
    test_df.drop([' Number '], axis=1),
)
# Out-of-fold accuracy on the training set
print((train_pred.argmax(1) == train_df[' Signs of diabetes ']).mean())
test_df['label'] = test_pred.argmax(1)
test_df.rename({' Number ': 'uuid'}, axis=1)[['uuid', 'label']].to_csv('submit.csv', index=False)

This was my first data mining competition. I borrowed a lot from more experienced participants and learned a great deal. I'll keep at it next time.