Competition: diabetes genetic risk detection challenge (iFLYTEK)
2022-07-28 08:42:00 【Ling Xianwen】

2022 iFLYTEK A.I. Developer competition - IFLYTEK open platform

Contents

1. Background of the event
2. Task of the event
    2.1 Dataset field description
    2.2 Training set description
    2.3 Test set description
3. Submission instructions
4. Evaluation metric
5. Data analysis
    5.1 Import data
    5.2 View training set and test set field types
    5.3 Count missing values per field
    5.4 Analyze field types
    5.5 Field correlation
6. Logistic regression attempt
    6.1 Import logistic regression from sklearn
    6.2 Train on the training set and predict on the test set
    6.3 Submit results
    6.4 Try the decision tree model
7. Feature engineering
    7.1 Mean body mass index and diastolic pressure by gender
    7.2 Difference between each patient's value and the gender mean
8. Higher-order tree model
    8.1 Install LightGBM
    8.2 Split 20% of the training set as a validation set and train with LightGBM
9. Multi-fold training and ensembling
1. Background of the event
As of 2022, there are about 130 million people with diabetes in China. The causes of diabetes in China are influenced by lifestyle, aging, urbanization, family heredity and other factors; at the same time, diabetes patients are trending younger.
Diabetes can lead to cardiovascular, kidney and cerebrovascular complications. Accurately identifying individuals with diabetes therefore has great clinical significance, and early prediction of the genetic risk of diabetes helps prevent its onset.
According to the Guidelines for the Prevention and Treatment of Type 2 Diabetes in China (2017 Edition), diabetes is diagnosed when typical symptoms are present (excessive thirst and drinking, excessive urination, excessive eating, unexplained weight loss) together with a random venous plasma glucose ≥ 11.1 mmol/L, a fasting venous plasma glucose ≥ 7.0 mmol/L, or a plasma glucose ≥ 11.1 mmol/L two hours after the load in an oral glucose tolerance test (OGTT).
In this competition, you need to build a diabetes genetic risk prediction model from the training dataset and then predict whether the individuals in the test dataset have diabetes, helping diabetes patients solve this "sweet trouble".
2. Task of the event
2.1 Dataset field description
Number: an ID that identifies an individual;
Gender: 1 for male, 0 for female;
Year of birth: the year of birth;
Body mass index: weight divided by the square of height, unit kg/m²;
Family history of diabetes: records which family members have diabetes, identifying the genetic characteristics of diabetes; there are three categories: one parent has diabetes, one uncle or aunt has diabetes, and no record;
Diastolic pressure: the pressure produced when the heart relaxes and the arterial walls recoil, unit mmHg;
Oral glucose tolerance test: a laboratory test used to diagnose diabetes; the competition data uses the blood glucose value 120 minutes after the glucose tolerance test, unit mmol/L;
Insulin release test: a fasting oral glucose load stimulates the islet β cells to release insulin; the competition data is the plasma insulin level 120 minutes after taking glucose, unit pmol/L;
Triceps brachii skinfold thickness: at the midpoint of the line between the acromion and the olecranon on the back of the right upper arm, a skin fold parallel to the long axis of the limb is pinched and measured longitudinally, unit cm;
Signs of diabetes: the data label, 1 means having diabetes, 0 means not having diabetes.
2.2 Training set description
The training set (Game training set.csv) contains 5070 records and is used to build your prediction model (you may want to do some data analysis first). Its fields are number, gender, year of birth, body mass index, family history of diabetes, diastolic pressure, oral glucose tolerance test, insulin release test, triceps brachii skinfold thickness and signs of diabetes (the last column). You can also use feature engineering techniques to build new features.
2.3 Test set description
The test set (Competition test set.csv) contains 1000 records and is used to evaluate the performance of the prediction model. Its fields are number, gender, year of birth, body mass index, family history of diabetes, diastolic pressure, oral glucose tolerance test, insulin release test and triceps brachii skinfold thickness.
3. Submission instructions
For each individual in the test dataset, you must predict whether they have diabetes (has diabetes: 1, no diabetes: 0); the predicted value can only be the integer 1 or 0. The submitted data must have the following format:
uuid,label
1,0
2,1
3,1
...
In this competition, the prediction result file must be named "Predicted results.csv" and then submitted. Please make sure the file you submit is in the standard format.
4. Evaluation metric
Submissions are evaluated with the F1-score; the larger the F1-score, the better the prediction model. The F1-score is defined as:

F1-score = 2 × Precision × Recall / (Precision + Recall)

where:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN)

with TP, FP and FN denoting the numbers of true positives, false positives and false negatives.
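The metric can be reproduced locally with scikit-learn; below is a minimal sketch, assuming y_true and y_pred are 0/1 label arrays (the variable names are illustrative, not from the competition kit).

from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0]  # ground-truth labels (illustrative)
y_pred = [1, 0, 0, 1, 1]  # model predictions (illustrative)
print(f1_score(y_true, y_pred))  # harmonic mean of precision and recall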
5. Data analysis
5.1 Import data
- Unzip the competition data and read it with pandas;
import pandas as pd
train_df = pd.read_csv('./ Open data of diabetes genetic risk prediction challenge / Game training set .csv', encoding='gbk')
test_df = pd.read_csv('./ Open data of diabetes genetic risk prediction challenge / Competition test set .csv', encoding='gbk')
print(train_df.shape, test_df.shape)
print(train_df.dtypes, test_df.dtypes)

5.2 View training set and test set field types


5.3 Count missing values per field

train_df.isnull().sum()

Number                                0
Gender                                0
Year of birth                         0
Body mass index                       0
Family history of diabetes            0
diastolic pressure                  247
Oral glucose tolerance test           0
Insulin release test                  0
Triceps brachii skinfold thickness    0
Signs of diabetes                     0
dtype: int64

test_df.isnull().sum()

Number                                0
Gender                                0
Year of birth                         0
Body mass index                       0
Family history of diabetes            0
diastolic pressure                   49
Oral glucose tolerance test           0
Insulin release test                  0
Triceps brachii skinfold thickness    0
dtype: int64

Then calculate the proportion of missing values in each column of the training set and the test set:


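A minimal sketch of that computation, assuming the train_df and test_df frames loaded above:

# Proportion of missing values per column
print(train_df.isnull().mean())
print(test_df.isnull().mean())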
The only column containing NaN values is the diastolic pressure column, and the proportion of missing values is small.
However, it is clear that in the training set an oral glucose tolerance test value of -1, an insulin release test value of 0 and a triceps brachii skinfold thickness of 0 are also effectively missing values; these will be dealt with later.
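As an illustration, one way to convert these sentinel values to NaN is sketched below (the later sections of this post only actually replace the -1 values in the oral glucose tolerance test; the column names follow the translated names used throughout this write-up):

import numpy as np
# Treat sentinel values as missing
train_df[' Oral glucose tolerance test '] = train_df[' Oral glucose tolerance test '].replace(-1, np.nan)
train_df[' Insulin release test '] = train_df[' Insulin release test '].replace(0, np.nan)
train_df[' Triceps brachii skinfold thickness '] = train_df[' Triceps brachii skinfold thickness '].replace(0, np.nan)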
5.4 Analyze field types
(The field-type analysis screenshot is from Ashui and is not reproduced here.)

Descriptive statistics of the training set and the test set:


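A minimal sketch to produce these summaries, assuming the frames loaded earlier:

# count, mean, std, min, quartiles and max of the numeric columns
print(train_df.describe())
print(test_df.describe())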
5.5 Field correlation
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['font.sans-serif'] = ['FangSong'] # Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False # Used to display negative sign normally
# Correlation heatmap of the training set
plt.subplots(figsize=(10,10))
# Note: the family history column is text, so on newer pandas versions corr() may need numeric_only=True
sns.heatmap(train_df.corr(method='pearson'), annot=True, vmax=1, square=True, cmap='YlGnBu')
plt.savefig('train_pearson.jpg', dpi=800)
How to draw a heatmap: http://t.csdn.cn/FQIro
# Correlation heatmap of the test set
plt.subplots(figsize=(10,10))
sns.heatmap(test_df.corr(method='pearson'), annot=True, vmax=1, square=True, cmap='YlGnBu')
plt.savefig('test_pearson.jpg', dpi=800)
As the heatmaps show, in the training set the body mass index and the triceps brachii skinfold thickness are relatively strongly correlated with the label, with the triceps brachii skinfold thickness being the most correlated field; the correlations between the fields themselves are generally low.
6. Logistic regression attempt
6.1 Import logistic regression from sklearn
# Build a logistic regression model
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline

# train_dataset holds the preprocessed training features and train_data the raw training frame;
# their construction is not shown in this excerpt
model = make_pipeline(
    MinMaxScaler(),
    LogisticRegression()
)
model.fit(train_dataset, train_data[" Signs of diabetes "])
6.2 Train on the training set and predict on the test set
test_dataset["label"] = model.predict(test_dataset.drop([" Number "], axis=1))
test_dataset.rename({" Number ": 'uuid'}, axis=1)[['uuid', 'label']].to_csv("submit_lr.csv", index=None)
6.3 Submit results
6.4 Try the decision tree model
# Try a decision tree model
model = make_pipeline(
    MinMaxScaler(),
    DecisionTreeClassifier()
)
model.fit(train_dataset, train_data[" Signs of diabetes "])
test_dataset["label"] = model.predict(test_dataset.drop([" Number ", 'label'], axis=1))
test_dataset.rename({" Number ": 'uuid'}, axis=1)[['uuid', 'label']].to_csv("submit_dt.csv", index=None)

Result:
7. Feature engineering
7.1 Mean body mass index and diastolic pressure by gender
train_dataset.groupby(" Gender ")[" Body mass index "].apply(np.mean)
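The snippet above only shows the mean body mass index; a minimal sketch covering both fields named in the heading (assuming the same translated column names used elsewhere in this write-up):

# Mean body mass index and diastolic pressure for each gender
print(train_dataset.groupby(" Gender ")[[" Body mass index ", " diastolic pressure "]].mean())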
7.2 Difference between each patient's value and the gender mean

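A minimal sketch of one way to compute this with groupby/transform, assuming data is the combined feature frame used in the block below (the derived column names are illustrative):

# Difference between each patient's value and the mean of their gender group
data[' Body mass index _diff'] = data[' Body mass index '] - data.groupby(' Gender ')[' Body mass index '].transform('mean')
data[' diastolic pressure _diff'] = data[' diastolic pressure '] - data.groupby(' Gender ')[' diastolic pressure '].transform('mean')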
"""
The normal value of the body mass index for adults is 18.5-24 Between
lower than 18.5 It's a low BMI
stay 24-27 Between them is overweight
27 The above consideration is obesity
higher than 32 You are very fat .
"""
def BMI(a):
if a<18.5:
return 0
elif 18.5<=a<=24:
return 1
elif 24<a<=27:
return 2
elif 27<a<=32:
return 3
else:
return 4
data['BMI']=data[' Body mass index '].apply(BMI)
data[' Year of birth ']=2022-data[' Year of birth '] # Change to age
# Family history of diabetes
"""
Categories in the raw data:
no record;
one uncle or aunt has diabetes (two variant spellings appear in the raw data);
one parent has diabetes.
"""
def FHOD(a):
    if a == ' No record ':
        return 0
    # both variant spellings of "one uncle or aunt has diabetes" map to 1
    elif a == ' One uncle or aunt has diabetes ' or a == ' One uncle or aunt has diabetes ':
        return 1
    else:
        return 2

data[' Family history of diabetes '] = data[' Family history of diabetes '].apply(FHOD)
data[' diastolic pressure '] = data[' diastolic pressure '].fillna(-1)
"""
The normal diastolic pressure range is 60-90 mmHg.
"""
def DBP(a):
    if a < 60:
        return 0
    elif 60 <= a <= 90:
        return 1
    elif a > 90:
        return 2
    else:
        return a  # note: missing values were filled with -1 above, so they fall into the a < 60 branch

data['DBP'] = data[' diastolic pressure '].apply(DBP)
data

8. Higher-order tree model
8.1 Install LightGBM
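If LightGBM is not yet available in the environment, it can usually be installed with pip (a reminder, not part of the original write-up):

pip install lightgbm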
import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
8.2 Split 20% of the training set as a validation set and train with LightGBM
train_data = pd.read_csv(" Game training set .csv", encoding='gbk')
test_data = pd.read_csv(" Competition test set .csv", encoding='gbk')
train_data = pd.get_dummies(train_data)
test_data = pd.get_dummies(test_data)
# Split the dataset
train_x, valid_x = train_test_split(train_data, test_size=0.2)

clf_lgb = lgb.LGBMClassifier(
    max_depth=3,
    n_estimators=4000,
    n_jobs=-1,
    verbose=-1,
    verbosity=-1,
    learning_rate=0.1,
)
clf_lgb.fit(train_x.drop([" Signs of diabetes "], axis=1), train_x[" Signs of diabetes "])
predicts = clf_lgb.predict(valid_x.drop([" Signs of diabetes "], axis=1))
print(accuracy_score(valid_x[" Signs of diabetes "], predicts))
[LightGBM] [Warning] verbosity is set=-1, verbose=-1 will be ignored. Current value: verbosity=-1
0.9546351084812623

# Parameter search
from sklearn.model_selection import StratifiedKFold, GridSearchCV

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=2022)
classifier = lgb.LGBMClassifier()
params = {
    "max_depth": [4, 5, 6],
    "n_estimators": [3000, 4000, 5000],
    "learning_rate": [0.15, 0.2, 0.25]
}
clf = GridSearchCV(estimator=classifier, param_grid=params, verbose=True, cv=kfold)
clf.fit(train_x.drop([" Signs of diabetes "], axis=1), train_x[" Signs of diabetes "])
predicts1 = clf.best_estimator_.predict(valid_x.drop([" Signs of diabetes "], axis=1))
print(accuracy_score(valid_x[" Signs of diabetes "], predicts1))
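After the search finishes, the selected hyper-parameters can be inspected (a small usage note, assuming the clf object fitted above):

# Best hyper-parameter combination and its cross-validated score
print(clf.best_params_)
print(clf.best_score_)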
9. Multi-fold training and ensembling
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import KFold
import lightgbm as lgb
# Reading data
train_df = pd.read_csv('./ Open data of diabetes genetic risk prediction challenge / Game training set .csv', encoding='gbk')
test_df = pd.read_csv('./ Open data of diabetes genetic risk prediction challenge / Competition test set .csv', encoding='gbk')
# Foundation Feature Engineering
train_df[' Body mass index _round'] = train_df[' Body mass index '] // 10
test_df[' Body mass index _round'] = test_df[' Body mass index '] // 10
train_df[' Oral glucose tolerance test '] = train_df[' Oral glucose tolerance test '].replace(-1, np.nan)
test_df[' Oral glucose tolerance test '] = test_df[' Oral glucose tolerance test '].replace(-1, np.nan)
# Map the family history of diabetes to an ordinal code;
# the raw data contains two variant spellings of "one uncle or aunt has diabetes", both mapped to 1
dict_family_history = {
    ' No record ': 0,
    ' One uncle or aunt has diabetes ': 1,
    ' One parent has diabetes ': 2
}
train_df[' Family history of diabetes '] = train_df[' Family history of diabetes '].map(dict_family_history)
test_df[' Family history of diabetes '] = test_df[' Family history of diabetes '].map(dict_family_history)
train_df[' Family history of diabetes '] = train_df[' Family history of diabetes '].astype('category')
test_df[' Family history of diabetes '] = test_df[' Family history of diabetes '].astype('category')
train_df[' Gender '] = train_df[' Gender '].astype('category')
test_df[' Gender '] = test_df[' Gender '].astype('category')
train_df[' Age '] = 2022 - train_df[' Year of birth ']
test_df[' Age '] = 2022 - test_df[' Year of birth ']
train_df[' Oral glucose tolerance test _diff'] = train_df[' Oral glucose tolerance test '] - train_df.groupby(' Family history of diabetes ')[' Oral glucose tolerance test '].transform('mean')
test_df[' Oral glucose tolerance test _diff'] = test_df[' Oral glucose tolerance test '] - test_df.groupby(' Family history of diabetes ')[' Oral glucose tolerance test '].transform('mean')
# Model cross validation
def run_model_cv(model, kf, X_tr, y, X_te, cate_col=None):
    train_pred = np.zeros((len(X_tr), len(np.unique(y))))
    test_pred = np.zeros((len(X_te), len(np.unique(y))))
    cv_clf = []
    for tr_idx, val_idx in kf.split(X_tr, y):
        x_tr = X_tr.iloc[tr_idx]; y_tr = y.iloc[tr_idx]
        x_val = X_tr.iloc[val_idx]; y_val = y.iloc[val_idx]
        call_back = [
            lgb.early_stopping(50),
        ]
        eval_set = [(x_val, y_val)]
        # verbose in fit() is accepted by older lightgbm versions; newer ones rely on verbosity/callbacks
        model.fit(x_tr, y_tr, eval_set=eval_set, callbacks=call_back, verbose=-1)
        cv_clf.append(model)
        # out-of-fold predictions for the validation fold, accumulated predictions for the test set
        train_pred[val_idx] = model.predict_proba(x_val)
        test_pred += model.predict_proba(X_te)
    test_pred /= kf.n_splits
    return train_pred, test_pred, cv_clf
clf = lgb.LGBMClassifier(
    max_depth=3,
    n_estimators=4000,
    n_jobs=-1,
    verbose=-1,
    verbosity=-1,
    learning_rate=0.1,
)
train_pred, test_pred, cv_clf = run_model_cv(
    clf, KFold(n_splits=5),
    train_df.drop([' Number ', ' Signs of diabetes '], axis=1),
    train_df[' Signs of diabetes '],
    test_df.drop([' Number '], axis=1),
)
print((train_pred.argmax(1) == train_df[' Signs of diabetes ']).mean())
test_df['label'] = test_pred.argmax(1)
test_df.rename({' Number ': 'uuid'}, axis=1)[['uuid', 'label']].to_csv('submit.csv', index=None)
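Since the competition metric is the F1-score rather than accuracy, the out-of-fold F1 can also be checked (a small sketch, assuming the train_pred array returned above):

from sklearn.metrics import f1_score
# Out-of-fold F1-score, closer to the competition metric than plain accuracy
print(f1_score(train_df[' Signs of diabetes '], train_pred.argmax(1)))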

This was my first data-mining competition. I learned a lot from the experts in many places, and I will keep trying next time.