
How to Get Started with a Data Mining Competition (Beginner Edition)

2022-07-28 13:46:00 【Demeanor 78】

Datawhale hands-on guide

Contributors: Herding bear, Luoxiutao, Si Yuxin, Pan Shuyu, and others

This is a simple competition tutorial. Our goal is to help students take the first step toward becoming an AI training master. Data mining involves a lot to learn. If you are just getting started, we suggest not worrying about the principles behind every piece of code at first: get the code running, then look up material on the concepts the code uses. This makes your study more targeted and makes it easier to enjoy learning. A journey of a thousand miles begins with a single step; start your AI learning journey here!

Contributors: Herding bear, Luoxiutao

1. Preparation steps

1.1 Platform registration and competition sign-up

Competition link:
https://challenge.xfyun.cn/topic/info?type=diabetes&ch=ds22-dw-gzh02
Register an account (remember to fill in your personal information):

Click "Register" in the top right corner of the page.

Fill in your personal information to complete registration.

Click to sign up for the competition; the page shows that enrollment succeeded.

Click "Entrants".

Sign-up is complete.

1.2 Data download

Data acquisition

Download the data from the official website: download the data and complete real-name authentication.
Detailed instructions: https://xj15uxcopw.feishu.cn/docx/doxcn11gwo7cEuAXWhCrDld4Inb
Put the data files and the code file in the same folder so that the code runs correctly.

1.3 Reference material

For Python environment setup, refer to:

Mac: "The most comprehensive tutorial for installing Anaconda on Mac" https://zhuanlan.zhihu.com/p/350828057
Windows: "Super detailed Anaconda installation tutorial"
https://blog.csdn.net/fan18317517352/article/details/123035625

2. Practical approach

This is a data mining competition. Participants build a model from the training set data, use it to predict on the validation set data (the test set file in the code), and submit the predictions.

The task is to build a model that predicts whether a patient has diabetes from the patient's test data. This is a typical binary classification problem (has diabetes / does not have diabetes), and the model outputs 0 or 1 (has diabetes: 1, does not have diabetes: 0).

In machine learning, classification tasks usually bring to mind algorithms such as logistic regression and decision trees. In this baseline we use a decision tree to build the model. When solving a machine learning problem, we generally follow a fixed process; the code in 2.1 goes through these stages: data exploration and preprocessing, feature engineering, model training, and results output.
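As a rough illustration of that process, the sketch below trains the two algorithms mentioned above and scores them on a held-out part of the labelled data. It is not the competition baseline: the data is synthetic (generated with make_classification), and the 8 features, the 80/20 split, and the use of accuracy as the metric are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the competition data (assumption: 8 numeric features).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out 20% of the labelled rows so the models can be scored locally.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)            # train on the training part
    predictions = model.predict(X_valid)   # predict 0/1 for held-out rows
    scores[name] = accuracy_score(y_valid, predictions)
    print(name, round(scores[name], 3))
```

Scoring on held-out rows gives a local estimate of model quality before anything is submitted, since the competition data to be predicted has no labels.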

2.1 Code implementation

Run the following code in Jupyter Notebook or another Python environment.

# Install the dependencies. On Windows, run pip install in the cmd window; see the environment setup references above
# The PyPI package that provides sklearn is named scikit-learn
#!pip install scikit-learn
#!pip install pandas
#---------------------------------------------------
# Import libraries
#---------------- Data exploration ----------------
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
# Data preprocessing
# File and column names in this listing are translated; replace them with the exact names in the downloaded data if they differ
data1=pd.read_csv(' Game training set .csv',encoding='gbk')
data2=pd.read_csv(' Competition test set .csv',encoding='gbk')
# The test set has no label, so mark it with the placeholder -1
data2[' Signs of diabetes ']=-1
# Merge the training set and the test set so both get the same processing
data=pd.concat([data1,data2],axis=0,ignore_index=True)
# Fill missing values in the diastolic pressure feature with -1
data[' diastolic pressure ']=data[' diastolic pressure '].fillna(-1)

#---------------- Feature engineering ----------------
"""
Convert the year of birth into age
"""
data[' Age ']=2022-data[' Year of birth ']  # age as of 2022

"""
For adults, a normal body mass index is between 18.5 and 24
Below 18.5 is underweight
Between 24 and 27 is overweight
Above 27 is considered obese
Above 32 is severely obese
"""
def BMI(a):
    if a<18.5:
        return 0
    elif 18.5<=a<=24:
        return 1
    elif 24<a<=27:
        return 2
    elif 27<a<=32:
        return 3
    else:
        return 4

data['BMI']=data[' Body mass index '].apply(BMI)

# Family history of diabetes
"""
No record
One uncle or aunt has diabetes (listed twice in the source; the two variants read the same in translation)
One parent has diabetes
"""
def FHOD(a):
    if a==' No record ':
        return 0
    elif a==' One uncle or aunt has diabetes ' or a==' One uncle or aunt has diabetes ':
        return 1
    else:
        return 2

data[' Family history of diabetes ']=data[' Family history of diabetes '].apply(FHOD)
"""
Normal diastolic pressure is in the range 60-90
"""
def DBP(a):
    if 0<=a<60:
        return 0
    elif 60<=a<=90:
        return 1
    elif a>90:
        return 2
    else:
        # Negative values, i.e. the -1 used for missing data, pass through unchanged
        return a
data['DBP']=data[' diastolic pressure '].apply(DBP)

#------------------------------------
# Split the processed data back into the training set and the test set. The training set is used to train the model; the test set has no labels and is used to generate the predictions to submit
# The ID number has no relationship to whether the patient has diabetes, so this irrelevant feature is dropped
train=data[data[' Signs of diabetes '] !=-1]
test=data[data[' Signs of diabetes '] ==-1]
train_label=train[' Signs of diabetes ']
train=train.drop([' Number ',' Signs of diabetes ',' Year of birth '],axis=1)
test=test.drop([' Number ',' Signs of diabetes ',' Year of birth '],axis=1)

#---------------- Model training ----------------
model = DecisionTreeClassifier()
model.fit(train, train_label)
y_pre=model.predict(test)
y_pre  # display the predictions (in a notebook)

#---------------- Results output ----------------
# Write the predictions into the label column of the sample submission file
result=pd.read_csv(' Submit sample .csv')
result['label']=y_pre
result.to_csv('result-de.csv',index=False)
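The BMI binning in the listing is applied one row at a time with apply. As an optional alternative, the same cut-offs can be expressed as a vectorised lookup with numpy.select. The sketch below is an illustration only: bmi_category is a hypothetical helper name, bmi_scalar restates the logic of BMI() from the listing for comparison, and the sample values are made up.

```python
import numpy as np
import pandas as pd

def bmi_scalar(a):
    """Row-by-row binning, same logic as BMI() in the listing."""
    if a < 18.5:
        return 0
    elif 18.5 <= a <= 24:
        return 1
    elif 24 < a <= 27:
        return 2
    elif 27 < a <= 32:
        return 3
    else:
        return 4

def bmi_category(values: pd.Series) -> pd.Series:
    """Vectorised binning (hypothetical helper); the first true condition wins."""
    conditions = [
        values < 18.5,   # underweight
        values <= 24,    # normal (18.5 to 24)
        values <= 27,    # overweight
        values <= 32,    # obese
    ]
    codes = np.select(conditions, [0, 1, 2, 3], default=4)
    return pd.Series(codes, index=values.index)

# Made-up sample values that hit every boundary, plus a missing value.
sample_bmi = pd.Series([17.0, 18.5, 24.0, 24.1, 27.0, 27.5, 32.0, 32.1, np.nan])
vectorised = bmi_category(sample_bmi)
row_by_row = sample_bmi.apply(bmi_scalar)
print((vectorised == row_by_row).all())
```

Both versions put a missing body mass index into category 4, because every comparison with NaN is false. If that is not the intended behaviour, fill or flag missing values before binning.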

2.2 Submitting results

On the results submission page, upload the predictions file (the CSV file generated by the program), then check your score and ranking.

Select the generated result file (result-de.csv in the code above) and click Submit.

Click "My grades" to view the results.
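Before uploading, it can help to run a few basic checks on the generated file. The sketch below is one possible check, not part of the competition tooling: check_submission is a hypothetical helper, the uuid column and the small frames are made up, and only the label column name comes from the code above.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, sample: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("columns differ from the sample file")
    if len(submission) != len(sample):
        problems.append("row count differs from the sample file")
    if "label" not in submission.columns:
        problems.append("missing 'label' column")
    elif not submission["label"].isin([0, 1]).all():
        problems.append("'label' contains values other than 0 and 1")
    return problems

# Made-up frames standing in for the sample file and the generated results.
sample = pd.DataFrame({"uuid": [1, 2, 3], "label": [0, 0, 0]})
good = pd.DataFrame({"uuid": [1, 2, 3], "label": [1, 0, 1]})
bad = pd.DataFrame({"uuid": [1, 2], "label": [1, -1]})

print(check_submission(good, sample))
print(check_submission(bad, sample))
```

With real files, both frames would be loaded with pd.read_csv from the sample submission file and the generated result file.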

