How to Compete in a Data Mining Competition: Beginner's Edition
2022-07-28 13:46:00 【Demeanor 78】
Datawhale practical guide
Contributors: Herding bear, Luoxiutao, Si Yuxin, Pan Shuyu, et al.
This is a simple competition tutorial. Our goal is to help learners take the first step toward becoming an AI training master. Data mining involves a lot of material, so beginners are advised not to worry at first about the principles behind every piece of code. Get the code running first, then look up the concepts the code relies on and study the relevant material. This makes your learning more targeted and makes it easier to enjoy. A journey begins with a single step; start your AI learning journey here!
Contributors: Herding bear, Luoxiutao

1. Preparation steps
1.1 Platform registration and competition sign-up
Competition link:
https://challenge.xfyun.cn/topic/info?type=diabetes&ch=ds22-dw-gzh02register (remember to fill in your personal information)


Click to register; the page then confirms that you have signed up successfully.


1.2 Data download
Getting the data
Download the data from the official website: download the data and complete real-name authentication.
Detailed instructions are available at: https://xj15uxcopw.feishu.cn/docx/doxcn11gwo7cEuAXWhCrDld4Inb
Put the data files and the code file in the same folder so that the code runs correctly.
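A quick way to confirm that the files are in place before running the baseline is sketched below. The helper name `missing_files` is hypothetical, and the file names are copied from the code later in this tutorial; adjust them to match what you actually downloaded.

```python
from pathlib import Path


def missing_files(folder, names):
    """Return the names from `names` that are not regular files inside `folder`."""
    base = Path(folder)
    return [name for name in names if not (base / name).is_file()]


if __name__ == "__main__":
    # File names as they appear in the baseline code (assumed; rename as needed).
    expected = [
        " Game training set .csv",
        " Competition test set .csv",
        " Submit sample .csv",
    ]
    missing = missing_files(".", expected)
    if missing:
        print("Missing data files:", missing)
    else:
        print("All data files found.")
```

Running this from the folder that holds the code file lists whichever data files still need to be copied there.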
1.3 Reference material
For setting up Python, refer to:
Mac: The most comprehensive tutorial on installing Anaconda on Mac https://zhuanlan.zhihu.com/p/350828057
Windows: A very detailed Anaconda installation tutorial
https://blog.csdn.net/fan18317517352/article/details/123035625
2. Approach
This is a data mining competition. Participants build a model from the training set data, predict on the test set data, and submit the predictions.
The task is to build a model that predicts whether a patient has diabetes from the patient's test data. This is a typical binary classification problem (has diabetes / does not have diabetes), and the model outputs 0 or 1 (has diabetes: 1, no diabetes: 0).
In machine learning, classification tasks usually bring to mind algorithms such as logistic regression and decision trees. In this baseline, we use a decision tree to build the model. Machine learning problems are generally solved by following a fixed process, which the code in section 2.1 walks through step by step.
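The usual flow (split the labeled data, fit a model, check it on held-out rows, then predict) can be sketched on synthetic data as follows. This is a minimal illustration, not the competition baseline: the data comes from scikit-learn's `make_classification`, and the 80/20 split, the `max_depth=4` setting, and the choice of accuracy and F1 as metrics are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the patient records.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out 20% of the labeled rows to estimate how well the model generalizes.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit a shallow decision tree; limiting depth is one simple way to curb overfitting.
model = DecisionTreeClassifier(max_depth=4, random_state=0)
model.fit(X_train, y_train)

# Predictions are class labels, 0 or 1, like the competition output.
valid_pred = model.predict(X_valid)
print("accuracy:", accuracy_score(y_valid, valid_pred))
print("F1:", f1_score(y_valid, valid_pred))
```

Holding out part of the training data gives a local estimate of model quality before anything is submitted to the leaderboard.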

2.1 Code implementation
Run the following code in Jupyter Notebook or another Python environment.
# Install the dependencies. On Windows, run pip install in the cmd window; see the environment setup references above.
#!pip install scikit-learn
#!pip install pandas
#---------------------------------------------------
# Import libraries
#---------------- Data exploration ----------------
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
# Data preprocessing
# File and column names must match those in the downloaded data.
data1=pd.read_csv(' Game training set .csv',encoding='gbk')
data2=pd.read_csv(' Competition test set .csv',encoding='gbk')
# Mark the test-set labels as -1 so the two sets can be separated again later
data2[' Signs of diabetes ']=-1
# Merge the training set and the test set
data=pd.concat([data1,data2],axis=0,ignore_index=True)
# Fill missing values in the diastolic pressure feature with -1
data[' diastolic pressure ']=data[' diastolic pressure '].fillna(-1)
#---------------- Feature Engineering ----------------
"""
Convert the year of birth into age
"""
data[' Age ']=2022-data[' Year of birth '] # Convert to age
"""
The normal adult body mass index (BMI) is between 18.5 and 24
Below 18.5 is underweight
Between 24 and 27 is overweight
Above 27 is considered obese
Above 32 is severely obese
"""
def BMI(a):
    if a<18.5:
        return 0
    elif 18.5<=a<=24:
        return 1
    elif 24<a<=27:
        return 2
    elif 27<a<=32:
        return 3
    else:
        return 4
data['BMI']=data[' Body mass index '].apply(BMI)
# Family history of diabetes
"""
No record
One uncle or aunt has diabetes / One uncle or aunt has diabetes
One parent has diabetes
"""
def FHOD(a):
    if a==' No record ':
        return 0
    elif a==' One uncle or aunt has diabetes ' or a==' One uncle or aunt has diabetes ':
        return 1
    else:
        return 2
data[' Family history of diabetes ']=data[' Family history of diabetes '].apply(FHOD)
"""
The normal diastolic pressure range is 60-90
"""
def DBP(a):
    if 0<=a<60:
        return 0
    elif 60<=a<=90:
        return 1
    elif a>90:
        return 2
    else:
        # Missing values were filled with -1 earlier and are passed through unchanged
        return a
data['DBP']=data[' diastolic pressure '].apply(DBP)
#------------------------------------
# Split the engineered data back into a training set and a test set: the training set is used to train the model, and the test set is what the model predicts for submission
# The record number has no relationship to whether the patient has diabetes, so this irrelevant feature is dropped
train=data[data[' Signs of diabetes '] !=-1]
test=data[data[' Signs of diabetes '] ==-1]
train_label=train[' Signs of diabetes ']
train=train.drop([' Number ',' Signs of diabetes ',' Year of birth '],axis=1)
test=test.drop([' Number ',' Signs of diabetes ',' Year of birth '],axis=1)
#---------------- model training ----------------
model = DecisionTreeClassifier()
model.fit(train, train_label)
y_pre=model.predict(test)
y_pre
#---------------- Results output ----------------
result=pd.read_csv(' Submit sample .csv')
result['label']=y_pre
result.to_csv('result-de.csv',index=False)
2.2 Submitting results
On the result submission page, submit the predicted results .csv file (the CSV generated by the program, result-de.csv in the code above), then check your score and ranking.
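Before uploading, it can help to sanity-check the prediction file. The sketch below assumes, as in the code above, that predictions go in a column named `label` and that the only valid values are 0 and 1. The helper name `check_submission` is hypothetical, and the platform's exact format requirements should be taken from the competition page.

```python
import pandas as pd


def check_submission(result, expected_rows):
    """Return a list of problems found in a submission DataFrame (empty if none)."""
    if "label" not in result.columns:
        return ["missing 'label' column"]
    problems = []
    if len(result) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(result)}")
    if result["label"].isna().any():
        problems.append("label column contains missing values")
    if not result["label"].dropna().isin([0, 1]).all():
        problems.append("label column contains values other than 0 and 1")
    return problems


if __name__ == "__main__":
    # Toy frame standing in for the generated CSV; load yours with pd.read_csv.
    demo = pd.DataFrame({"label": [0, 1, 1, 0]})
    print(check_submission(demo, expected_rows=4))
```

In practice, `expected_rows` would be the number of rows in the test set, so a mismatch points to rows lost or duplicated during preprocessing.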



