A systematic walkthrough | The step many of us forget to do carefully before model development (hands-on)
2022-06-22 04:26:00 【Tomato risk control】
Model development follows a standardized process: data cleaning, feature selection, variable selection, and model fitting, much like an assembly line in a factory. Today we pick one important link in that line and walk through feature encoding methods systematically, using the popular XGBoost (xgb) for the hands-on part. In day-to-day modeling we often train models with XGBoost, but the sample data almost always contains some categorical string variables, such as education level or housing type. We therefore often apply One-Hot encoding to convert these categorical string variables into numeric ones so that the model can be fitted and trained normally. Why does XGBoost need categorical variables converted? The base learners XGBoost uses are CART regression trees, which means that this kind of boosted-tree algorithm, in its classic usage, accepts only numeric feature input and does not directly support categorical features. So before training an XGBoost model we have to apply a suitable feature encoding to the categorical features. One-Hot encoding is the common choice, but knowing several encoding methods and applying them according to the business scenario gives us better options for model training and model optimization.
This article focuses on that scenario and introduces several commonly used encodings for categorical variables in XGBoost modeling samples: One-Hot encoding, Label encoding, Ordinal encoding, Frequency encoding, and Mean encoding. Each method is demonstrated on a concrete sample case, with emphasis on its basic characteristics and how to apply it.
1、 Sample data for the case
The sample data for this case is shown in Figure 1. The fields "Marital status", "Education level", and "Housing type" are categorical string variables.

Figure 1 Sample data
Among the categorical variables in the sample, "Marital status" is an unordered categorical variable, while "Education level" and "Housing type" are ordered categorical variables. Their distributions are shown in Figure 2.

Figure 2 Descriptive analysis of the sample
With the basic information about the sample in hand, we now convert these categorical variables using each encoding method in turn.
2、One-Hot encoding
One-Hot encoding converts a categorical variable into several new binary variables, where 1 means the record has the given category and 0 means it does not. The number of new fields equals the number of categories in the original feature. It is the most commonly used feature encoding method. For example, if a categorical variable "credit rating" takes the values A, B, C, and D, One-Hot encoding produces 4 new variables, each taking the value 1 or 0 (yes/no). Note that One-Hot encoding works well when the feature has few distinct values and can produce important new variables. When the feature has many distinct values, however, it produces high-dimensional sparse features, which can easily lead to overfitting when training tree models such as XGBoost. In Python, One-Hot encoding can be done with the pandas get_dummies() function. The code is shown in Figure 3 and the output in Figure 4.

Figure 3 One-Hot encoding code

Figure 4 One-Hot encoding results
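As a minimal sketch of the One-Hot step described above, the following uses `pandas.get_dummies` on a small made-up table. The column names and values are assumptions for illustration only; they are not the article's sample data.

```python
import pandas as pd

# Hypothetical toy data; the article's real sample is not reproduced here.
df = pd.DataFrame({
    "marital_status": ["married", "single", "divorced", "single"],
    "age": [35, 28, 41, 30],
})

# Expand the categorical column into one 0/1 indicator column per category.
# Columns that are not listed in `columns` are passed through unchanged.
encoded = pd.get_dummies(df, columns=["marital_status"], dtype=int)
print(encoded)
```

Each new column is named after the original field plus the category value, which is why a field with many distinct values quickly turns into a wide, sparse table.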
3、Label encoding
Label encoding: if a categorical variable has n distinct categories, each category is mapped to an integer from 0 to n-1 according to the sorted order of the category values. In Python, Label encoding can be done with scikit-learn's LabelEncoder class. The code is shown in Figure 5 and the output in Figure 6.

Figure 5 Label encoding code
Figure 6 Label encoding results
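A short sketch of Label encoding with scikit-learn's `LabelEncoder` might look like the following. The education values are invented for illustration. Note that the codes follow the sorted order of the strings, not any meaningful ranking of the categories.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for an "education level" column.
education = ["bachelor", "high school", "master", "bachelor", "junior college"]

encoder = LabelEncoder()
codes = encoder.fit_transform(education)

# classes_ holds the categories in sorted order; a category's position is its code.
print(list(encoder.classes_))
print(list(codes))

# The mapping can be reversed to recover the original strings.
restored = encoder.inverse_transform(codes)
```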
4、Ordinal encoding
Ordinal encoding assigns codes to the categories of an ordered categorical variable according to sequence numbers specified in advance. Its logic is very similar to Label encoding; the main difference is that Ordinal encoding assigns values following the categories' own order, so the original ordering is preserved after encoding. Ordinal codes also usually start from 1. For example, for a "credit rating" variable with values A, B, C, and D, the mapping could be D-1, C-2, B-3, A-4. In Python, Ordinal encoding can be done by defining a dictionary such as temp_dict that maps each category to its code. The code is shown in Figure 7 and the output in Figure 8.

Figure 7 Ordinal encoding code

Figure 8 Ordinal encoding results
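The dictionary-based approach can be sketched as follows, reusing the credit rating example from the text. The DataFrame and the new column name are assumptions for illustration; only the name `temp_dict` and the D-1, C-2, B-3, A-4 mapping come from the article.

```python
import pandas as pd

# Hypothetical toy data using the credit rating example from the text.
df = pd.DataFrame({"credit_rating": ["A", "C", "B", "D", "A"]})

# Explicit order chosen in advance: D is the lowest grade, A the highest.
temp_dict = {"D": 1, "C": 2, "B": 3, "A": 4}

# Series.map looks each value up in the dictionary.
df["credit_rating_ord"] = df["credit_rating"].map(temp_dict)
print(df)
```

Any value missing from the dictionary becomes NaN after `map`, so it is worth checking for unexpected categories before encoding.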
5、Frequency encoding
Frequency encoding uses the frequency with which each category occurs in the feature as its new value. This method captures the distribution of the categories in the sample, which can carry signal about the target variable when category prevalence is related to it. The code is shown in Figure 7 and the output in Figure 8.

Figure 7 Frequency encoding code

Figure 8 Frequency encoding results
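One possible way to compute a frequency encoding with pandas is sketched below. The data and column names are made up, and relative frequency is used here as an assumption; raw counts would work the same way without `normalize=True`.

```python
import pandas as pd

# Hypothetical toy data for a "housing type" column.
df = pd.DataFrame({
    "housing_type": ["rent", "own", "rent", "mortgage", "rent", "own"],
})

# Relative frequency of each category in the sample.
freq = df["housing_type"].value_counts(normalize=True)

# Replace each category with its frequency.
df["housing_type_freq"] = df["housing_type"].map(freq)
print(df)
```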
6、Mean encoding
Mean encoding, also called target encoding, represents each category of the feature by the average value of the target variable over the sample records in that category. This method reflects the relationship between each category and the target variable. The code is shown in Figure 9 and the output in Figure 10.

Figure 9 Mean encoding code

Figure 10 Mean encoding results
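A bare-bones sketch of Mean encoding with a pandas group-by is shown below. The data, the column names, and the binary `target` field are assumptions for illustration. Because the encoding is built from the target itself, it is usually safer to compute the category means on training data only and then apply them to validation or test data.

```python
import pandas as pd

# Hypothetical toy data with a binary target.
df = pd.DataFrame({
    "housing_type": ["rent", "own", "rent", "mortgage", "rent", "own"],
    "target": [1, 0, 0, 1, 1, 0],
})

# Mean of the target within each category.
category_mean = df.groupby("housing_type")["target"].mean()

# Replace each category with the mean target value of that category.
df["housing_type_mean"] = df["housing_type"].map(category_mean)
print(df)
```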
7、 Applying feature encoding
The encoding methods above are all used frequently in modeling; which one to apply depends on the actual sample data. Returning to the scenario of this article, once the features are encoded we build the model with the XGBoost classification algorithm. Here we train the model on the One-Hot encoded data. The code is shown in Figure 11.

Figure 11 XGBoost model training
After training, the evaluation metrics Accuracy and AUC are output as shown in Figure 12; they are 0.8518 and 0.5131 respectively.

Figure 11 Model evaluation metrics
Next we output the importance scores of the variables in the model. The code and the output are shown in Figures 12 and 13. The importances show that several variables produced by One-Hot encoding (Housing type_Self-owned with mortgage, Housing type_Other, Education level_High school/technical secondary school, Education level_Junior college, Housing type_Rented, and so on) contribute relatively strongly to the model, which also illustrates the value of feature encoding in the feature engineering stage of modeling.

Figure 12 Feature importance code

Figure 13 Feature importance distribution
To sum up, for the scenario of handling categorical feature variables in XGBoost modeling, we introduced several common feature encoding methods: One-Hot encoding, Label encoding, Ordinal encoding, Frequency encoding, and Mean encoding. Using sample data from a practical case, we described the basic characteristics and application of each method. Finally, we trained an XGBoost model on the One-Hot encoded data and obtained a fairly reasonable model result. There are of course other feature encoding methods, but however the processing logic varies, the goal is the same: to bring categorical features into the pool of variables for model fitting in a reasonable and effective way. In actual modeling you can try different encodings, compare the results, and choose the one that works better. To help you become more familiar with these encoding methods in practice, we have prepared the sample data and Python code that accompany this article for your reference; for details, please see the related content on Knowledge Planet.
