What is label encoding? How do you choose between one-hot encoding and label encoding?
2022-07-03 15:01:00 【Hali_ Botebie】
What is label encoding?
Label encoding means encoding with labels: each original feature value is mapped to a user-defined numeric label, which turns the categorical feature into numbers.
For example:
Suppose a color feature takes three values: red, yellow, and blue. Machine learning algorithms generally need numeric or vectorized input, so you might set red = 1, yellow = 2, blue = 3. This is label encoding: each category is given its own numeric label.
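The color example can be sketched in a few lines of plain Python. The names `color_to_label` and `label_encode` are illustrative only and do not come from the original article; the mapping itself is the one given in the text.

```python
# Mapping from the example in the text: red = 1, yellow = 2, blue = 3.
color_to_label = {"red": 1, "yellow": 2, "blue": 3}


def label_encode(values, mapping):
    """Replace each category with its numeric label."""
    return [mapping[v] for v in values]


colors = ["red", "blue", "yellow", "red"]
encoded = label_encode(colors, color_to_label)
print(encoded)  # [1, 3, 2, 1]
```

The mapping is a free choice: any other assignment of distinct numbers would be an equally valid label encoding.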
Characteristics
- Advantage: it solves the problem of encoding categorical values, and the numbers can be assigned freely. This freedom is also a drawback, because the values carry no meaning of their own; they are only an ordering. For example, large, medium, and small can be encoded as 1, 2, 3 or just as well as 3, 2, 1, so the magnitudes are arbitrary.
- Disadvantage: poor interpretability. If [dog, cat, dog, mouse, cat] is converted to [1, 2, 1, 3, 2], a strange artifact appears: the average of dog and mouse is cat. For this reason, label encoding has a fairly narrow range of applications.
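The "average of dog and mouse is cat" artifact can be reproduced directly. This is a minimal sketch in plain Python; the variable names are hypothetical, and the mapping is the one used in the text.

```python
animals = ["dog", "cat", "dog", "mouse", "cat"]
mapping = {"dog": 1, "cat": 2, "mouse": 3}

encoded = [mapping[a] for a in animals]
print(encoded)  # [1, 2, 1, 3, 2]

# Arithmetic on the labels is allowed by the numbers but meaningless
# for the categories: (dog + mouse) / 2 lands exactly on cat.
average = (mapping["dog"] + mapping["mouse"]) / 2
inverse = {label: name for name, label in mapping.items()}
print(inverse[average])  # cat
```

Any model that averages, scales, or compares these labels inherits the same accidental relationship.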
One-hot encoding and label encoding: how do you tell them apart and choose between them?
1. Feature data type
For categorical (nominal) data, one-hot encoding is recommended. Nominal categories are pure categories: they have no order and no logical relationship between them. For example, gender is divided into male and female; there is no logical relationship between the two, and we cannot say that one is greater than the other. Likewise, the provinces and cities of China can be one-hot encoded, because there is no logical relationship between provinces either, so one-hot encoding is the better fit. Note, however, that one of the resulting variables is usually dropped. For example, anyone who is not male must be female, so the female column only repeats information and keeping one of the two columns is enough.
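A minimal sketch of one-hot encoding with the redundant column dropped, in plain Python. The function `one_hot` and its `drop_first` parameter are hypothetical helpers written for this illustration, not part of the original article.

```python
def one_hot(values, categories, drop_first=False):
    """One-hot encode values; optionally drop the first category's column."""
    kept = categories[1:] if drop_first else categories
    return [[1 if v == c else 0 for c in kept] for v in values]


genders = ["male", "female", "female"]
categories = ["female", "male"]

full = one_hot(genders, categories)
print(full)     # [[0, 1], [1, 0], [1, 0]]

# With the "female" column dropped, a single "male" column remains;
# a 0 in that column already implies "female".
reduced = one_hot(genders, categories, drop_first=True)
print(reduced)  # [[1], [0], [0]]
```

For a feature with k categories this leaves k - 1 columns, and the dropped category is represented by all zeros.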
For ordered (ordinal) data, label encoding is recommended. Ordinal values are also categories, but there is a ranking among them. For example, education is divided into primary school, junior high school, high school, undergraduate, and graduate; there is a clear logic between the categories, with graduate education the highest and primary school the lowest. Label encoding is more appropriate here, because the numbers can be chosen so that their order matches the original ranking instead of destroying it.
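The education example can be sketched as follows, assuming the levels are listed from lowest to highest. The names `education_order` and `rank` are illustrative, and starting the labels at 0 is an arbitrary choice.

```python
# Levels listed from lowest to highest, as described in the text.
education_order = [
    "primary school",
    "junior high school",
    "high school",
    "undergraduate",
    "graduate",
]

# The label is the position in the ordered list, so the numeric
# order of the labels matches the ranking of the categories.
rank = {level: i for i, level in enumerate(education_order)}

samples = ["high school", "graduate", "primary school"]
encoded = [rank[s] for s in samples]
print(encoded)  # [2, 4, 0]
```

Comparisons such as `rank["graduate"] > rank["primary school"]` now agree with the real-world ordering, although the size of the gaps between labels is still arbitrary.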
2. The model used
Models that are sensitive to numerical magnitude should use one-hot encoding. Typical examples are LR and SVM. The loss functions of both are sensitive to the numeric values, and these models treat differences in magnitude as meaningful. The numbers produced by label encoding carry no such magnitude meaning; they are only an ordering. So for these models, use one-hot encoding.
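One way to see the problem is to compare distances between categories under each encoding. This is an illustrative sketch in plain Python, using Euclidean distance as a stand-in for "sensitivity to magnitude"; it is not how LR or SVM are implemented, and the helper `euclidean` is hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


label = {"red": [1], "yellow": [2], "blue": [3]}
onehot = {"red": [1, 0, 0], "yellow": [0, 1, 0], "blue": [0, 0, 1]}

# Label encoding: red looks twice as far from blue as from yellow.
label_red_yellow = euclidean(label["red"], label["yellow"])
label_red_blue = euclidean(label["red"], label["blue"])
print(label_red_yellow, label_red_blue)  # 1.0 2.0

# One-hot encoding: every pair of distinct colors is equally far apart.
onehot_red_yellow = euclidean(onehot["red"], onehot["yellow"])
onehot_red_blue = euclidean(onehot["red"], onehot["blue"])
print(math.isclose(onehot_red_yellow, onehot_red_blue))  # True
```

Under label encoding the model sees a spacing between colors that does not exist in the data; one-hot encoding removes it.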
For models that are insensitive to numerical magnitude, one-hot encoding is not recommended. These are generally tree models. If a feature has too many categories, one-hot encoding splits it into a large number of feature variables. If the depth of the tree is limited, the model may be unable to keep splitting, and some of those feature variables may never be used. In this case, consider using label encoding instead.
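The growth in feature count is easy to quantify. The sketch below uses a made-up feature with 100 distinct categories (`city_0` to `city_99`, hypothetical names) and counts the columns each encoding produces.

```python
# Hypothetical high-cardinality feature: 100 distinct city names.
categories = [f"city_{i}" for i in range(100)]
values = ["city_3", "city_42", "city_99"]

# Label encoding: one numeric column, whatever the number of categories.
label = {c: i for i, c in enumerate(categories)}
label_rows = [[label[v]] for v in values]

# One-hot encoding: one column per category.
onehot_rows = [[1 if v == c else 0 for c in categories] for v in values]

label_width = len(label_rows[0])
onehot_width = len(onehot_rows[0])
print(label_width, onehot_width)  # 1 100
```

A depth-limited tree can test only a limited number of these 100 binary columns along any path, which is the concern raised above.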
These two considerations should be weighed together rather than judged in isolation. In other words, choose the encoding method according to both the data type and the model.
————————————————
Copyright notice: This is an original article by CSDN blogger 「Plain paper and breeze」, released under the CC 4.0 BY-SA license. When reprinting, please include the original source link and this notice.
Original link: https://blog.csdn.net/weixin_45834085/article/details/102991983