What is label encoding? How to distinguish and use one hot encoding and label encoding?
2022-07-03 15:01:00 【Hali_ Botebie】
What is label encoding?
Label encoding means encoding categories with labels: each original feature value is mapped to a user-defined integer label, which turns a categorical feature into a numeric one.
For example:
Suppose a color feature takes three values: red, yellow, and blue. Machine learning algorithms generally need vectorized or numeric input, so you might set red = 1, yellow = 2, blue = 3. This is label encoding: assigning a numeric label to each category.
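The color example can be sketched in a few lines of plain Python. The helper name `label_encode` is hypothetical and introduced only for illustration; it assigns labels 1, 2, 3 in order of first appearance so that the result matches the mapping in the text. Library encoders such as scikit-learn's `LabelEncoder` instead sort the classes and number them from 0.

```python
def label_encode(values, mapping=None):
    """Return (codes, mapping), where each category is replaced by its integer label."""
    if mapping is None:
        # No mapping given: assign 1, 2, 3, ... in order of first appearance.
        mapping = {}
        for value in values:
            if value not in mapping:
                mapping[value] = len(mapping) + 1
    return [mapping[value] for value in values], mapping


colors = ["red", "yellow", "blue", "yellow", "red"]
codes, mapping = label_encode(colors)
print(codes)    # [1, 2, 3, 2, 1]
print(mapping)  # {'red': 1, 'yellow': 2, 'blue': 3}
```

Passing the returned `mapping` back in when encoding new data keeps the labels consistent between datasets.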
Characteristics
- Advantage: It solves the problem of encoding categorical values, and the numeric labels can be chosen freely. That freedom is also a disadvantage, because the numeric value itself carries no meaning; it is only an ordering. For example, large, medium, and small can be encoded as 1, 2, 3 or just as well as 3, 2, 1, so the values themselves are meaningless.
- Disadvantage: Poor interpretability. For example, if we encode [dog, cat, dog, mouse, cat] as [1, 2, 1, 3, 2], a strange artifact appears: the average of dog and mouse is cat. For this reason, label encoding does not have a wide range of applications.
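The dog/cat/mouse artifact is easy to reproduce. The following is a minimal sketch in plain Python; the second mapping, `relabeled`, is an assumption added here to show that the "relationship" depends entirely on which labels happen to be chosen.

```python
animals = ["dog", "cat", "dog", "mouse", "cat"]

# Arbitrary labels, as in the example above.
mapping = {"dog": 1, "cat": 2, "mouse": 3}
codes = [mapping[a] for a in animals]
print(codes)  # [1, 2, 1, 3, 2]

# Arithmetic on the labels yields a "relationship" that does not exist in the data.
mean_dog_mouse = (mapping["dog"] + mapping["mouse"]) / 2
print(mean_dog_mouse == mapping["cat"])  # True

# A different, equally valid labeling changes the result.
relabeled = {"dog": 3, "cat": 1, "mouse": 2}
mean_relabeled = (relabeled["dog"] + relabeled["mouse"]) / 2
print(mean_relabeled)  # 2.5
```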
One-hot encoding and label encoding: how do you distinguish and use the two?
1. Feature data type
For categorical (nominal) data, one-hot encoding is recommended. Nominal categories are pure categories: they have no order and no logical relationship between them. Gender, for example, is divided into male and female; there is no ordering between the two, and we cannot say that one ranks above the other. Likewise, China's provinces and cities can be one-hot encoded, since there is no ordering among provinces, and one-hot encoding fits this case better. Note, however, that one variable is usually dropped: anyone who is not male must be female, so the female column repeats information already present and only one of the two variables needs to be kept.
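A minimal sketch of one-hot encoding for the gender example, with and without dropping the redundant column. The function `one_hot` and its `drop_first` parameter are hypothetical names written for this illustration in plain Python; pandas offers a similar option as `get_dummies(..., drop_first=True)`.

```python
def one_hot(values, categories, drop_first=False):
    """Encode each value as a 0/1 vector with one position per kept category."""
    kept = categories[1:] if drop_first else categories
    return [[1 if value == category else 0 for category in kept] for value in values]


genders = ["male", "female", "female", "male"]
categories = ["female", "male"]  # column order: female, male

full = one_hot(genders, categories)
print(full)     # [[0, 1], [1, 0], [1, 0], [0, 1]]

# Dropping the first column ("female") loses nothing: male == 0 implies female.
reduced = one_hot(genders, categories, drop_first=True)
print(reduced)  # [[1], [0], [0], [1]]
```

In the full encoding the two columns always sum to 1, which is why one of them carries no extra information.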
For ordinal data, label encoding is recommended. Ordinal types are also categories, but they have a ranking logic: some categories rank higher than others. Education level, for example, is divided into primary school, junior high school, high school, undergraduate, and graduate. There is a clear order among these categories: graduate education is the highest and primary school the lowest. Label encoding is more appropriate here, because the custom numeric order does not destroy the original logic and can be chosen to correspond to it.
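For ordinal data the key point is that the labels follow the real ranking, so the order has to be supplied explicitly rather than inferred. A minimal sketch in plain Python, assuming the education levels listed above; the names `EDUCATION_ORDER` and `rank` are illustrative. scikit-learn's `OrdinalEncoder` accepts a comparable explicit ordering through its `categories` parameter.

```python
# Explicit ranking, lowest to highest.
EDUCATION_ORDER = [
    "primary school",
    "junior high school",
    "high school",
    "undergraduate",
    "graduate",
]
rank = {level: index for index, level in enumerate(EDUCATION_ORDER)}

people = ["high school", "graduate", "primary school", "undergraduate"]
codes = [rank[level] for level in people]
print(codes)  # [2, 4, 0, 3]

# Comparisons on the codes now agree with the real-world order.
print(rank["graduate"] > rank["primary school"])  # True
print(sorted(people, key=rank.get))
# ['primary school', 'high school', 'undergraduate', 'graduate']
```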
2. The model used
Models that are sensitive to numerical magnitude should use one-hot encoding. Typical examples are LR and SVM. The loss functions of both are sensitive to numerical values, and differences in value between variables are treated as meaningful. The numeric codes produced by label encoding, however, carry no numerical meaning; they are only an ordering. So one-hot encoding should be used for these models.
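One way to see why magnitude matters is to compare distances between encoded categories. This is an illustration added here, not part of the original argument, and it uses plain Python with arbitrary labels for three colors: under label encoding some categories end up "closer" than others, while under one-hot encoding every pair is equally far apart.

```python
import math

categories = ["red", "yellow", "blue"]
label = {"red": 1, "yellow": 2, "blue": 3}  # arbitrary labels
one_hot = {c: [1 if c == other else 0 for other in categories] for c in categories}


def label_distance(a, b):
    """Absolute difference between two label codes."""
    return abs(label[a] - label[b])


def one_hot_distance(a, b):
    """Euclidean distance between two one-hot vectors."""
    return math.dist(one_hot[a], one_hot[b])


# Label encoding: red looks twice as far from blue as from yellow.
print(label_distance("red", "yellow"), label_distance("red", "blue"))  # 1 2

# One-hot encoding: all pairs are equidistant.
print(math.isclose(one_hot_distance("red", "yellow"),
                   one_hot_distance("red", "blue")))  # True
```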
For models that are insensitive to numerical magnitude, one-hot encoding is not recommended. These are generally tree models. If there are many categories, one-hot encoding splits the feature into a large number of feature variables. If the depth of the tree is then limited and it cannot keep splitting, some of those feature variables may be discarded because the model has no splits left for them. In this case, label encoding can be considered.
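The growth in feature count can be made concrete with a small sketch. The helper `encoded_width` and the 30-category `regions` column are hypothetical, made up for this illustration: a column with k distinct categories becomes k columns under one-hot encoding but stays a single column under label encoding.

```python
def encoded_width(values, method):
    """Number of feature columns an encoding produces for one categorical column."""
    n_categories = len(set(values))
    if method == "one-hot":
        return n_categories
    if method == "label":
        return 1
    raise ValueError(f"unknown method: {method}")


# Hypothetical column with 30 distinct categories.
regions = [f"region_{i % 30}" for i in range(300)]

print(encoded_width(regions, "one-hot"))  # 30
print(encoded_width(regions, "label"))    # 1
```

A depth-limited tree can split on only a small number of those 30 binary columns along any path, which is the situation the paragraph above describes.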
These two considerations need to be weighed together rather than judged in isolation. In other words, choose the encoding method according to both the data type and the model.
————————————————
Copyright notice: This is an original article by CSDN blogger 「Plain paper and breeze」, released under the CC 4.0 BY-SA license. When reprinting, please include the original source link and this notice.
Link to the original text: https://blog.csdn.net/weixin_45834085/article/details/102991983