[Feature selection] Several methods of feature selection
2022-07-24 20:29:00 【Sunny qt01】
- Feature selection
Invalid variables
Irrelevant variables, redundant variables
Feature selection with statistical methods
Variance thresholding, chi-square test, ANOVA and t-test, Pearson correlation coefficient
Selection among highly correlated features (redundant variables)
Model-based feature selection
Decision tree, logistic regression, random forest, XGBoost
The model selects variables automatically
Recursive feature elimination
Eliminate features step by step until their number falls within a target range.

As the number of input variables grows, the amount of data must grow too; otherwise the model becomes unstable.
- Invalid variables
Irrelevant variables and redundant variables

Redundancy: the correlation between two variables is too high, which suggests that the two may describe nearly the same concept. These are redundant variables. You can merge them, or even delete one of the fields, because both carry largely the same information.
Irrelevancy: compare X4 and X3. As X4 gets larger, the target value changes with it, so X4 is relevant. When X3 changes, the target value varies at random: X3 is unrelated to the target and brings no information, so it is an irrelevant variable.

- Feature selection with statistical methods
VT (variance thresholding): compute the variance of each numeric field. If it is below a certain value, the field contains too little information.
Do not standardize before computing the variance. After Z-score standardization, for example, every field has variance 1 and mean 0, so the variances can no longer be compared.
A threshold must be chosen; fields whose variance falls below it are deleted.
Binary variables: code one category as 1 and the other as 0 (do this feature transformation first). The variance is then P(1-P), where P is the proportion of 1s.

The larger the variance, the more important the field is considered to be. For a binary variable the maximum is 0.25, reached at P = 0.5.
Note that this measure has nothing to do with the target.
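The variance thresholding rule above can be sketched in a few lines of standard-library Python. This is only an illustration: the field names, the values and the 0.1 threshold are made up, and `variance_threshold` is a hypothetical helper, not a function from any library.

```python
from statistics import pvariance


def variance_threshold(columns, threshold):
    """Return the names of the fields whose population variance exceeds threshold."""
    return [name for name, values in columns.items()
            if pvariance(values) > threshold]


# Hypothetical, unstandardized data (names and values are illustrative only).
data = {
    "age": [23, 35, 47, 52, 61],       # widely spread values
    "constant_flag": [1, 1, 1, 1, 1],  # zero variance, carries no information
    "is_member": [1, 0, 1, 0, 0],      # binary field coded as 0/1
}

kept = variance_threshold(data, threshold=0.1)

# For a 0/1 field the population variance equals P * (1 - P).
p = sum(data["is_member"]) / len(data["is_member"])
binary_variance = pvariance(data["is_member"])

print(kept)
print(binary_variance, p * (1 - p))
```

The threshold has to be picked per dataset; because the variance depends on each field's scale, a single cut-off is only meaningful when the fields are measured in comparable units.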
- Statistical tests:
These test the relationship between an input field and the target field.
Categorical fields: chi-square test, which tests the relationship between the input field and the target field.
Numeric fields: ANOVA (when the target field has more than 2 categories) or t-test (when the target field has only 2 values, such as yes or no). Both verify the relevance between the input field and the target field.
Chi-square case study: does background music affect consumers' mood? The test looks at the relationship between the music played (input field) and the alcohol purchased.
Music: no music, French accordion, Italian accordion
Alcohol: French, Italian, other alcoholic beverages
Statistic:

For every cell, take the actual sales minus the expected count, square the difference and divide by the expected count; then sum over all cells: chi-square = sum of (observed - expected)^2 / expected.

This is the expected frequency. Assuming the two fields are independent, the expected count of a cell is probability 1 multiplied by probability 2, multiplied by the total, 243.
Subtract the expected table from the observed table, square the differences, divide each by its expected count, and sum.

The larger the value, the stronger the evidence of a relationship. The value to compare against can be found in a chi-square table.
First compute the chi-square value, then use it to look up the corresponding probability in the table. If that probability is below the significance level 0.05, the chance that the two fields are unrelated is very small, so independence is rejected.
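The calculation just described can be sketched as follows. The counts are hypothetical (they are not the 243 observations of the music case study), and `chi_square` is an illustrative helper. To avoid depending on a statistics library, the sketch compares the statistic with the tabulated critical value for 1 degree of freedom at the 0.05 level (3.841) instead of computing a p-value.

```python
def chi_square(observed):
    """Return (statistic, degrees of freedom, expected counts) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Under independence: expected = P(row) * P(column) * total = row * col / total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum((o - e) ** 2 / e
                    for obs_row, exp_row in zip(observed, expected)
                    for o, e in zip(obs_row, exp_row))
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof, expected


# Hypothetical 2x2 table: rows = input field categories, columns = target categories.
table = [[20, 30],
         [30, 20]]

statistic, dof, expected = chi_square(table)

CRITICAL_005_DF1 = 3.841  # chi-square table value for dof = 1, alpha = 0.05
related = statistic > CRITICAL_005_DF1

print(statistic, dof, related)
```

With these made-up counts every expected cell is 25, so the statistic is 4.0 and independence is rejected at the 0.05 level.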
Case study: chi-square test results for the microfinance data:

Variables 1, 2, 3 and 4 are more important; variables 5, 6, 7 and 8 are not important.
t-test procedure: first run the F-test, then the t-test.


Variables with a probability below 0.05 are treated as important.
ANOVA procedure: first find the F-value, then the t-value.

The results are very close.
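One possible reading of why the two procedures agree: when the target has exactly two groups, the one-way ANOVA F-value equals the square of the pooled-variance t-value. The sketch below checks this on made-up numbers. Both helper functions are illustrative, and the pooled t-statistic assumes equal variances in the two groups, which may not be the variant used in the original case.

```python
from statistics import mean, variance


def pooled_t(a, b):
    """Two-sample t statistic using the pooled (equal-variance) estimate."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


def anova_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical numeric input field, split by a two-valued target (yes / no).
values_yes = [1, 2, 3]
values_no = [2, 4, 6]

t_value = pooled_t(values_yes, values_no)
f_value = anova_f([values_yes, values_no])

print(t_value ** 2, f_value)
```

Turning either statistic into the probability that is compared with 0.05 requires the t or F distribution, which is left out of this sketch.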
Pearson correlation coefficient:
Selection among highly correlated features (redundant variables):
Highly correlated fields occur often and carry repeated information. Use the Pearson correlation coefficient to check the correlation between each pair of fields; if it is greater than 0.95, remove one of the two variables.
To decide which one to keep, compare how strongly variable 1 and variable 2 each relate to the target.
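A minimal sketch of this redundancy filter, using only the standard library. The feature names and values are hypothetical, `drop_redundant` is an illustrative helper, and two choices are assumptions rather than something stated above: the correlation is compared in absolute value, and the variable kept from each pair is the one more strongly correlated with the target.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def drop_redundant(features, target, threshold=0.95):
    """For each pair correlated above threshold, drop the one less related to the target."""
    names = list(features)
    dropped = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if abs(pearson(features[a], features[b])) > threshold:
                rel_a = abs(pearson(features[a], target))
                rel_b = abs(pearson(features[b], target))
                dropped.add(a if rel_a < rel_b else b)
    return [name for name in names if name not in dropped]


# Hypothetical data: x2 is almost a copy of x1, x3 is only weakly related to both.
features = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [1.1, 1.9, 3.2, 3.9, 5.0],
    "x3": [5, 1, 4, 2, 3],
}
target = [2, 4, 6, 8, 10]

kept = drop_redundant(features, target)
print(kept)
```

Here x1 and x2 correlate at about 0.997, so one of them is removed; x1 tracks the target more closely and is the one kept.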
- Model-based feature selection
Decision tree, logistic regression, random forest, XGBoost
The model automatically selects the most important variables and avoids variables that are collinear,
so it can deal with both collinearity and irrelevant variables.
- RFECV (recursive feature elimination with cross-validation)
CV: cross-validation is used to evaluate each candidate set of variables.
RFE: remove variables repeatedly.
The evaluation metric can be any metric you choose. Remove a variable and check whether the metric gets worse.

Backward: first use cross-validation to get the metric value, then remove one variable. If the metric gets better, continue removing; if the metric gets worse, go back and do not eliminate that variable.
There are 3 methods: forward selection, backward elimination, and stepwise regression.
This approach gives the best results, but it is computationally expensive and time-consuming.
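The backward procedure can be sketched as a greedy loop. This is an illustration under stated assumptions, not a definitive implementation: `backward_elimination` is a hypothetical helper, and `toy_cv_score` is a made-up stand-in for a real cross-validated metric (in practice it would train and evaluate a model on the given subset of variables).

```python
def backward_elimination(features, score_fn):
    """Greedily drop one feature at a time while the score strictly improves."""
    selected = list(features)
    best = score_fn(selected)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for name in list(selected):
            candidate = [f for f in selected if f != name]
            score = score_fn(candidate)
            if score > best:
                # The metric got better: accept the removal and try again.
                selected, best = candidate, score
                improved = True
                break
            # Otherwise the metric got worse or stayed equal: keep the feature.
    return selected, best


# Hypothetical contribution of each useful feature to the metric.
USEFUL = {"x1": 0.30, "x2": 0.20}


def toy_cv_score(subset):
    """Stand-in for a cross-validated score: useful features help, every extra one costs a bit."""
    return sum(USEFUL.get(f, 0.0) for f in subset) - 0.01 * len(subset)


selected, best = backward_elimination(["x1", "x2", "noise_a", "noise_b"], toy_cv_score)
print(selected, best)
```

Every candidate removal costs one full evaluation of the metric, which is why the approach is slow when each evaluation is itself a cross-validation run.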