[Feature Engineering Summary] What features are and the steps of feature engineering
2022-07-24 20:29:00 【Sunny qt01】
- Introduction to feature engineering
It is often said that data and features determine the upper limit of machine learning, and that algorithms and models only approach that limit. Feature engineering therefore plays an indispensable role in machine learning.
Looking back at competitions on Kaggle, KDD, and other contests at home and abroad, the winners usually did not use very sophisticated algorithms. Most of them did excellent feature engineering and then got excellent performance from common algorithms.
Feature engineering is a key factor in machine learning.
- The importance and purpose of feature engineering
The purpose of feature engineering is to transform fields into features that better represent the underlying problem, and thereby improve the performance of machine learning.
1. The better the features, the more flexibility you have: good features perform well with almost any model. They let you choose a less complex model that also runs faster and is easier to understand and maintain.
2. The better the features, the simpler the model: with good features, the model can still perform well even when its parameters are not optimal, so you do not have to spend much time searching for the optimal parameters. This reduces model complexity considerably.
3. The better the features, the better the model performs: the whole point of searching for features is to improve the performance of the model.
How to evaluate feature engineering
1. Build a baseline machine learning model (Baseline Model, the most basic model).
2. Apply one or more feature engineering techniques to the raw data.
3. Rebuild the machine learning model and compare it with the Baseline Model.
4. If the gain in performance is greater than a chosen threshold, the feature engineering is beneficial.
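The evaluation steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the model is a stand-in (a leave-one-out 1-nearest-neighbour classifier), the data is made up, and the names `loo_1nn_accuracy`, `evaluate_feature_engineering`, and the `threshold` value are hypothetical rather than taken from the article.

```python
from typing import Callable, Dict, List, Sequence


def loo_1nn_accuracy(rows: List[Sequence[float]], labels: Sequence[int]) -> float:
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier (stand-in model)."""
    correct = 0
    for i, row in enumerate(rows):
        # Find the closest other row by squared Euclidean distance.
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(row, rows[j])),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def evaluate_feature_engineering(
    rows: List[Sequence[float]],
    labels: Sequence[int],
    transform: Callable[[Sequence[float]], Sequence[float]],
    score: Callable[[List[Sequence[float]], Sequence[int]], float] = loo_1nn_accuracy,
    threshold: float = 0.05,  # assumed critical value for "beneficial"
) -> Dict[str, object]:
    baseline = score(rows, labels)                      # step 1: baseline model
    engineered_rows = [transform(r) for r in rows]      # step 2: apply the technique
    engineered = score(engineered_rows, labels)         # step 3: rebuild and compare
    gain = engineered - baseline
    return {                                            # step 4: check the threshold
        "baseline": baseline,
        "engineered": engineered,
        "gain": gain,
        "beneficial": gain > threshold,
    }


# Made-up toy data: (income, debt) per customer; label 1 means a high debt ratio.
rows = [(100, 10), (1000, 100), (100, 80), (1000, 800), (500, 50), (500, 400)]
labels = [0, 0, 1, 1, 0, 1]

# Candidate feature: replace the raw fields with the debt-to-income ratio.
result = evaluate_feature_engineering(rows, labels, transform=lambda r: [r[1] / r[0]])
print(result)
```

On this made-up data the baseline classifies 2 of the 6 rows correctly and the ratio feature classifies all 6, so the gain clears the threshold. In practice the score would come from a proper validation scheme and the model actually under study.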
Before the main feature engineering work, we need two steps:
1. Feature understanding: know what fields are in the data set.
2. Feature improvement: preprocess the data in those fields.
- Feature understanding
1. Is the data structured (tables) or unstructured (text, speech, video, audio)?
2. Field types: numeric, categorical, ordinal, binary.
3. Exploratory data analysis (Exploratory Data Analysis), which gives you an overall picture of the data:
(1) Descriptive statistics (Descriptive Statistics): number of distinct values, number of null values, distribution of category values, maximum, minimum, mean, standard deviation, outliers, and other data quality reports.
(2) Data visualization (Data Visualization): use a suitable mix of charts (pie charts, bar charts, histograms, scatter plots, and so on), which can be plotted against the target field.
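The descriptive statistics listed above can be collected into a small data quality report. The sketch below uses only the Python standard library; the function name `quality_report` and the sample values are made up for illustration, and nulls are assumed to be represented as `None`.

```python
import statistics
from collections import Counter


def quality_report(values, numeric):
    """Summarize one field: counts, nulls, distinct values, and basic statistics."""
    present = [v for v in values if v is not None]  # nulls are assumed to be None
    report = {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
    }
    if numeric:
        report.update(
            min=min(present),
            max=max(present),
            mean=statistics.mean(present),
            std=statistics.pstdev(present),  # population standard deviation
        )
    else:
        # Distribution of category values.
        report["distribution"] = dict(Counter(present))
    return report


# Made-up sample values for a numeric field and a category field.
ages = [23, 35, None, 41, 35, 58]
sexes = ["FEMALE", "MALE", "MALE", None, "female", "FEMALE"]

age_report = quality_report(ages, numeric=True)
sex_report = quality_report(sexes, numeric=False)
print(age_report)
print(sex_report)
```

A category field such as gender is expected to have two distinct values, so a count of three (here caused by inconsistent capitalization) is the kind of problem this report is meant to surface.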
Case study: microcredit data set
The microfinance data set contains data on 1551 customers.
Each customer record contains one target field (Target Attribute) and 10 input fields (Input Attribute).
8 fields are categorical and 2 are numeric.
The project divides customers into two categories:
1. Customers who will take out a microloan (will respond): 167 records (about 10.8% of the total).
2. Customers who will not take out a microloan (will not respond): 1384 records.
Field 01: age (numeric field): age
Field 02: sex (category field): gender
Field 03: region (category field): residential area
Field 05: income (numeric field): monthly income
Field 06: children (category field): number of children in the family
Field 07: car (category field): whether the customer owns a car
Field 08: save_act (category field): whether the customer has a savings account
Field 09: current_act (category field): whether the customer has a current account
Field 10: mortgage (category field): whether the customer has a mortgage

Descriptive statistics: data quality report
Overall summary table

The number of distinct values for gender looks problematic.
Table of numeric fields
Compare the maximum and minimum values with the last two columns to check for outliers.

Table 3: characteristics of the categorical fields

Gender seems to have little to do with the target field.

Different residential areas have slightly different loan ratios (more important than gender).

Married people seem to have greater demand for loans.

The number of children is related to whether a customer takes a loan, possibly because customers without children have fewer family responsibilities (an important field).

Whether the customer owns a car seems to have little to do with taking a loan.

Is having a current account related to taking a loan?

Age shows a downward trend.

Income also shows a downward trend.

Feature improvement:
Improve the features once they are understood.
Data cleaning: handling of wrong values, null values, and outliers (explained previously).
Data encoding, data standardization (Data Standardization), and type conversion: Z-score, Min-Max, and encoding of category and ordinal fields (explained previously).
Data generalization and data normalization (scaling each vector to length 1; used for text analysis and explained in Part 3).
Structuring unstructured data (detailed in Part 3).
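As a reminder of what Z-score, Min-Max, and category encoding do, here is a minimal standard-library sketch. The function names and sample values are made up, one-hot encoding is used as one possible category encoding (the article does not say which one it means), and each numeric helper assumes the values are not all identical.

```python
import statistics


def z_score(values):
    """Standardize to mean 0 and standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation, assumed non-zero
    return [(v - mean) / std for v in values]


def min_max(values):
    """Rescale to the range [0, 1]; assumes max > min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(values):
    """Encode a category field as 0/1 columns, one per category in sorted order."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]


incomes = [1000.0, 2000.0, 3000.0]      # made-up monthly incomes
car = ["YES", "NO", "YES"]              # made-up category values

print(z_score(incomes))   # values centred on 0
print(min_max(incomes))   # values between 0 and 1
print(one_hot(car))       # columns are ordered NO, YES
```

Z-score is the usual choice when a model is sensitive to the spread of a field, while Min-Max keeps the values in a fixed range; both are sensitive to outliers, which is one reason data cleaning comes first.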
Normalization of (non-text) data
L2: the Euclidean length of the data vector equals 1, that is, the square root of the sum of the squared components is 1.
L1: the sum of the absolute values of the components is 1.
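The two normalizations can be written out directly. This is a small sketch with made-up helper names, assuming the input vector is not all zeros.

```python
import math


def l2_normalize(vector):
    """Scale the vector so that sqrt(sum of squares) equals 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def l1_normalize(vector):
    """Scale the vector so that the sum of absolute values equals 1."""
    norm = sum(abs(x) for x in vector)
    return [x / norm for x in vector]


v = [3.0, 4.0]  # made-up two-component vector
v_l2 = l2_normalize(v)
v_l1 = l1_normalize(v)

print(v_l2, math.sqrt(sum(x * x for x in v_l2)))  # second value is the L2 length
print(v_l1, sum(abs(x) for x in v_l1))            # second value is the L1 length
```

For the vector (3, 4) the L2 norm is 5 and the L1 norm is 7, so the normalized vectors are (3/5, 4/5) and (3/7, 4/7).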
- Coverage of feature engineering
Feature construction: construct new features and explore the relationships between features.
Methods: external data, data exploration, expert experience, data analysis, and feature construction methods.
Feature selection: keep the useful features and say no to the bad ones.
Methods: statistical methods, highly correlated features, model-based methods (random forests, decision trees), and recursive feature selection (stepwise regression and so on).
Feature transformation (with assumptions, for example PCA): use mathematical methods (simple addition, subtraction, multiplication, and division, principal component and factor analysis, and so on) to merge old fields (which may be bad features) into new features and extract the latent structure hidden in the data.
Linear: PCA, matrix factorization (NMF, SVD, TSVD), LDA.
Nonlinear: Kernel PCA, t-SNE, neural networks.
Two linear transformations
Feature learning (without assumptions): use deep learning to learn new features automatically.
Feature learning based on association rules, neural networks, and deep learning; text feature learning based on word embeddings.
Using AI to advance AI.
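As one possible reading of the statistical approach to feature selection listed above, the sketch below keeps only the fields whose Pearson correlation with the target is strong enough. The function names, the `min_abs_corr` cut-off, and the data are all made up for illustration; the article does not specify which statistic it uses.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)  # assumes neither field is constant


def select_by_correlation(features, target, min_abs_corr=0.5):
    """Keep the fields whose absolute correlation with the target reaches the cut-off."""
    scores = {name: pearson(column, target) for name, column in features.items()}
    kept = [name for name, r in scores.items() if abs(r) >= min_abs_corr]
    return kept, scores


# Made-up data: 1 = responded to the loan offer, 0 = did not.
target = [1, 0, 1, 0, 1, 0]
features = {
    "children": [0, 2, 0, 3, 1, 2],
    "car": [1, 0, 0, 1, 1, 0],
}

kept, scores = select_by_correlation(features, target)
print(kept)
print(scores)
```

With this toy data the `children` field correlates strongly (negatively) with the target and is kept, while `car` falls below the cut-off and is dropped. Model-based and recursive methods would replace the correlation score with feature importances or repeated model fits.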