Financial risk control in practice: feature engineering (part 2)
2022-07-02 02:33:00 【Grateful_ Dead424】
Feature selection
- Removing features with low variance
- Univariate feature selection
- Recursive feature elimination
- Feature selection using SelectFromModel
- Feature selection as part of a pipeline
After data preprocessing, we need to select meaningful features to feed into machine learning algorithms and models for training.
Note: feature screening is generally not done before feature derivation, except for dropping variables that are useless because they have too many missing values.
Generally speaking, features are selected from two angles:
Whether the feature diverges
If a feature does not diverge, for example its variance is close to 0, the samples barely differ on it, so it is of no use for telling samples apart. (Compare variances only after normalization.)
Correlation between the feature and the target
Features that are highly correlated with the target should be preferred. Apart from the low-variance method, feature selection methods fall into three kinds according to how the selection is made:
- Filter: score each feature by divergence or correlation, then select features by setting a score threshold or the number of features to keep.
- Wrapper: according to an objective function (usually a prediction score), add or remove several features at a time and see whether the model improves.
- Embedded: first train a machine learning algorithm or model to obtain a weight coefficient for each feature, then select features in descending order of coefficient. This resembles the filter method, except that the quality of a feature is determined by training (similar to judging feature importance with a random forest).
Feature selection has two main purposes:
- Reduce the number of features (dimensionality reduction), so that the model generalizes better and overfits less (main means: coarse binning and variable screening);
- Improve the understanding of the features and their values.
Given a data set, a single feature selection method can rarely serve both purposes at once. In practice people usually pick the method they know best or find most convenient (often with the aim of reducing dimensionality, while neglecting the aim of understanding the features and the data). Below we use the examples provided by Scikit-learn to introduce several common feature selection methods, along with their strengths, weaknesses, and pitfalls.
(1) Removing features with low variance
Suppose a feature takes only the values 0 and 1, and it equals 1 in 95% of all input samples; such a feature can be considered of little use. If it equals 1 in 100% of the samples, it carries no information at all. This approach applies only when the feature values are discrete; a continuous variable has to be discretized first. In practice it is also unlikely that a single value accounts for more than 95% of a feature, so the method is simple but not very effective on its own. It can serve as a preprocessing step: first remove the features whose values barely change, then choose a suitable method from those described below for further selection.
import numpy as np
from sklearn.feature_selection import VarianceThreshold
X = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])
# For a boolean feature Var[X] = p(1 - p); this threshold drops columns that take the same value in more than 80% of the samples
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
sel.fit_transform(X)
#array([[0, 1],
#       [1, 0],
#       [0, 0],
#       [1, 1],
#       [1, 0],
#       [1, 1]])
As expected, VarianceThreshold removed the first column, in which the value 0 occurs with probability 5/6, which is more than 0.8.
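The threshold used above comes from the variance of a Bernoulli variable, Var[X] = p(1 - p). As a quick check of that arithmetic, the sketch below recomputes the per-column variances of the same matrix with the standard library only. The helper name `column_variances` is made up for illustration and is not part of scikit-learn.

```python
from statistics import pvariance

# Same toy matrix as in the VarianceThreshold example.
X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]

def column_variances(rows):
    """Population variance of each column (hypothetical helper)."""
    return [pvariance(col) for col in zip(*rows)]

threshold = 0.8 * (1 - 0.8)   # variance of a Bernoulli variable with p = 0.8
variances = column_variances(X)

# Keep the columns whose variance exceeds the threshold.
kept = [j for j, v in enumerate(variances) if v > threshold]

print(variances)
print(kept)
```

The first column has variance (1/6)(5/6) = 5/36, which is below 0.16, so only the second and third columns survive.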
(2) Univariate feature selection
Univariate feature selection computes a statistic for each variable separately, uses it to judge which variables are important, and discards the unimportant ones.
For classification problems (discrete y), the following can be used:
- Chi-square test
- f_classif
- mutual_info_classif
- Mutual information
For regression problems (continuous y), the following can be used:
- Pearson correlation coefficient
- f_regression
- mutual_info_regression
- Maximal information coefficient
This approach is simple, easy to run, and easy to understand. It usually works well for understanding the data, but is not necessarily effective for optimizing the feature set or improving generalization.
- SelectKBest removes all features except the k highest-scoring ones (keeps the top k).
- SelectPercentile removes all features except a user-specified highest-scoring percentage (keeps the top k%).
- SelectFpr, SelectFdr, and SelectFwe apply a common univariate statistical test to each feature, controlling the false positive rate, the false discovery rate, and the family-wise error rate respectively.
- GenericUnivariateSelect performs univariate feature selection with a configurable strategy, so the strategy itself can be chosen by hyperparameter search.
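The selectors listed above share the same fit/transform interface. The following is a minimal sketch on the iris data, assuming the ANOVA F-test (`f_classif`) as the score function; the choice of data set and score function is only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import (
    GenericUnivariateSelect,
    SelectKBest,
    SelectPercentile,
    f_classif,
)

X, y = load_iris(return_X_y=True)

# Keep the 2 highest-scoring features according to the ANOVA F-test.
X_kbest = SelectKBest(f_classif, k=2).fit_transform(X, y)

# Keep the top 50% of the features.
X_pct = SelectPercentile(f_classif, percentile=50).fit_transform(X, y)

# The same top-k strategy expressed through the generic selector.
X_generic = GenericUnivariateSelect(f_classif, mode="k_best", param=2).fit_transform(X, y)

print(X_kbest.shape, X_pct.shape, X_generic.shape)
```

Because `mode` and `param` are ordinary constructor parameters, GenericUnivariateSelect can be placed in a grid search to compare strategies.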
The methods based on the F-test estimate the degree of linear dependency between two random variables. Mutual information methods, on the other hand, can capture any kind of statistical dependency, but being nonparametric, they require more samples for accurate estimation.
Chi-square (chi2) test (serves much the same purpose as the IV value)
The classical chi-square test checks the association between a qualitative independent variable and a qualitative dependent variable (discrete versus discrete). For example, we can run a chi2 test on the samples to select the two best features:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)
#(150, 4)
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new.shape)
#(150, 2)
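To make the classical test concrete, here is a small sketch that computes the Pearson chi-square statistic, the sum of (observed - expected)^2 / expected over all cells, for a contingency table of a discrete feature against a discrete target. The function name and the counts are made up for illustration; this is the textbook statistic, not a reproduction of scikit-learn's internals.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the feature and the target were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Made-up counts: rows are feature categories, columns are (good, bad) customers.
independent = [[40, 10], [40, 10]]   # same bad rate in both categories
dependent = [[45, 5], [25, 25]]      # bad rate differs strongly between categories

print(chi_square(independent))
print(chi_square(dependent))
```

A larger statistic means the feature's categories differ more in their target distribution, which is why the feature scores higher.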
Pearson correlation coefficient (Pearson correlation)
The Pearson correlation coefficient is one of the simplest ways to understand the relationship between a feature and the response variable. It measures the linear correlation between two variables and takes values in [-1, 1]: -1 means perfect negative correlation, +1 perfect positive correlation, and 0 no linear correlation.
import numpy as np
from scipy.stats import pearsonr
size = 300
x = np.random.normal(0, 1, size)
# pearsonr(x, y) takes two 1-D arrays of equal length and returns the correlation coefficient and the p-value.
print("Lower noise", pearsonr(x, x + np.random.normal(0, 1, size)))
print("Higher noise", pearsonr(x, x + np.random.normal(0, 10, size)))
# This compares the same variable after adding little and much noise: with low noise the correlation is strong and the p-value is very low.
# The Pearson coefficient is used mainly to examine the correlation between features, not between a feature and the dependent variable.
# Output of one run (the values change with the random draw):
#Lower noise (0.7182483686213834, 7.324017313000586e-49)
#Higher noise (0.05796429207933808, 0.3170099388532581)
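Because the coefficient only measures linear association, a variable can depend completely on another and still score close to 0. The sketch below, assuming a grid of x values symmetric around zero, contrasts an exact linear relationship with a purely quadratic one.

```python
import numpy as np

x = np.linspace(-1, 1, 201)   # symmetric around 0

# Exact linear relationship: the coefficient is 1 up to rounding.
r_linear = np.corrcoef(x, 2 * x + 1)[0, 1]

# Deterministic but non-linear relationship: the coefficient is close to 0.
r_square = np.corrcoef(x, x ** 2)[0, 1]

print(r_linear)
print(r_square)
```

This is the case where mutual information or the maximal information coefficient is the better screening statistic.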
Recursive feature elimination (RFE)
Recursive feature elimination trains a base model for several rounds. After each round it removes the features with the smallest weight coefficients, then trains the next round on the reduced feature set.
Given a prediction model that assigns weights to features (for example, the coefficients of a linear model), RFE selects features by recursively shrinking the feature set under consideration. First the model is trained on the original features and each feature receives a weight. Then the features with the smallest absolute weights are dropped from the set. (This does not work after WOE encoding, because the weights of WOE-encoded variables do not indicate importance.) The procedure repeats until the number of remaining features reaches the required number.
RFECV runs RFE under cross-validation to choose the best number of features. A set of d features has 2^d - 1 non-empty subsets. Conceptually, one specifies an external learning algorithm such as an SVM, computes the validation error of candidate subsets, and keeps the subset with the smallest error as the selected features; RFECV evaluates only the nested subsets produced by the elimination order rather than all of them.
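from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier()
# Eliminate the least important feature each round until 3 features remain
rfe = RFE(estimator=rf, n_features_to_select=3)
X_rfe = rfe.fit_transform(X,y)
print(X_rfe.shape)
#(150, 3)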
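RFECV can be used in the same way when the number of features should be chosen by cross-validation instead of being fixed in advance. This is a minimal sketch on the iris data; the logistic regression base model, the 5 folds, and the accuracy scoring are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Linear base model: its coefficients provide the ranking used for elimination.
estimator = LogisticRegression(max_iter=1000)

# Drop one feature per round and score each feature count with 5-fold cross-validation.
rfecv = RFECV(estimator=estimator, step=1, cv=5, scoring="accuracy")
X_sel = rfecv.fit_transform(X, y)

print(rfecv.n_features_)   # number of features chosen by cross-validation
print(rfecv.support_)      # boolean mask of the selected features
print(X_sel.shape)
```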
Feature selection using SelectFromModel
L1-based feature selection
Linear models penalized with the L1 norm yield sparse solutions: the coefficients of most features are 0. When the goal is to reduce the dimensionality of the features for use with another classifier, feature_selection.SelectFromModel can be used to select the features whose coefficients are non-zero.
Sparse estimators commonly used for this purpose are linear_model.Lasso (regression), and linear_model.LogisticRegression and svm.LinearSVC (classification).
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
# X, y: the iris data loaded above
# dual=False solves the primal problem; LinearSVC requires it when penalty="l1".
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X,y)
# prefit=True tells SelectFromModel that the estimator is already fitted, so transform can be called directly;
# with the default prefit=False the selector has to be fitted first.
model = SelectFromModel(lsvc, prefit=True)
X_embed = model.transform(X)
print(X_embed.shape)
#(150, 3)
In actual business
First, let's review the problems we run into with models in production.
- The model performs poorly: most likely there is a problem with the data.
- The model performs well on the training set but poorly on the out-of-time test (the test sample is usually about 1/10 the size of the training data): the test data are distributed differently from the training data, which suggests that some selected variables fluctuate too much. Inspect and compare the fluctuating variables.
- The out-of-time test is also good, but performance is poor after going online: the offline and online variable logic differ, and the offline features may contain information from the future.
- Performance is good after going online, but the score distribution starts to drift downward after a few weeks: one or two variables perform poorly across time.
- The model is stable for a month or two, then the score distribution drops sharply: probably external factors, such as actions by the operations department or changes in national policy.
- No obvious problem, but the model gradually degrades month by month.
Next, consider what the business needs from variables.
- A variable must contribute to the model, that is, it must be able to separate customer groups.
- Logistic regression requires the variables to be linearly independent of each other.
- A logistic regression scorecard also expects each variable to show a monotonic trend (partly for business reasons; from a modeling point of view, a monotonic variable is not necessarily better than one with a turning point).
- The distribution of the customer base on each variable should be stable. Some distribution shift is inevitable, but it must not fluctuate too much.
From the methods above we therefore pick the ones that best fit this scenario.
import pandas as pd
import numpy as np
df_train = pd.read_csv('/Users/zhucan/Desktop/ Financial risk control practice / Lesson 3 materials /train.csv')
1) Variable importance
- IV value
- Chi-square test
- Model-based filtering
Here the IV value and model-based filtering are used more often.
The IV of a bin is in fact its WOE multiplied by an extra factor in front.
- $p_{y_i} = \frac{y_i}{y_T}$
- $p_{n_i} = \frac{n_i}{n_T}$
- $woe_i = \ln\left(\frac{p_{y_i}}{p_{n_i}}\right)$
- $iv_i = (p_{y_i} - p_{n_i}) \times woe_i$
Finally, summing the $iv_i$ over all bins gives the total IV:
$IV = \sum iv_i$
import math
# IV contribution of a single bin with p_y = 0.4 and p_n = 0.6
a = 0.4
b = 0.6
iv = (a - b) * math.log(a / b)
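The same calculation can be carried out for every bin and summed. The sketch below follows the WOE and IV definitions given above, starting from per-bin counts; the function name `woe_iv` and the counts are made up for illustration, and it assumes that no bin has a zero count (otherwise the logarithm is undefined).

```python
import math

def woe_iv(y_counts, n_counts):
    """Per-bin (woe_i, iv_i) pairs and the total IV from per-bin counts y_i and n_i."""
    y_total, n_total = sum(y_counts), sum(n_counts)
    rows = []
    for y_i, n_i in zip(y_counts, n_counts):
        p_y = y_i / y_total              # share of all y-samples that fall in this bin
        p_n = n_i / n_total              # share of all n-samples that fall in this bin
        woe = math.log(p_y / p_n)
        rows.append((woe, (p_y - p_n) * woe))
    total_iv = sum(iv for _, iv in rows)
    return rows, total_iv

# Made-up counts for three bins of one variable.
y_counts = [10, 30, 60]
n_counts = [300, 400, 300]

rows, total_iv = woe_iv(y_counts, n_counts)
for woe, iv in rows:
    print(round(woe, 4), round(iv, 4))
print(round(total_iv, 4))
```

Each term is non-negative because the difference and the logarithm always have the same sign, so the total grows as the two distributions move apart.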
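Binning, WOE, and IV
import numpy as np
import pandas as pd
from scipy import stats
def mono_bin(Y,X,n=20):
    # The source listing omits several steps. The binning loop and the d3 table below are a reconstruction (assumption): shrink the number of equal-frequency bins until the bin means of X and Y are perfectly rank-correlated.
    r = 0
    good = Y.sum()
    bad = Y.count()-good
    while np.abs(r)< 1:
        d1 = pd.DataFrame({'X': X, 'Y': Y, 'bucket': pd.qcut(X, n, duplicates='drop')})
        d2 = d1.groupby('bucket', observed=True)
        r, p = stats.spearmanr(d2['X'].mean(), d2['Y'].mean())
        n = n - 1
    d3 = pd.DataFrame({'min': d2['X'].min(), 'max': d2['X'].max(), 'rate': d2['Y'].mean()})
    # Per-bin IV term as defined above: (p_y - p_n) * ln(p_y / p_n)
    p_y, p_n = d2['Y'].sum()/good, (d2['Y'].count() - d2['Y'].sum())/bad
    d3['iv'] = (p_y - p_n) * np.log(p_y / p_n)
    d4 = d3.sort_values('min').reset_index(drop=True)
    return d4
mono_bin(df_train["label"],df_train["Age"],n = 20)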
Alternatively, use the feature importances output by an ensemble model:
# Feature importances in LightGBM (model is assumed to be an already fitted LightGBM model)
feature = pd.DataFrame({
    'name' : model.booster_.feature_name(),
    'importance' : model.feature_importances_
}).sort_values(by = ['importance'],ascending = False)
2) Collinearity
- Correlation coefficient (COR)
- Variance inflation factor (VIF)
For the many models built on the idea of partitioning the feature space, we must pay attention to the correlation between variables. To examine two variables on their own, we use the Pearson correlation coefficient.
import seaborn as sns
np.random.seed(sum(map(ord, "distributions")))
# Draw pairs of relationships in the dataset
sns.pairplot(df_train) # On the diagonal is a one-dimensional distribution
In multiple regression, the variance inflation factor (VIF) can be computed to test whether the regression model suffers from serious multicollinearity. Definition:
$VIF = \frac{1}{1 - R_i^2}$
where $R_i$ is the multiple correlation coefficient obtained by regressing the i-th independent variable on the other independent variables. The VIF is the reciprocal of the tolerance $1 - R_i^2$.
The larger the VIF, the more likely it is that the independent variables are collinear. In general, a VIF above 10 indicates serious multicollinearity in the regression model. By the collinearity diagnostic criteria of Hair (1995), a tolerance above 0.1, equivalently a VIF below 10, is acceptable and indicates no collinearity problem among the independent variables.
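The definition can be applied directly: regress each column on all the others, take the R² of that fit, and invert the tolerance. The following is a minimal NumPy sketch under that reading; the function name `vif` and the synthetic columns are assumptions for illustration, not the author's code.

```python
import numpy as np

def vif(X):
    """VIF of each column of X, from an OLS fit of that column on the other columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for i in range(k):
        target = X[:, i]
        # Design matrix: intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic data: c is almost a copy of a, b is unrelated to both.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)

values = vif(np.column_stack([a, b, c]))
print(values)
```

With the threshold of 10 quoted above, the two nearly identical columns would be flagged and one of them dropped, while the unrelated column stays close to 1.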
3) Monotonicity
- Bivar chart
# Equal-frequency binning
df_train.loc[:,'fare_qcut'] = pd.qcut(df_train['Fare'], 10)
df_train = df_train.sort_values('Fare')
alist = list(set(df_train['fare_qcut']))
badrate = {}
for x in alist:
    a = df_train[df_train.fare_qcut == x]
    bad = a[a.label == 1]['label'].count()
    good = a[a.label == 0]['label'].count()
    # Bad rate of the bin: share of samples with label == 1
    badrate[x] = bad/(bad+good)
f = zip(badrate.keys(),badrate.values())
f = sorted(f,key = lambda x : x[1],reverse = True )
badrate = pd.DataFrame(f)
badrate.columns = pd.Series(['cut','badrate'])
badrate = badrate.sort_values('cut')
print(badrate)
# cut badrate
#9 (-0.001, 7.55] 0.141304
#6 (7.55, 7.854] 0.298851
#8 (7.854, 8.05] 0.179245
#7 (8.05, 10.5] 0.230769
#3 (10.5, 14.454] 0.428571
#4 (14.454, 21.679] 0.420455
#2 (21.679, 27.0] 0.516854
#5 (27.0, 39.688] 0.373626
#1 (39.688, 77.958] 0.528090
#0 (77.958, 512.329] 0.758621
# Merge into three coarser bins and recompute the bad rate per bin
def binn(x):
    if x <10.5:
        return 0
    elif x <39.688:
        return 1
    return 2
df_train["fare_cut_new"] = df_train.Fare.map(binn)
df_train = df_train.sort_values('Fare')
alist = list(set(df_train['fare_cut_new']))
badrate = {}
for x in alist:
    a = df_train[df_train.fare_cut_new == x]
    bad = a[a.label == 1]['label'].count()
    good = a[a.label == 0]['label'].count()
    badrate[x] = bad/(bad+good)
f = zip(badrate.keys(),badrate.values())
f = sorted(f,key = lambda x : x[1],reverse = True )
badrate = pd.DataFrame(f)
badrate.columns = pd.Series(['cut','badrate'])
badrate = badrate.sort_values('cut')
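Whether a set of bins is monotonic can also be checked programmatically instead of by eye. The sketch below uses a hypothetical helper `is_monotonic`; the first list holds the ten equal-frequency bad rates printed earlier, in bin order, while the second list is made up to show a passing case.

```python
def is_monotonic(rates):
    """True if the sequence never decreases or never increases."""
    pairs = list(zip(rates, rates[1:]))
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

# Bad rates of the ten equal-frequency Fare bins, ordered by bin.
fine_bins = [0.141304, 0.298851, 0.179245, 0.230769, 0.428571,
             0.420455, 0.516854, 0.373626, 0.528090, 0.758621]

# Made-up bad rates for three coarse bins.
coarse_bins = [0.2, 0.4, 0.6]

print(is_monotonic(fine_bins))
print(is_monotonic(coarse_bins))
```

A failed check on fine bins is the usual signal to merge neighbouring bins and test again.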
4) Stability
- Cross-time cross-validation
Cross-time cross-validation
Split the sample by month, train one model for each split into training and test sets, and take the intersection of the variables that enter the models. Beware of collinear features!
(1) Use the first month as the test set and the other eleven months as the training set; train the model and output the variable importances.
(2) Use the second month as the test set and the other eleven months as the training set; train the model and output the variable importances.
...
(12) Use the last month as the test set and the other eleven months as the training set; train the model and output the variable importances.
(13) Take the intersection.
- A variable does not have to enter the model in every round; entering in most rounds is enough.
- Remove collinearity first (this is why collinearity is also removed for ensemble models).
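The final intersection step, including the relaxation to "most rounds", can be sketched as follows. The function `stable_features`, its `min_share` parameter, and the feature names are all hypothetical; the per-round sets stand in for whatever variables enter the model in each monthly round.

```python
from collections import Counter

def stable_features(selected_per_round, min_share=1.0):
    """Features selected in at least `min_share` of the rounds.

    min_share=1.0 is a strict intersection; a lower value keeps features
    that enter the model in most, but not all, rounds.
    """
    counts = Counter(f for selected in selected_per_round for f in set(selected))
    needed = min_share * len(selected_per_round)
    return sorted(f for f, c in counts.items() if c >= needed)

# Made-up results of three leave-one-month-out rounds.
rounds = [
    {"age", "income", "n_loans"},
    {"age", "income", "city"},
    {"age", "n_loans", "income"},
]

strict = stable_features(rounds)
relaxed = stable_features(rounds, min_share=0.6)
print(strict)
print(relaxed)
```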
Population stability index (PSI)
Formula:
$PSI = \sum (\text{actual proportion} - \text{expected proportion}) \times \ln\left(\frac{\text{actual proportion}}{\text{expected proportion}}\right)$
An example from Zhihu:
Suppose you train a logistic regression model; at prediction time it outputs a probability p.
Call the output on your test set p1, sort it in ascending order, and split it into 10 equal intervals, such as 0-0.1, 0.1-0.2, ….
Now use the model to predict new samples, call the result p2, and split it into 10 parts using the same intervals as p1.
The actual proportion is the share of users in each interval for p2; the expected proportion is the share of users in each interval for p1.
The idea is that if the model is stable, p1 and p2 should place similar shares of users in each interval, so the proportions, and hence the predicted probabilities, will not shift much.
It is generally held that a PSI below 0.1 indicates high stability, 0.1-0.25 moderate stability, and above 0.25 poor stability, in which case rebuilding the model is recommended.
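import math
def var_PSI(dev_data, val_data):
    # dev_data, val_data: per-bin counts for the development and validation samples
    dev_cnt, val_cnt = sum(dev_data), sum(val_data)
    if dev_cnt * val_cnt == 0:
        return None
    PSI = 0
    for i in range(len(dev_data)):
        # The small constant keeps an empty bin from causing log(0) or a division by zero
        dev_ratio = dev_data[i] / dev_cnt + 1e-10
        val_ratio = val_data[i] / val_cnt + 1e-10
        psi = (dev_ratio - val_ratio) * math.log(dev_ratio/val_ratio)
        PSI += psi
    return PSI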
Note that the number of bins affects a variable's PSI value.
PSI applies not only to models but also to variables: bin the variable and compute the PSI of its distribution across time periods.
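Putting the Zhihu example together, the sketch below goes from raw scores to a PSI value: it counts p1 and p2 in the same ten equal-width intervals and applies the formula. The function names, the `eps` guard, and the two score lists are assumptions for illustration.

```python
import math

def bin_counts(scores, n_bins=10):
    """Count scores in [0, 1] falling into n_bins equal-width intervals."""
    counts = [0] * n_bins
    for s in scores:
        # A score of exactly 1.0 is placed in the last interval.
        counts[min(int(s * n_bins), n_bins - 1)] += 1
    return counts

def psi(expected_counts, actual_counts, eps=1e-10):
    """PSI between two binned distributions; eps guards against empty bins."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_share = e / e_total + eps
        a_share = a / a_total + eps
        total += (a_share - e_share) * math.log(a_share / e_share)
    return total

# Made-up scores: p1 from the test set, p2 from newer samples shifted upwards.
p1 = [i / 100 for i in range(100)]
p2 = [min(s + 0.15, 0.99) for s in p1]

psi_same = psi(bin_counts(p1), bin_counts(p1))
psi_shift = psi(bin_counts(p1), bin_counts(p2))
print(psi_same)
print(psi_shift)
```

The same two functions work for a single variable: replace the scores with the variable's values from two time periods and use bin edges fixed on the earlier period.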
(Figure: calculating PSI in Excel)
(Figure: PSI criteria)