Deep Understanding of LightGBM
2022-06-10 19:45:00 [Xiaobai Learns Vision]
The main contents of this article are organized as follows.

1. A Brief Introduction to LightGBM
GBDT (Gradient Boosting Decision Tree) is an enduring model in machine learning. Its main idea is to iteratively train weak learners (decision trees) to obtain an optimal model; the resulting model trains well and is not prone to overfitting. GBDT is widely used in industry for tasks such as classification, click-through-rate prediction, and search ranking; it is also a lethal weapon in data mining competitions, and by some counts more than half of the winning solutions on Kaggle are based on GBDT. LightGBM (Light Gradient Boosting Machine) is a framework that implements the GBDT algorithm. It supports efficient parallel training and offers faster training speed, lower memory consumption, better accuracy, and distributed support for quickly processing massive data.
1.1 Motivation for LightGBM
Common machine learning algorithms, such as neural networks, can be trained in mini-batches, so the size of the training data is not limited by memory. GBDT, however, must traverse the entire training data multiple times in every iteration. If the whole training set is loaded into memory, the data size is limited by memory; if it is not, repeatedly reading and writing the training data costs a great deal of time. Faced with industrial-scale data, ordinary GBDT algorithms cannot meet this need.
LightGBM was proposed mainly to solve the problems GBDT encounters on massive data, so that GBDT can be used better and faster in industrial practice.
1.2 Drawbacks of XGBoost and LightGBM's Optimizations
(1) Drawbacks of XGBoost
Before LightGBM was proposed, the best-known GBDT tool was XGBoost, a decision tree algorithm based on pre-sorting. Its basic idea for building a tree is: first, pre-sort all features by their values; second, when traversing candidate split points, find the best split point on each feature at a cost of O(#data); finally, once the best split point of a feature is found, split the data into left and right child nodes.
Such a pre-sorting algorithm finds split points exactly, but its drawbacks are obvious. First, space consumption is large: the algorithm must keep the feature values of the data as well as the result of sorting each feature (for example, the sorted indices, kept so that split points can be found quickly), which consumes twice the memory of the training data itself. Second, the time cost is also large: a split gain has to be computed for every candidate split point traversed, which is expensive. Finally, it is unfriendly to the cache: after pre-sorting, features access gradients in a random-access pattern, and different features access them in different orders, so the accesses cannot be optimized for the cache. Likewise, when the tree grows level by level, an array mapping row indices to leaf indices must be accessed randomly, again in a different order for each feature, which also causes many cache misses.
(2) LightGBM's optimizations
To avoid the drawbacks of XGBoost above, and to speed up GBDT training without hurting accuracy, LightGBM makes the following optimizations on top of the traditional GBDT algorithm:
Histogram-based decision tree algorithm.
Gradient-based One-Side Sampling (GOSS): GOSS drops a large number of data instances with small gradients and uses only the remaining high-gradient data to compute the information gain; compared with XGBoost, which traverses all feature values, this saves a lot of time and space.
Exclusive Feature Bundling (EFB): EFB bundles many mutually exclusive features into a single feature, thereby achieving dimensionality reduction.
Leaf-wise growth strategy with depth limitation: most GBDT tools use an inefficient level-wise tree growth strategy, which treats leaves of the same level indiscriminately and brings much unnecessary overhead; in practice many leaves have low split gain and need not be searched or split at all. LightGBM uses a leaf-wise algorithm with a depth limit.
Direct support for categorical features.
Support for efficient parallelism.
Cache hit rate optimization.
The optimizations listed above are introduced in detail below.
2. Basic Principles of LightGBM
2.1 Histogram-Based Decision Tree Algorithm
(1) Histogram algorithm
The basic idea of the histogram algorithm is: first discretize the continuous floating-point feature values into k integers and construct a histogram of width k. While traversing the data, the discretized value of each sample is used as an index to accumulate statistics into the histogram; after one pass over the data, the histogram holds the required statistics, and the optimal split point can then be found by traversing the k discrete bins of the histogram.

Figure: Histogram algorithm
Put simply, the histogram algorithm works as follows: first decide how many bins each feature needs and assign an integer to each bin; then divide the range of the floating-point values into that many intervals and map each sample value to the bin it falls into; finally, represent the feature by its #bins histogram. It sounds sophisticated, but it is really just histogram statistics: a large amount of data is summarized in a histogram.
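The following is a minimal NumPy sketch of this idea, not LightGBM's actual implementation: a feature is discretized into k bins, gradient statistics are accumulated per bin in one pass, and the k bins are scanned once to pick a split (the function names, the equal-width binning, and the gain formula with regularization term lam are illustrative assumptions):
import numpy as np
def build_histogram(feature_values, gradients, hessians, num_bins=256):
    # Discretize continuous values into integer bin indices (equal-width bins for simplicity;
    # LightGBM's real binning is more careful).
    edges = np.linspace(feature_values.min(), feature_values.max(), num_bins + 1)
    bins = np.clip(np.searchsorted(edges, feature_values, side='right') - 1, 0, num_bins - 1)
    grad_hist = np.bincount(bins, weights=gradients, minlength=num_bins)
    hess_hist = np.bincount(bins, weights=hessians, minlength=num_bins)
    count_hist = np.bincount(bins, minlength=num_bins)
    return grad_hist, hess_hist, count_hist
def best_split_from_histogram(grad_hist, hess_hist, lam=1.0):
    # Scan the k bins once; candidate split points are the bin boundaries.
    total_g, total_h = grad_hist.sum(), hess_hist.sum()
    best_gain, best_bin = -np.inf, None
    g_left = h_left = 0.0
    for b in range(len(grad_hist) - 1):
        g_left += grad_hist[b]
        h_left += hess_hist[b]
        g_right, h_right = total_g - g_left, total_h - h_left
        # Second-order split gain with regularization lam (constant factors omitted).
        gain = g_left ** 2 / (h_left + lam) + g_right ** 2 / (h_right + lam) - total_g ** 2 / (total_h + lam)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain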
We know that discretizing features has many advantages, such as compact storage, faster computation, stronger robustness, and a more stable model. For the histogram algorithm, the two most direct advantages are:
Lower memory consumption: the histogram algorithm does not need to store the pre-sorting results and only keeps the discretized values of the features, and those values can generally be stored with 8-bit integers, cutting memory consumption to 1/8 of the original. In other words, XGBoost needs 32-bit floating-point numbers to store feature values plus 32-bit integers to store the sorted indices, while LightGBM only needs 8 bits per value to store the histogram bins, so memory is reduced to roughly 1/8;

Figure: Memory consumption reduced to 1/8 of the pre-sorting algorithm's
Lower computational cost: the pre-sorting algorithm in XGBoost computes a split gain every time it traverses a feature value, while the histogram algorithm in LightGBM only needs to compute it k times (k can be regarded as a constant), directly reducing the time complexity from O(#data × #feature) to O(k × #feature), where k << #data.
Of course, the histogram algorithm is not perfect. Because features are discretized, the split points it finds are not exact, which can affect the results. But experiments on various datasets show that discretized split points have little effect on the final accuracy, and are sometimes even better. The reasons are that a decision tree is a weak model to begin with, so whether a split point is exact matters little; coarse split points also act as a regularizer and can effectively prevent overfitting; and even if the training error of a single tree is slightly larger than with exact splitting, it has little impact under the gradient boosting framework.
(2) Histogram subtraction acceleration
Another LightGBM optimization is histogram subtraction. The histogram of a leaf can be obtained by subtracting the histogram of its sibling from the histogram of its parent, which doubles the speed. Normally, constructing a histogram requires traversing all of the data on the leaf, while histogram subtraction only requires traversing the histogram's k bins. When actually building the tree, LightGBM first computes the histogram of the leaf with fewer samples and then obtains the leaf with more samples by subtraction, so the sibling leaf's histogram is obtained at a very small cost.

Figure: Histogram subtraction
Note: to accelerate pre-sorting, XGBoost only considers non-zero values; LightGBM uses a similar strategy and builds its histograms from non-zero feature values only.
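Continuing the hypothetical sketch above, the subtraction trick itself is only a per-bin array operation:
def child_histograms_by_subtraction(parent_hists, small_child_hists):
    # parent_hists / small_child_hists: tuples (grad_hist, hess_hist, count_hist) built as in the sketch above.
    # The larger child's histograms are obtained per bin in O(k), instead of re-scanning its samples in O(#data).
    return tuple(p - s for p, s in zip(parent_hists, small_child_hists))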
2.2 Leaf-wise Algorithm with Depth Limitation
On top of the histogram algorithm, LightGBM optimizes further. It abandons the level-wise tree growth strategy used by most GBDT tools and instead adopts a leaf-wise growth algorithm with a depth limit.
XGBoost uses the level-wise growth strategy, which can split all leaves of the same level in a single pass over the data; it is easy to parallelize with multiple threads, controls model complexity, and is not prone to overfitting. But level-wise is in fact inefficient, because it treats leaves of the same level indiscriminately: many leaves have low split gain and do not need to be searched or split, so a lot of unnecessary computation is incurred.

Figure: Level-wise tree growth
LightGBM uses the leaf-wise growth strategy: among all current leaves, find the one with the largest split gain, split it, and repeat. Compared with level-wise, the advantage of leaf-wise is that, with the same number of splits, it reduces the error more and achieves better accuracy; its disadvantage is that it may grow rather deep trees and overfit. LightGBM therefore adds a maximum-depth limit on top of leaf-wise growth, preventing overfitting while keeping the efficiency.

Figure: Leaf-wise tree growth
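In the library these two knobs are exposed directly; a minimal configuration sketch (num_leaves and max_depth are real LightGBM parameters, the particular values and the train_set variable are illustrative):
import lightgbm as lgb
params = {
    'objective': 'regression',
    'num_leaves': 31,  # number of leaves grown leaf-wise, the main complexity control
    'max_depth': 7,    # cap the tree depth so leaf-wise growth does not overfit
}
# booster = lgb.train(params, train_set)  # train_set is assumed to be an existing lgb.Dataset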
2.3 Gradient-based One-Side Sampling (GOSS)
GOSS (Gradient-based One-Side Sampling) approaches the problem from the angle of reducing the number of samples: it excludes most samples with small gradients and uses only the remaining samples to compute the information gain; it is an algorithm that balances data reduction against accuracy.
In AdaBoost, the sample weight indicates how important each data point is. In GBDT there are no native sample weights, so weighted sampling cannot be applied directly. Fortunately, every data point in GBDT has its own gradient, which is very useful for sampling: a sample with a small gradient has a small training error, meaning it has already been learned well by the model. The direct idea is to discard this small-gradient data, but doing so would change the data distribution and hurt the accuracy of the trained model; GOSS was proposed to avoid exactly this problem.
GOSS is a sampling algorithm whose purpose is to discard samples that contribute little to the information gain and keep the useful ones. By the definition of the information gain, samples with large gradients have a larger influence on it, so GOSS keeps the large-gradient data when sampling; but simply dropping all the small-gradient data would distort the overall data distribution. GOSS therefore first sorts the samples by the absolute value of their gradients in descending order (XGBoost also sorts, but LightGBM does not need to keep the sorted result) and selects the top a × 100% of samples with the largest absolute gradients; it then randomly samples b × 100% of the remaining small-gradient data and multiplies these sampled small-gradient samples by the constant (1 - a)/b when computing the gain. In this way the algorithm pays more attention to under-trained samples without changing the original data distribution too much. Finally, the selected (a + b) × 100% of the data are used to compute the information gain. The figure below shows the GOSS algorithm.

Figure: Gradient-based one-side sampling (GOSS)
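A minimal NumPy sketch of the sampling step (the function name, default rates, and random generator are illustrative assumptions, not LightGBM's actual implementation):
import numpy as np
def goss_sample(gradients, a=0.2, b=0.1, rng=np.random.default_rng(0)):
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))                       # sort by |gradient|, descending
    top = order[:int(a * n)]                                     # keep the top a*100% large-gradient samples
    rest = order[int(a * n):]
    sampled = rng.choice(rest, size=int(b * n), replace=False)   # randomly sample b*100% of the rest
    weights = np.ones(n)
    weights[sampled] *= (1 - a) / b                              # re-weight the small-gradient samples
    used = np.concatenate([top, sampled])
    return used, weights[used]
In the library itself, GOSS is exposed as a sampling strategy (historically boosting='goss', in recent versions data_sample_strategy='goss'), with top_rate and other_rate playing roughly the roles of a and b above.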
2.4 Exclusive Feature Bundling (EFB)
High-dimensional data is often sparse, and this sparsity inspires an almost lossless way of reducing the feature dimension. Typically the bundled features are mutually exclusive (they never take non-zero values at the same time, like one-hot features), so bundling two such features loses no information. If two features are not completely mutually exclusive (occasionally both are non-zero), an indicator called the conflict ratio measures the degree of non-exclusivity; when this value is small, we can still bundle the two features without noticeably affecting the final accuracy. The Exclusive Feature Bundling (EFB) algorithm points out that bundling certain features reduces the number of features, which cuts the time complexity of histogram construction from O(#data × #feature) to O(#data × #bundle), where #bundle is the number of feature bundles after merging and #bundle is far smaller than #feature.
This idea raises two questions:
Which features should be bundled together?
How should the features in a bundle be merged into one?
(1) Determining which features to bundle
Deciding which features to bundle is an NP-hard problem. LightGBM's EFB algorithm turns it into a graph coloring problem: every feature is a vertex of a graph, an edge connects each pair of features that are not mutually exclusive, and the edge weight is the total conflict between the two connected features; the features to bundle are then the vertices that would be painted the same color in graph coloring. In addition, many features, although not 100% mutually exclusive, rarely take non-zero values at the same time. If the algorithm tolerates a small fraction of conflicts, we obtain even fewer feature bundles and further improve computational efficiency. A simple analysis shows that randomly polluting a small fraction of feature values affects the training accuracy by at most O([(1 - γ)n]^(-2/3)), where γ is the maximum conflict ratio within each bundle; when γ is relatively small, a good balance between accuracy and efficiency is achieved. The specific steps can be summarized as follows:
Construct a weighted undirected graph whose vertices are the features; the weight of each edge reflects the conflict between the two features it connects;
Sort the features in descending order of their degree in the graph: the larger the degree, the greater the conflict with other features;
Traverse each feature and assign it to an existing bundle, or create a new bundle, so that the overall conflict is minimized.
The algorithm tolerates feature pairs that are not completely exclusive in order to increase the number of features per bundle, and the maximum conflict ratio γ balances accuracy against efficiency. The pseudo-code of the EFB bundling algorithm is as follows:

Figure: Greedy bundling algorithm
The time complexity of Algorithm 3 is O(#feature²), and it only runs once before training, which is acceptable when the number of features is not too large but hard to bear with millions of features. To improve efficiency further, LightGBM proposes a more efficient ordering strategy that does not build a graph: sort the features by their number of non-zero values. This is similar to sorting by node degree, since more non-zero values usually mean more conflicts; the new algorithm only changes the ordering strategy of Algorithm 3.
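A simplified Python sketch of the graph-free greedy bundling; the helper name and the way conflicts are counted (two features both non-zero on the same row) are assumptions for illustration, and LightGBM's real implementation is more refined:
import numpy as np
def greedy_bundle(X, max_conflict=100):
    nonzero = X != 0
    order = np.argsort(-nonzero.sum(axis=0))             # features sorted by non-zero count, descending
    bundles, bundle_masks = [], []                        # masks mark rows already occupied by a bundle
    for f in order:
        placed = False
        for b, mask in enumerate(bundle_masks):
            conflict = np.logical_and(mask, nonzero[:, f]).sum()
            if conflict <= max_conflict:                  # tolerate a small number of conflicts
                bundles[b].append(f)
                bundle_masks[b] = np.logical_or(mask, nonzero[:, f])
                placed = True
                break
        if not placed:
            bundles.append([f])
            bundle_masks.append(nonzero[:, f].copy())
    return bundles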
(2) Merging the features in a bundle
The key to the feature merging algorithm is that the original features must remain separable inside the merged feature: the value a feature had before bundling must still be identifiable within the bundle. Since the histogram-based algorithm stores discrete bins rather than continuous values, we can let different original features occupy different bins of the bundle, which is done by adding an offset constant to the feature values. For example, suppose a bundle contains two features A and B, where A originally takes values in [0, 10) and B in [0, 20). We add an offset of 10 to feature B so that its values fall in [10, 30); the bundled feature then ranges over [0, 30), and features A and B can be merged without ambiguity. The specific feature merging algorithm is as follows:

Figure: Feature merging algorithm
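A tiny sketch of the offset idea for one bundle of (mostly) exclusive features; the helper name and the bin_ranges input are assumptions for illustration:
import numpy as np
def merge_bundle(X, bundle, bin_ranges):
    # bundle: list of feature column indices; bin_ranges[f]: size of the value range feature f occupies
    merged = np.zeros(X.shape[0])
    offset = 0.0
    for f in bundle:
        nz = X[:, f] != 0
        merged[nz] = X[nz, f] + offset   # shift feature f into its own slice of the bundle's range
        offset += bin_ranges[f]
    return merged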
3. Engineering Optimizations in LightGBM
In this section we present, as LightGBM's engineering optimizations, the improvements that are not covered in the paper "LightGBM: A Highly Efficient Gradient Boosting Decision Tree", together with the scheme described in the related paper "A Communication-Efficient Parallel Algorithm for Decision Tree".
3.1 Direct Support for Categorical Features
Most machine learning tools cannot handle categorical features directly; the usual practice is to one-hot encode them into multiple dimensions, which reduces space and time efficiency. For decision trees, however, one-hot encoding is not recommended, especially when a categorical feature has many categories, for the following reasons:
It produces imbalanced splits, whose split gain is very small (so the feature is effectively wasted). With one-hot encoding only one-vs-rest splits are possible (for example, is-it-a-dog, is-it-a-cat, and so on). After encoding an animal-category feature we get a series of features such as is-dog and is-cat, each with only a few positive samples and a large number of negative ones; any split on such a feature is imbalanced, so its gain is small. The smaller child of the split is such a tiny fraction of the total samples that its gain is almost negligible once multiplied by that fraction; the larger child is almost the original sample set, so its gain is close to zero. Intuitively, an extremely imbalanced split is hardly different from not splitting at all.
It hurts the learning of the decision tree. Even when a split can be made on such an encoded feature, one-hot encoding scatters the data into many small fragments of the feature space, as shown on the left of the figure below. Decision tree learning relies on statistics, and statistics computed on these small fragments are inaccurate, so learning degrades. If instead the split on the right of the figure is used, the data is divided into two larger subsets and subsequent learning is better. The leaf on the right of the figure means that samples whose category falls in one subset go to the left child and the rest go to the right child.

Figure: Left, a split based on one-hot encoding; right, LightGBM's many-vs-many split
Categorical features are very common in practice, and to overcome the shortcomings of one-hot encoding, LightGBM provides optimized support for them: categorical features can be fed in directly, with no extra expansion. LightGBM splits the categories into two subsets in a many-vs-many fashion to achieve an optimal partition of the categorical feature. For a feature with k categories there are 2^(k-1) - 1 possible partitions, i.e. O(2^k) time; based on Fisher's paper "On Grouping for Maximum Homogeneity", LightGBM finds the optimal partition in O(k log k) time.
The algorithm flow is shown in the figure below: before enumerating split points, the histogram is first sorted by the mean label value of each category; the optimal split point is then enumerated over the sorted result (in the figure, Sum(y)/Count(y) is the mean of each category). Of course, this method overfits easily, so LightGBM adds many constraints and regularization terms on top of it.

Figure: LightGBM's optimal split algorithm for categorical features
Experiments on the Expo dataset show that, compared with one-hot expansion, LightGBM's native categorical-feature support speeds up training by about 8x at the same accuracy. What's more, LightGBM was the first GBDT tool to support categorical features directly.
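Usage-wise, a categorical column only needs to be declared through the real categorical_feature parameter (or passed as a pandas category dtype); the toy column names and data below are made up for illustration:
import lightgbm as lgb
import pandas as pd
df = pd.DataFrame({'city': ['a', 'b', 'a', 'c', 'b', 'a'], 'x': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
df['city'] = df['city'].astype('category')   # a pandas category dtype is also picked up automatically
y = [0, 1, 0, 1, 1, 0]
train_set = lgb.Dataset(df, label=y, categorical_feature=['city'])
booster = lgb.train({'objective': 'binary', 'verbose': -1}, train_set, num_boost_round=5)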
3.2 Support for Efficient Parallelism
(1) Feature parallelism
The main idea of feature parallelism is to let different machines find the optimal split point on different subsets of features and then synchronize the optimal split among the machines. XGBoost uses this kind of feature parallelism. It has a major drawback: the data is partitioned vertically, so each machine holds different data; after different machines find the optimal splits of their features, the resulting partition of samples has to be communicated to every machine, which adds extra cost.
LightGBM does not partition the data vertically; every machine stores all the data. Once the best split scheme is obtained, each machine can perform the split locally, which removes the unnecessary communication. The procedure is shown in the figure below.

Figure: Feature parallelism
(2) Data parallelism
The traditional data-parallel strategy partitions the data horizontally, lets each machine build histograms locally, then merges them into a global histogram, and finally finds the optimal split point on the merged histogram. The big drawback of this scheme is the communication overhead: with point-to-point communication, the cost per machine is roughly O(#machine × #feature × #bin). LightGBM instead uses Reduce Scatter in data parallelism to spread the histogram-merging work across different machines, lowering both communication and computation, and it further cuts the traffic roughly in half by using histogram subtraction. The procedure is shown in the figure below.

Figure: Data parallelism
(3) Voting parallel
Voting-based data parallelism further optimizes the communication cost of data parallelism, making it close to constant. When the amount of data is very large, voting parallel reduces traffic by merging the histograms of only some of the features, which yields very good speed-ups. The procedure is shown in the figure below.
Roughly, there are two steps:
Each machine finds its local top-k features and uses the votes to screen out the features likely to contain the optimal split point;
When merging, only the histograms of the features selected by the machines are merged.

Figure: Voting parallel
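A rough sketch of the voting step, assuming each machine has already computed a local split gain per feature; the array names and the 2k selection rule are only an illustration of the idea, not LightGBM's exact protocol:
import numpy as np
def vote_for_features(local_gains, k):
    # local_gains: (n_machines, n_features) array of hypothetical local split gains
    votes = np.zeros(local_gains.shape[1], dtype=int)
    for gains in local_gains:
        votes[np.argsort(gains)[-k:]] += 1    # each machine votes for its local top-k features
    # keep the 2k features with the most votes; only their histograms are merged globally
    return np.argsort(votes)[-2 * k:]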
3.3 Cache Hit Rate Optimization
XGBoost is not cache-friendly, as shown in the figure below. After pre-sorting, features access gradients in a random-access pattern, and different features access them in different orders, so the accesses cannot be optimized for the cache. Likewise, when the tree grows level by level, an array mapping row indices to leaf indices has to be accessed randomly, again in a different order for each feature, which also causes many cache misses. To address the low cache hit rate, XGBoost introduced a cache-aware access algorithm.

Figure: Random access causes cache misses
LightGBM's histogram algorithm, by contrast, is inherently cache-friendly:
First, all features obtain the gradients in the same way (unlike XGBoost, where different features access the gradients through different indices), so the gradients only need to be ordered once to allow sequential access, which greatly improves the cache hit rate;
Second, because the array mapping row indices to leaf indices no longer needs to be stored, storage is reduced and that source of cache misses disappears.

Figure: LightGBM improves the cache hit rate
4. Advantages and Disadvantages of LightGBM
4.1 Advantages
This part summarizes LightGBM's advantages over XGBoost in terms of speed and memory.
(1) Faster training speed
LightGBM uses the histogram algorithm, turning a traversal of samples into a traversal of histogram bins, which greatly reduces the time complexity;
LightGBM uses one-side gradient sampling during training to filter out samples with small gradients, cutting a large amount of computation;
LightGBM grows trees with the leaf-wise strategy, avoiding much unnecessary computation;
LightGBM accelerates computation with optimized feature-parallel and data-parallel methods, and can switch to voting parallel when the data volume is very large;
LightGBM also optimizes the cache, increasing the cache hit rate.
(2) Lower memory consumption
After pre-sorting, XGBoost has to record the feature values and the indices linking them to the samples' statistics, whereas LightGBM's histogram algorithm converts feature values into bin values and needs no feature-to-sample index, lowering the space complexity from O(2 × #data) to O(#bin) and greatly reducing memory consumption;
LightGBM's histogram algorithm stores bin values instead of raw feature values, reducing memory consumption;
LightGBM applies exclusive feature bundling during training to reduce the number of features, reducing memory consumption.
4.2 Disadvantages
Leaf-wise growth may produce rather deep decision trees and overfit; LightGBM therefore adds a maximum-depth limit on top of leaf-wise growth, preventing overfitting while keeping the efficiency;
Boosting methods are iterative: each iteration adjusts according to the prediction results of the previous one, so as the iterations proceed the error keeps shrinking and the model's bias keeps decreasing. Because LightGBM is such a bias-reduction algorithm, it is relatively sensitive to noise;
When searching for the optimal split, LightGBM splits on the single best variable at each step and does not consider that the optimal solution might be a combination of all the features.
5. LightGBM Examples
All the datasets and code in this article are in my GitHub repository: https://github.com/Microstrong0305/WeChat-zhihu-csdnblog-code/tree/master/Ensemble%20Learning/LightGBM
5.1 Installing the LightGBM Package
pip install lightgbm
5.2 LightGBM for Classification and Regression
LightGBM offers two broad interfaces, the native LightGBM interface and the scikit-learn interface, and it supports both classification and regression tasks.
(1) Classification with the native LightGBM interface
import lightgbm as lgb
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score
# Load data
iris = datasets.load_iris()
# Divide the training set and the test set
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)
# Convert to Dataset data format
train_data = lgb.Dataset(X_train, label=y_train)
validation_data = lgb.Dataset(X_test, label=y_test)
# Parameters
params = {
'learning_rate': 0.1,
'lambda_l1': 0.1,
'lambda_l2': 0.2,
'max_depth': 4,
'objective': 'multiclass', # Objective function
'num_class': 3,
}
# model training
gbm = lgb.train(params, train_data, valid_sets=[validation_data])
# Model to predict
y_pred = gbm.predict(X_test)
y_pred = [list(x).index(max(x)) for x in y_pred]
print(y_pred)
# Model to evaluate
print(accuracy_score(y_test, y_pred))
(2) Classification with the scikit-learn interface
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import joblib  # sklearn.externals.joblib was removed in newer scikit-learn; joblib is now a standalone package
# Load data
iris = load_iris()
data = iris.data
target = iris.target
# Divide training data and test data
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)
# model training
gbm = LGBMClassifier(num_leaves=31, learning_rate=0.05, n_estimators=20)
# Note: in LightGBM >= 4.0, early_stopping_rounds was removed from fit(); use callbacks=[lightgbm.early_stopping(5)] instead
gbm.fit(X_train, y_train, eval_set=[(X_test, y_test)], early_stopping_rounds=5)
# Model storage
joblib.dump(gbm, 'loan_model.pkl')
# Model loading
gbm = joblib.load('loan_model.pkl')
# Model to predict
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)
# Model to evaluate
print('The accuracy of prediction is:', accuracy_score(y_test, y_pred))
# Importance of characteristics
print('Feature importances:', list(gbm.feature_importances_))
# The grid search , Parameter optimization
estimator = LGBMClassifier(num_leaves=31)
param_grid = {
'learning_rate': [0.01, 0.1, 1],
'n_estimators': [20, 40]
}
gbm = GridSearchCV(estimator, param_grid)
gbm.fit(X_train, y_train)
print('Best parameters found by grid search are:', gbm.best_params_)
(3) Regression with the native LightGBM interface
For regression with LightGBM, we use the Kaggle competition House Prices: Advanced Regression Techniques (https://www.kaggle.com/c/house-prices-advanced-regression-techniques) as an example.
The house-price training set has 81 columns in total: the first column is Id, the last column is the label (SalePrice), and the 79 columns in between are features, a mix of categorical, integer, and floating-point variables. The training set also contains missing values.
import pandas as pd
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from sklearn.metrics import mean_absolute_error
from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing in newer scikit-learn
# 1. Read the file
data = pd.read_csv('./dataset/train.csv')
# 2. Split the data into input features and the target variable to predict
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])
# 3. Split into training and test sets with a 75 : 25 ratio
train_X, test_X, train_y, test_y = train_test_split(X.values, y.values, test_size=0.25)
# 4. Handle missing values; the default strategy fills with the mean of each feature column
my_imputer = SimpleImputer()
train_X = my_imputer.fit_transform(train_X)
test_X = my_imputer.transform(test_X)
# 5. Convert to Dataset data format
lgb_train = lgb.Dataset(train_X, train_y)
lgb_eval = lgb.Dataset(test_X, test_y, reference=lgb_train)
# 6. Parameters
params = {
'task': 'train',
'boosting_type': 'gbdt', # Set promotion type
'objective': 'regression', # Objective function
'metric': {'l2', 'auc'}, # Evaluation function
'num_leaves': 31, # Number of leaf nodes
'learning_rate': 0.05, # Learning rate
'feature_fraction': 0.9, # fraction of features used when building each tree
'bagging_fraction': 0.8, # fraction of samples used when building each tree
'bagging_freq': 5, # perform bagging every k iterations (here k = 5)
'verbose': 1 # <0: fatal only, =0: errors (and warnings), >0: info
}
# 7. Call the LightGBM model and fit it on the training data
# Add verbosity=2 to print messages while running boosting
# Note: in LightGBM >= 4.0, early_stopping_rounds was removed from train(); use callbacks=[lgb.early_stopping(5)] instead
my_model = lgb.train(params, lgb_train, num_boost_round=20, valid_sets=lgb_eval, early_stopping_rounds=5)
# 8. Use the model to predict on the test set
predictions = my_model.predict(test_X, num_iteration=my_model.best_iteration)
# 9. Evaluate the predictions (mean absolute error)
print("Mean Absolute Error : " + str(mean_absolute_error(predictions, test_y)))
(4) Regression with the scikit-learn interface
import pandas as pd
from sklearn.model_selection import train_test_split
import lightgbm as lgb
from sklearn.metrics import mean_absolute_error
from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing in newer scikit-learn
# 1. Read the file
data = pd.read_csv('./dataset/train.csv')
# 2. Split the data into input features and the target variable to predict
y = data.SalePrice
X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])
# 3. Split into training and test sets with a 75 : 25 ratio
train_X, test_X, train_y, test_y = train_test_split(X.values, y.values, test_size=0.25)
# 4. Handle missing values; the default strategy fills with the mean of each feature column
my_imputer = SimpleImputer()
train_X = my_imputer.fit_transform(train_X)
test_X = my_imputer.transform(test_X)
# 5. Call the LightGBM model and fit it on the training data
# Add verbosity=2 to print messages while running boosting
my_model = lgb.LGBMRegressor(objective='regression', num_leaves=31, learning_rate=0.05, n_estimators=20,
verbosity=2)
my_model.fit(train_X, train_y, verbose=False)
# 6. Use the model to predict on the test set
predictions = my_model.predict(test_X)
# 7. Evaluate the predictions (mean absolute error)
print("Mean Absolute Error : " + str(mean_absolute_error(predictions, test_y)))
5.3 LightGBM Parameter Tuning
In the previous part, only a few of the LightGBM model's parameters were set and most were left at their defaults, but the defaults are not the best choice. To get better performance out of LightGBM, its parameters need to be tuned. The figure below shows the parameters for the regression model; the parameters to tune for the classification model are similar.

Figure: Parameter tuning for the LightGBM regression model
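As a hedged illustration, the parameter names below are real LightGBM scikit-learn parameters, but the search ranges are only examples and not recommendations from the original article; a small grid search over commonly tuned knobs might look like this:
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV
param_grid = {
    'num_leaves': [15, 31, 63],          # tree complexity under leaf-wise growth
    'max_depth': [-1, 5, 7],             # -1 means no depth limit
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 300],
    'min_child_samples': [10, 20, 50],   # minimum data per leaf, guards against overfitting
    'colsample_bytree': [0.8, 1.0],      # feature subsampling per tree (alias of feature_fraction)
    'reg_lambda': [0.0, 0.1, 0.5],       # L2 regularization
}
search = GridSearchCV(LGBMRegressor(objective='regression'), param_grid, cv=3, scoring='neg_mean_absolute_error')
# search.fit(train_X, train_y)   # train_X / train_y as prepared in the regression example above
# print(search.best_params_)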
6. Thoughts on Some Questions About LightGBM
6.1 What are the connections and differences between LightGBM and XGBoost?
(1) LightGBM is based on the histogram decision tree algorithm, unlike the exact greedy and approximate algorithms in XGBoost; the histogram algorithm has advantages in both memory and computation. 1) Memory advantage: the histogram algorithm only has to keep the discretized bin value of each feature, roughly #data × #feature × 1 byte, whereas XGBoost's pre-sorted greedy algorithm stores both the original feature values and their sorted indices, roughly 2 × #data × #feature × 4 bytes, because it stores feature values as 32-bit floating-point numbers. 2) Computation advantage: when computing the split gain for a chosen feature, the pre-sorting algorithm traverses the feature values of all samples, taking O(#data), while the histogram algorithm only traverses the bins, taking O(#bin). (2) XGBoost uses a level-wise splitting strategy, while LightGBM uses leaf-wise. The difference is that XGBoost splits all nodes of a level indiscriminately; some nodes may have very small gain and little effect on the result, yet XGBoost splits them anyway, which brings unnecessary overhead. Leaf-wise instead picks, among all current leaves, the one with the largest split gain and recurses. Leaf-wise clearly overfits more easily, because it tends to grow to a greater depth, so the maximum depth has to be limited to avoid overfitting.
(3) XGBoost builds its histograms dynamically at every level, because XGBoost's histogram is not built per specific feature; all features share one histogram (each sample's weight is its second-order derivative), so the histogram has to be rebuilt at every level. LightGBM keeps one histogram per feature, so building the histograms once is enough.
(4) LightGBM uses histogram subtraction for acceleration: a child node's histogram can be obtained by subtracting its sibling's histogram from the parent's, speeding up the computation.
(5) LightGBM supports categorical features natively, with no need for one-hot encoding.
(6) LightGBM optimizes the feature-parallel and data-parallel algorithms and adds a voting-parallel scheme on top of them.
(7) LightGBM uses gradient-based one-side sampling to reduce the number of training samples while keeping the data distribution essentially unchanged, limiting the accuracy loss that a shift in distribution would cause.
(8) Feature bundling is cast as a graph coloring problem to reduce the number of features.