当前位置：网站首页>Delivery lead time lightweight estimation practice - Notes

Delivery lead time lightweight estimation practice - Notes

2022-06-12 18:40:00 【Tyan】

The authors ：Tyan Blog ：noahsnail.com | CSDN | Simple books

This article is a collection of notes for meituan's article study .

1. background

In the process of ordering takeout , Delivery time for riders , That is, how long can the rider reach the user's hands after getting off the vehicle near the user , It's a very important link . The following figure shows the time structure of an order in the whole distribution link , The rightmost part of the timeline describes the location of the delivery link in the whole delivery link . Delivery time measures the difficulty of delivery when the rider delivers the meal , Including from the rider to the user's building , The whole time to deliver the food to the user .

The measurement of delivery time is a very challenging thing , Because riders will encounter different problems when delivering meals to users , for example ： Riders deliver meals to multiple users in the building at one time , It's particularly difficult for riders to address specific buildings , Riders can only walk around the delivery building , There is no elevator in the old community , Office buildings can't go upstairs , Or it's hard to wait for the elevator and so on . Delivery time estimation needs the ability to describe the difficulty of delivery , In pricing 、 Scheduling is widely used in many scenarios . For example, according to the delivery difficulty to determine whether to adjust the rider postage , Determine whether to adjust the order of the delivery waybill according to the delivery difficulty , So as to avoid overtime and so on . in general , Delivery time estimation is an important part of basic service of distribution business .

Note: Estimated order delivery difficulty , Adjust the rider's salary according to the difficulty , There are high requirements for the accuracy of the delivery difficulty prediction model , Because it directly involves the income of riders , It also involves the operating costs of the company .

The following difficulties exist in the delivery time estimation ：

Less input , And most of them are non numerical data , At present, only the following dimensional features can be used for prediction ： Delivery address 、 Longitude and latitude of delivery point 、 Area 、 City , Adapting to conventional machine learning model needs to be reorganized and information is easily lost .
Computing performance is very demanding . Because it's basic services , Will be called by a large number of services , High performance required , It includes data processing and RPC Time for . And the standard is CPU Performance requirements in the environment , Instead of GPU Performance requirements under .

Sum up , Delivery time estimation problem , It's about using lightweight solutions to handle non numeric data in many forms , And extract the effective information , Get relatively accurate results . Under the same effect , Prefer better performance solutions .

Note: Time prediction requires high performance , Prefer better performance solutions .

The delivery time estimate iterates over three versions , They are tree models based on address structure 、 Vector recall scheme and lightweight End-to-End Deep learning network . At the same time, it introduces how to choose between performance and index , And the intermediate process of model strategy iteration .

2. Technology iteration path

2.1 Tree model

Technology selection The first and easiest thing to consider is the use of rules , The core idea is to use tree structure to measure address similarity , Accumulate structured data on similar delivery addresses as much as possible , Then use the local regression strategy , To get a relatively abundant regression logic , And those who fail to meet the requirements of the return strategy go for the bottom .

Observe the hierarchy of the address filled in by the user and the actual address , It's not hard to find out , An address can be made up of four levels of structure ： Address backbone （addr）、 Building number （building）、 Unit no. （unit）、 floor （floor）. In practice, the address trunk word may correspond to the name of a residential area or a school . Through analysis , The actual delivery time is positively correlated with the floor height , And the delivery time of different delivery addresses varies with the increase of floors , So we can use linear regression model to fit the relationship between floor information and delivery time , And the address trunk 、 Building number 、 The unit number serves as its hierarchical index . However, the address filled in by the user does not necessarily contain the complete four level structure , There will be a certain percentage of missing , So use this hierarchy to build a tree , Then make full use of the information known in the previous layer to estimate . When predicting , Just find the corresponding model according to the branch of the node , If missing , Use the previous structure to predict . For the address that does not meet the data amount required by the training model , Use the average delivery time of its region as the estimated result of the delivery time , This part can also be regarded as regional information , As the root node of the tree structure .

Note: Estimate the delivery time from the user's point of view , Cluster users with similar addresses .

Iterative path The whole idea is based on discrete feature training tree model , On the nodes of the tree, the linear regression model is trained based on the floor . Tree node training splitting rules ：（1） The amount of data is greater than the threshold ;（2） After the split MAE（ Mean absolute error ） The sum of is less than before the split . Considering the timeliness of data , Using weighted linear regression to increase the weight of recent data .

2.2 Tree model + Vector recall scheme

Technology selection Vector recall is one of the mainstream recall schemes , Widely used in the industry , In the use of LSH、PQ Product quantization and other common open source tools , High dimensional vector recall performance is usually in milliseconds .

And algorithmically , Tree model NLP The only way that the address resolution results can meet the requirements of the model is 70%+, The remaining 20%+ The address of can't be trained by the model and can only go through the degradation strategy . Using high-dimensional vector to express semantic similarity , That is, using vectors to express address similarity , Therefore, the model corresponding to similar data is used to replace similar but not recalled data , Carry on the address trunk Embedding after , Get rid of the low robustness of complete matching of main words .

Iterative path The whole technology path is clear and simple , It is using Word2Vec take charLevel The characters go on Embedding, Get the vector representation of the address , And blend in GPS Location information , Design the corresponding strategy .

Final effect

The recall rate of the overall strategy has been greatly improved , Promoted 12.20pp, For addresses that are not recalled by the previous version of the tree model , There has been a significant improvement in indicators , among ME falling 87.14s,MAE falling 38.13s,1min The absolute deviation rate decreases 14.01pp,2min The absolute deviation rate decreases 18.45pp,3min The absolute deviation rate decreases 15.90pp.

2.3 End-to-End Lightweight deep learning program

Technology selection On the basis of the tree model , Iterate to the vector recall scheme , The recall rate of the whole model has increased greatly , But still not 100%. Analysis found , The obstacle to a higher recall rate is NLP For address resolution coverage .

The starting point of the whole plan ： Considering the complexity of the model , Also just using address information , In the ascension model VC On the basis of Wei , Using other model schemes can at least balance the effect of tree model , If other information can be integrated on this basis , So for the baseline of the original model , There will be further improvements .

Consider not only the need to use address data , At the same time, we need to use GPS data 、 A lot of ID Class Embedding, For all kinds of non numerical types of processing flexibility considerations , Adopt a deep learning program , To ensure that multi-source and multi-type features can be optimized in the same optimization system .

Points to be considered in Engineering ： The delivery model serves as the base model , It is widely used in path construction 、 pricing 、ETA And so on , In the tree model version , The requirement for performance is the average response time 5ms, This scheme needs to consider the performance of the original business , It can not significantly increase the calculation time .

The difficulty of delivery model is that there are many non numerical features , The diversity of information acquisition forms , The current bottleneck is not the low complexity of the model . If information and fusion can be obtained lightly , There's no need to be right Fusion After the information to do heavier processing .

So the overall design idea is ： Using deep learning to fuse non numerical features , In a simple Fusion On the basis of , Get the output structure directly , Choice of components , Choose... As much as possible Flops Lower design . The intention behind the design is , Make full use of the original input information , On the basis of avoiding information loss as much as possible , Incorporate non numerical information into . And fully integrate information , Direct docking of required targets . The fusion component structure is selected to ensure high performance as much as possible , And have high learning efficiency . Here is the address for the selection of a more Robust Of LSTM, in the light of GPS Choose the custom bilinear Embedding, Both performance and effect .

Iterative path Start using end-to-end deep learning model , The first thing we need to solve here is coverage , Direct adoption LSTM Read charLevel Address data for , Direct output delivery time through full connection layer . As the first version of the data , This version of data is basically the same as the tree model , But for tree models not recalled 20% data , A great improvement .

When using charLevel When the address works , Start adding user addresses GPS Information about , because GPS For latitude and longitude information , Non numerical data , We use a bilinear interpolation method based on the geographical location grid Embedding. The scheme has certain expansibility , For different GPS All can reasonably get Embedding vector , At the same time, it has smooth characteristics , For many pairs with small offset GPS Point can support well .

The final solution will address Embedding after , as well as GPS Dot Embedding After transformation , Join the order time 、 City ID、 Area ID Wait for the feature , Then feature fusion and transformation , Get the time estimate output of the delivery model . The whole model is an end-to-end training , All parameters are Trainable.

Note: This scheme is relatively complicated , Suitable for large-scale use , Not suitable at the initial stage .

Extension components

Is confirming End-to-End When the path works , We started to build the extended components , Include custom loss functions 、 Data sampling correction 、 National Model unification and other operations , Get a series of positive effects , And develop online .

Feature importance analysis

For deep learning models , We have a series of feature importance assessment programs , This is done in sequence Feature Permutation The way , As a way to assess the importance of model features .

The characteristics and importance of this scheme ： Address of the user >GPS Longitude and latitude > Other features .

notes ： In other cases of distribution , Merchant GPS The importance of latitude and longitude > Importance of user address > user GPS The importance of latitude and longitude , There may be obvious differences under different learning objectives .

Final effect

End-to-End The final effect of deep learning model is more significant ： The most painful point for tree model and vector recall scheme , Coverage is completely addressed , Coverage increased to 100%.ME falling 4.96s,MAE falling 8.17s,1min The absolute deviation rate decreases 2.38pp,2min The absolute deviation rate decreases 5.08pp,3min The absolute deviation rate decreases 3.46pp. meanwhile , For the waybill that cannot be covered by previous tree model and vector recall scheme , Ascension is more obvious .

3. Model correlation analysis

It mainly analyzes the performance of vector recall scheme and deep learning scheme , In order to confirm the performance indicators online , Finally, it can ensure that the performance can meet the requirements .

3.1 Vector recall performance

Nearest neighbor search （Nearest Neighbor Search） It refers to finding the problem closest to the query point in a high-dimensional space . When the data sample is small , Through linear search can meet the demand , But as the amount of data increases , If it reaches millions 、 Hundreds of millions of hours , Tend to structure data to express vector information more accurately .

At this time, the nearest neighbor search is approximate ANN（Approximate Nearest Neighbor） It's a reference technology , It's able to do something similar to recall , And linear search , Balance efficiency and accuracy . At present, there are basically the following 3 Class mainstream method ： Tree based approach , Such as K-D Trees, etc ; Hash based method , for example LSH; Based on vector quantization method , for example PQ Product quantization . In the industrial retrieval system , Product quantization is an index method used more often .

Tools for vector recall , There are a lot of open source implementations , In the process of technology selection , You can refer to ANN-Benchmarks as well as Erikbern/ANN-Benchmarks Performance evaluation results in .

3.2 Sequence module performance

In order to evaluate the performance bottleneck of deep learning delivery model , First, the whole model is Profile, The whole calculation is mostly consumed in the processing part of the sequence module . The computing performance of the sequence module needs to be evaluated OP Acceleration of operators .

3.3 Vector effect analysis

Compare vector recall with deep learning model , In the middle of the two processes, high-dimensional vectors are generated . Compared to vector recall , Where is the improvement brought about by deep learning model ？ Supervised lstm Learned Embedding Vector and self supervised Word2Vec How different are the resulting vectors in the calculation of address similarity , Which is better or worse? ？

First , Let's analyze the first question ,End-to-End Where does the model upgrade come from ？ After testing, it is known that , The improvement of deep learning model mainly comes from redundant information in address （ Compared to vector recall ） The use of , Next is the addition of many new features . in addition , Compare two End-to-End Effect of model , Address affixes also contain information useful for matching addresses .

For the second question , Supervised End-to-End Learned Embedding vector , With self supervision Word2Vec How different are the resulting vectors in the calculation of address similarity , Which is better or worse? ？ Use address trunk instead of full address , As End-to-End Model input for training , Other information remains the same . Use the address trunk to train Embedding vector , Apply to the vector recall scheme .

Judging from the evaluation results , For different thresholds ,End-to-End The difference in performance is relative Word2Vec smaller . At the same threshold ,End-to-End Higher recall rate , But the effect is not as good as Word2Vec.

From the result of similar calculation ,End-to-End The model will put some semantically unrelated but delivery time similar addresses , Map to the same vector space , and Word2Vec It is to learn a more general text vector representation .

If you want to target more complex goals and introduce more information , have access to End-to-End frame ; Just calculating text similarity , From the results of the experiment ,Word2Vec Better . meanwhile , By looking at Case It can also be found ,End-to-End Pay more attention to the similarity of results , So we recall some vectors that are not semantically relevant . The difference between the two model objectives , This leads to differences in results .