Intensive literature reading series (I): Courier routing and assignment for food delivery service using reinforcement learning
2022-07-06 13:50:00 【zhugby】
Source: an article published in 2022 in the journal COMPUTERS & INDUSTRIAL ENGINEERING. Basic information about the journal and its impact factor are shown in the figure below:
Abstract
A Markov decision process is used to model the real-world food delivery service, and the delivery problem is formulated as a deep reinforcement learning problem. The system assigns orders to couriers; a courier picks up an order at the designated restaurant and delivers it to its destination. The goal is to maximize the total delivery revenue given limited time and a limited number of couriers. The article tries three methods to solve this problem: 1. single-agent Q-learning; 2. single-agent DDQN; 3. multi-agent DDQN. Simulation experiments are used to learn the policies of the three algorithms and compare them with a rule-based policy. The results show that: 1. allowing couriers to reject orders and to move while idle effectively improves revenue; 2. among the three methods, the single-agent DDQN algorithm performs best, and the RL-based methods outperform the rule-based policy.
1. Research contribution
- Unlike the shortest-path VRP, this work maximizes the total expected return by reducing delivery time; unlike traditional goods delivery, order information is not known in advance and orders are generated randomly;
- When couriers are idle, they can choose either to stay where they are or to move toward a restaurant, which is expected to shorten the delivery time of the next order; couriers are allowed to reject assigned orders, which is expected to increase the total revenue of the system;
- A machine learning method is used to solve the courier routing problem for food delivery, which can be extended to other problems in future research;
- For large-scale problems, simplifying the Q-learning algorithm to train a single agent significantly speeds up training and yields better results.
2. Literature review
The common approaches to the food delivery problem in the existing literature are linear programming and approximate dynamic programming (ADP); RL algorithms are rarely used. This paper applies both reinforcement learning and deep reinforcement learning and compares the performance of the different methods. Order rejection and anticipation (predicting where the next order will appear and moving an idle courier toward that restaurant) are rarely considered in the literature; this paper considers these attributes at the same time, which makes the model more complete.
3. Problem description
3.1 Problem description
N couriers pick up food from restaurants and deliver it to customers on an m*m grid map. The goal is to find an action policy that maximizes the total income of the couriers and guides their decision-making.
3.2 Problem assumptions / prerequisites
- Orders are part of the environment and are assigned to couriers according to the principle of minimizing the sum of the distances between couriers and orders (a sketch of this assignment rule follows this list).
- After an order is assigned, the courier may refuse to deliver it.
- A rejected order is returned to the order list and reassigned to another courier.
- If an order is rejected by all assigned couriers within a certain period of time, it leaves the system and is never delivered.
- Each grid cell has a different probability of being an order origin (restaurant) or destination (customer location): (1) restaurants are mostly concentrated in certain areas; (2) destinations with many orders also tend to cluster in certain areas (e.g., a university).
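The first assumption above describes how new orders are matched to couriers. A minimal sketch of that assignment rule is shown below, assuming Manhattan distance on the grid and a simple per-order greedy choice; the metric, function names, and data layout are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: assign a new order to the courier closest to the order's origin.
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def assign_order(order_origin, courier_positions):
    """courier_positions: dict courier_id -> (row, col); returns the chosen courier id."""
    distances = {cid: manhattan(pos, order_origin) for cid, pos in courier_positions.items()}
    return min(distances, key=distances.get)

# Example: three couriers on the grid, a new order originating at cell (3, 3).
couriers = {0: (1, 2), 1: (4, 4), 2: (0, 0)}
print(assign_order((3, 3), couriers))   # -> 1 (the courier at (4, 4) is closest)
```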
3.3 Problem definition
To solve the courier routing problem with an RL method, the problem has to be cast as an RL problem by defining its RL elements:
State
The state of each courier i is a triplet recording courier i's current position, the origin (restaurant) of the order courier i is delivering, and the destination of that order.
Size of the state space:
Order origin space +2: besides the m*m grid cells where an order can originate, two extra statuses are added: 1) on the way to pick up / deliver a meal; 2) idle with no order.
Order destination space +1: besides the m*m grid cells, one extra status is added: the courier is idle with no order to deliver and hence no destination.
In addition, each order has five attributes: order origin, order destination, order service time, number of times it has been rejected, and the ID of the courier assigned to deliver it.
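The state and order attributes above can be summarized in a small data sketch. This is only an illustration of the structures described in this section; field names and types are assumptions, not the paper's notation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Cell = Tuple[int, int]                   # (row, col) on the m x m grid

@dataclass
class Order:                             # the five order attributes listed above
    origin: Cell                         # restaurant cell
    destination: Cell                    # customer cell
    remaining_service_time: int          # time left before the order leaves the system
    reject_count: int                    # how many times the order has been rejected
    assigned_courier: Optional[int]      # courier id, or None if unassigned

@dataclass
class CourierState:                      # the per-courier triplet
    position: Cell                       # courier's current cell
    order_origin: Optional[Cell]         # None covers the extra "idle / no order" statuses
    order_destination: Optional[Cell]    # None when there is no order and hence no destination
```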
Actions
For each courier, the action choices (as reflected in the reward definition below) include moving one grid step, waiting in place, and rejecting an assigned order.
Rewards: each <state, action> pair maps to a reward.
- Each move step while serving an accepted order: reward -1; each move step while idle: -0.1.
- Each order successfully picked up and delivered: positive reward of m².
- Failed pickup or delivery (picking up or dropping off the wrong order): penalty of -m².
- Staying idle and waiting: no reward.
- Waiting in place instead of moving after an order has been assigned: penalty of -m².
- Rejecting an order right after it is assigned: -m/3.
- Rejecting an order while already delivering it: penalty of -m².
In addition, if an order is not effectively assigned to a courier within its service time, it leaves the system and an additional penalty is incurred. In real life, every customer has a certain level of patience; if he/she is not served within a certain time, he/she leaves the system.
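A compact sketch of the reward scheme just described, written as a lookup table (m is the grid side length; the event labels are my own, not the paper's notation):

```python
# Hedged sketch of the per-event rewards summarized above.
def step_reward(event: str, m: int) -> float:
    table = {
        "move_with_order": -1.0,           # each step while serving an accepted order
        "move_idle": -0.1,                 # each step while idle
        "delivery_success": m ** 2,        # order picked up and delivered correctly
        "delivery_failure": -(m ** 2),     # wrong pickup or wrong drop-off
        "idle_wait": 0.0,                  # waiting in place with no order assigned
        "wait_with_order": -(m ** 2),      # waiting in place after an order was assigned
        "reject_on_assignment": -m / 3,    # rejecting an order right after assignment
        "reject_mid_delivery": -(m ** 2),  # rejecting an order already being delivered
    }
    return table[event]
```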
4. Solution methods
4.1 Basic framework of reinforcement learning
Reinforcement learning is the process by which an agent learns to achieve a goal while interacting with the environment. The agent observes the environment and, based on its current state, takes an action; every action changes the environment, the agent receives a new state, then chooses a new action, and the cycle continues.
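The interaction loop described above can be written down generically. The toy environment below is only a stand-in to make the loop runnable; it is not the paper's delivery simulator, and the random policy stands in for the learned one.

```python
import random

class ToyEnv:
    """A trivial 1-D environment used only to illustrate the agent-environment loop."""
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):                       # action is -1 or +1
        self.pos += action
        reward = 1.0 if self.pos == 3 else -0.1   # reach cell 3 to get a positive reward
        return self.pos, reward, self.pos == 3

env = ToyEnv()
state = env.reset()
for t in range(20):
    action = random.choice([-1, 1])               # agent picks an action based on its state
    next_state, reward, done = env.step(action)   # environment changes and returns reward
    state = next_state                            # agent observes the new state and repeats
    if done:
        break
```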
4.2 Q-learning
For a detailed introduction to the Q-learning algorithm, see the blog post Reinforcement Learning Series (II): introduction to the Q-learning algorithm and a Python implementation of Q-learning for the TSP problem.
Shortcoming of the Q-learning algorithm: it only suits small-scale problems; for large-scale problems the Q-table suffers from a state explosion.
Trick: simplify the problem by treating couriers as homogeneous, so Q-learning only trains the action-selection policy of a single agent (courier) and applies it to all couriers, i.e., single-agent Q-learning (a short sketch of the shared-table update follows).
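A minimal sketch of this single-agent (shared-table) Q-learning update, assuming illustrative hyperparameter values and an assumed action count:

```python
import numpy as np
from collections import defaultdict

n_actions = 6                                    # hypothetical size of the action set
alpha, gamma, epsilon = 0.1, 0.95, 0.1           # illustrative hyperparameters

Q = defaultdict(lambda: np.zeros(n_actions))     # one shared table, reused by every courier

def choose_action(state):
    if np.random.rand() < epsilon:               # epsilon-greedy exploration
        return int(np.random.randint(n_actions))
    return int(np.argmax(Q[state]))

def q_update(state, action, reward, next_state):
    best_next = np.max(Q[next_state])            # greedy value of the next state
    Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
```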
4.3 DDQN
To make up for the disadvantage of the Q-learning algorithm, the Q-table is replaced with two deep neural networks; the network takes a state as input and outputs the Q-value of each action.
The two networks are the online prediction (eval) network, which produces the predicted Q-value, and the target network, which produces the target Q-value.
Following the referenced work, the algorithm flow is summarized as follows:
Stochastic gradient descent is used to update the network parameters. A possible problem is that, when the gradient is too large, the network parameters change too much and the network becomes unstable; therefore, the paper adopts a gradient clipping strategy that normalizes the gradient vector to within 0-1, making the DDQN model more stable.
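A hedged sketch of the DDQN update with gradient clipping described above, using PyTorch: the online (eval) network selects the next action, the target network evaluates it, and the gradient norm is clipped before the optimizer step. Network sizes, optimizer settings, and the clipping threshold are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 8, 6, 0.95          # hypothetical dimensions

def make_net():
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

online_net, target_net = make_net(), make_net()
target_net.load_state_dict(online_net.state_dict())   # target starts as a copy of the online net
optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-3)

def ddqn_update(s, a, r, s_next, done):
    # s, s_next: float tensors [batch, state_dim]; a: long tensor [batch, 1]; r, done: [batch, 1]
    q_pred = online_net(s).gather(1, a)                               # Q(s, a) from the online net
    with torch.no_grad():
        best_next = online_net(s_next).argmax(dim=1, keepdim=True)    # action selected by online net
        q_next = target_net(s_next).gather(1, best_next)              # evaluated by the target net
        q_target = r + gamma * (1 - done) * q_next
    loss = nn.functional.smooth_l1_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(online_net.parameters(), max_norm=1.0)   # gradient clipping
    optimizer.step()
    return loss.item()
```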
4.4 Rule-based algorithm
This serves as a baseline for evaluating the performance of the RL algorithms. The courier moves along the shortest path to pick up and deliver orders, waits in place when no task is assigned, and is not allowed to reject orders, so that the analysis can focus on the effect of the stay and reject actions.
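A minimal sketch of such a rule-based baseline, assuming greedy Manhattan moves toward the current target; the exact tie-breaking and representation are assumptions for illustration.

```python
# Hypothetical rule-based step: move one cell toward the target, wait if no order is assigned.
def rule_based_step(position, target):
    if target is None:
        return (0, 0)                             # no assigned order: wait in place (never reject)
    dr, dc = target[0] - position[0], target[1] - position[1]
    if dr != 0:                                   # close the row gap first
        return (1 if dr > 0 else -1, 0)
    if dc != 0:                                   # then the column gap
        return (0, 1 if dc > 0 else -1)
    return (0, 0)                                 # already at the pickup/drop-off cell
```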
5. Simulation evaluation
The simulation experiment proceeds according to the simulation flow chart (a code sketch follows the list):
• Generate the initial courier locations and order locations (state)
• Each courier chooses an action according to the policy and moves
• Update the environment [order list and the service time of all orders]
• Compute the reward
• Once an order is delivered, generate a new order
• After the number of iterations is reached, collect the data (number of delivered orders, sum of rewards) used as the metrics of algorithm performance
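The flow above can be summarized as a single episode loop. This is a sketch under assumed interfaces (`env`, `policy`, and an `info` dictionary reporting deliveries), not the authors' simulator code.

```python
def run_episode(env, policy, max_steps=10_000):
    state = env.reset()                               # initial courier and order locations
    total_reward, delivered = 0.0, 0
    for _ in range(max_steps):
        action = policy(state)                        # couriers move according to the policy
        state, reward, info = env.step(action)        # order list and service times are updated
        total_reward += reward
        delivered += info.get("orders_delivered", 0)  # new orders are spawned inside env on delivery
    return total_reward, delivered                    # metrics used to compare the algorithms
```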
Problem scale
Experiments are carried out under different problem scales (grid size, number of couriers, size of the state space and action space).
Heat maps of the distributions of restaurants and order destinations
6. Experimental results
Comparison of the three methods :
1. Q-learning: trains a single agent's policy and applies it to all couriers;
2. DDQN (single agent): trains a single agent's policy and applies it to all couriers;
3. DDQN (multi-agent): directly trains the agents' policies;
6.1 Q-learning single agent
For single-agent Q-learning, as the state space / problem scale grows, the convergence of the algorithm gradually slows down.
6.2 Single-agent DDQN vs. multi-agent DDQN
The reward values of the single-agent version converge faster.
6.3 Comparison of average collected reward
1) Single-agent DDQN > Q-learning > rule-based policy
2) Multi-agent DDQN is too time-consuming to train on large-scale problems
6.4 Comparison of average delivered orders
There is little difference between the average numbers of orders delivered by the different methods.
6.5 Visualization
Scenario 1: the courier receives an order after being idle for a while
Figure (a) uses DDQN and allows the courier to move while idle. Compared with the rule-based policy, allowing the courier to move in idle time incurs a smaller penalty, earns a higher reward, and needs fewer iterations / less time to reach the destination.
Scenario 2: the courier is allowed to reject orders
In Figure (a), with DDQN, the courier rejects orders that are far away and waits for newer, closer ones. Compared with the traditional rule-based policy that cannot reject orders, the reward is higher and fewer iterations / less time are needed to reach the destination.
7. Limitations and outlook
The state space only considers order and courier location information; attributes such as courier working hours and order preparation time could be added to improve the model.
The number of couriers considered in this paper is small; each additional courier makes the state space grow exponentially. Aggregating the attribute features of couriers and orders could reduce the dimensionality and shrink the state space.