About the Weekly:
Reinforcement learning is one of the research hotspots in artificial intelligence, and its progress and achievements attract wide attention. To help researchers and engineers follow the latest developments in the field, the Zhiyuan community has compiled issue 52 of the Reinforcement Learning Weekly. This issue collects recent paper recommendations and research overviews in reinforcement learning for our readers.
The Weekly is produced through community collaboration; anyone interested is welcome to join our work and help promote sharing, learning, and exchange in the reinforcement learning community. You can scan the QR code at the end of this article to join the reinforcement learning community.
Contributors to this issue: Li Ming, Liu Qing, Xiaopang
About subscribing to the Weekly:
Good news: the Reinforcement Learning Weekly now supports subscriptions, so future issues will be pushed to you automatically. To subscribe:
1. Register a Zhiyuan community account.
2. On the Weekly page, click "Reinforcement Learning Weekly" in the author column at the upper left (see picture) to open its home page.
3. Click "Follow" (see picture).
4. That's it! The Zhiyuan community will automatically push each new issue of the Reinforcement Learning Weekly to you.
Paper recommendations
This issue recommends 15 recent papers in reinforcement learning. Highlights include: replacing the convolutional neural network architecture with the self-attention architecture of the Swin Transformer to improve evaluation scores; combining reinforcement learning with contrastive learning for efficient decision-making in mapless UAV navigation; trading a single asset with a Double Deep Q-Network to form preliminary insights into agent behavior in finance; determining optimal electric-vehicle charging locations with reinforcement learning; developing a self-aware driving recommendation system based on deep reinforcement learning to inform the revision of policy measures for key traffic-management controllers; making dynamic decisions under changing traffic conditions with Deep Q-Network (DQN) and Advantage Actor-Critic (A2C) algorithms; solving target localization in multi-agent systems with multi-agent deep reinforcement learning (MDRL) models; and maximizing efficiency and computational power with a new recurrent neural unit, the STP Neuron (STPN).
Title: Deep Reinforcement Learning with Swin Transformer (University of Oslo: Li Meng)
Introduction: Transformers are neural network models that use multiple layers of self-attention heads, attending over "key" and "query" context embeddings. In recent years Transformers have shown excellent performance on natural language processing tasks. The Swin Transformer splits image pixels into small patches and applies local self-attention within fixed-size (shifted) windows. Decision Transformers have been successfully applied to offline reinforcement learning, showing that random-walk samples from Atari games are enough for an agent to learn optimized behavior. Combining online reinforcement learning with Transformers, however, is more challenging. This paper explores the possibility of keeping the reinforcement learning strategy unmodified and only replacing the convolutional neural network architecture with the self-attention architecture of the Swin Transformer: the goal is to change how the agent views the world, not how it plans in it. Experiments on 49 games in the Arcade Learning Environment show that using the Swin Transformer in reinforcement learning achieves significantly higher evaluation scores in most of the games.
Paper link: https://arxiv.org/pdf/2206.15269.pdf
Read more
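The windowed self-attention at the core of the Swin Transformer can be sketched compactly. Below is a minimal, framework-free, single-head illustration of attention within one local window of patch embeddings; the function names are ours, and the shifted-window bookkeeping, multiple heads, and learned relative position biases of the real architecture are deliberately omitted.

```python
import math

def window_attention(tokens, wq, wk, wv):
    """Self-attention over the patch embeddings of one local window:
    each patch attends to every other patch in the same window."""
    def matvec(W, v):
        return [sum(w * x for w, x in zip(row, v)) for row in W]
    Q = [matvec(wq, t) for t in tokens]
    K = [matvec(wk, t) for t in tokens]
    V = [matvec(wv, t) for t in tokens]
    d = len(Q[0])
    out = []
    for q in Q:
        # scaled dot-product scores against every key in the window
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # numerically stable softmax
        z = sum(exps)
        weights = [e / z for e in exps]
        # output = attention-weighted mix of the value vectors
        out.append([sum(w * v[i] for w, v in zip(weights, V)) for i in range(d)])
    return out
```

Restricting the token set to one window is what makes the cost linear in image size rather than quadratic, which is the property that makes this encoder practical as a drop-in replacement for a CNN.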
Title: Depth-CUPRL: Depth-Imaged Contrastive Unsupervised Prioritized Representations in Reinforcement Learning for Mapless Navigation of Unmanned Aerial Vehicles (FURG: Junior C. de Jesus)
Introduction: Reinforcement learning has shown impressive performance in video games from raw pixels and in continuous control tasks. However, RL performs poorly with high-dimensional observations such as raw pixel images; it is generally believed that policies based on physical state (such as laser sensor measurements) are more sample-efficient than learning from pixels. This paper presents a new approach that extracts information from a depth-map estimate to teach an RL agent to perform mapless navigation of a UAV. The proposed Depth-Imaged Contrastive Unsupervised Prioritized Representations in Reinforcement Learning (Depth-CUPRL) estimates the depth of images using a prioritized replay memory, and combines reinforcement learning with contrastive learning to tackle the problem of image-based RL. From an analysis of the results on unmanned aerial vehicles (UAVs), the authors conclude that Depth-CUPRL is effective for decision-making and outperforms state-of-the-art pixel-based approaches in mapless navigation.
Paper link: https://arxiv.org/pdf/2206.15211.pdf
Read more
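Contrastive representation learning of the kind Depth-CUPRL builds on typically uses an InfoNCE-style loss: the embedding of an image is pulled toward an augmented view of itself (the positive) and pushed away from other images (the negatives). Here is a small generic sketch over plain embedding vectors; it illustrates the loss family only, not the paper's exact encoder, augmentations, or loss.

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: -log softmax score of the positive pair
    among {positive} + negatives, using cosine similarity."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
    logits = [cos(anchor, positive) / temperature] + \
             [cos(anchor, n) / temperature for n in negatives]
    m = max(logits)                                     # for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_sum)
```

The loss is near zero when the anchor matches its positive and mismatches the negatives, and grows as that ordering degrades, which is exactly the training signal that shapes the depth-image representation.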
Title: Conditionally Elicitable Dynamic Risk Measures for Deep Reinforcement Learning (University of Toronto: Anthony Coache)
Introduction: This paper proposes a new framework for solving risk-sensitive reinforcement learning (RL) problems in which the agent optimizes a time-consistent dynamic spectral risk measure. Based on the notion of conditional elicitability, the method constructs (strictly consistent) scoring functions that are used as penalizers in the estimation procedure. There are three main contributions: (i) an efficient approach to estimating a class of dynamic spectral risk measures with deep neural networks; (ii) a proof that these dynamic spectral risk measures can be approximated to arbitrary accuracy by deep neural networks; and (iii) a risk-sensitive actor-critic algorithm that uses full episodes and does not require additional nested transitions. The proposed algorithm is compared with a nested-simulation approach, and its performance is illustrated in two settings: statistical arbitrage and portfolio allocation on both simulated and real data.
Paper link: https://arxiv.org/pdf/2206.14666.pdf
Read more
Title: Traffic Management of Autonomous Vehicles using Policy Based Deep Reinforcement Learning and Intelligent Routing (Pakistan Institute of Engineering and Applied Sciences (PIEAS): Anum Mushtaq)
Introduction: Deep reinforcement learning (DRL) uses diverse unstructured data and enables RL to learn complex policies in high-dimensional environments. Intelligent transportation systems (ITS) based on autonomous vehicles (AVs) offer an excellent playground for policy-based DRL: deep learning architectures address the computational challenges of traditional algorithms and support the real-world adoption of AVs. A major challenge in AV deployment is that, without reliable and efficient management, AVs may worsen road congestion. Considering the overall effect of each vehicle and applying efficient, reliable techniques to optimize traffic-flow management and reduce congestion, this paper proposes an intelligent traffic control system that handles complex congestion scenarios both at and behind intersections. A DRL-based signal control system dynamically adjusts traffic signals according to the current congestion at the intersection, while a rerouting technique balances the vehicle load across the road network to relieve congestion behind intersections. By breaking data silos and jointly using all data from sensors, detectors, vehicles, and roads, the approach aims at sustainable results. Simulations with the SUMO micro-simulator show that the method is effective.
Paper link: https://arxiv.org/pdf/2206.14608.pdf
Read more
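To make the signal-control idea concrete, here is a deliberately tiny tabular stand-in for the paper's DRL controller: congestion levels act as states and signal phases as actions. The state and action names are invented for illustration; a real DQN replaces the table with a neural network trained on replayed transitions from the simulator.

```python
import random

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One temporal-difference update toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

def choose_phase(Q, s, eps, rng=random):
    """Epsilon-greedy choice among signal phases for the current congestion state."""
    if rng.random() < eps:
        return rng.choice(list(Q[s]))
    return max(Q[s], key=Q[s].get)

# Hypothetical congestion states and signal-phase actions:
Q = {"jammed": {"NS_green": 0.0, "EW_green": 0.0},
     "free":   {"NS_green": 0.0, "EW_green": 0.0}}
# Reward 1.0 for a phase choice that relieved a queue:
q_update(Q, "jammed", "NS_green", 1.0, "free")
```

The dynamic adjustment described in the paper corresponds to calling the greedy policy (`eps` near 0 after training) each control cycle with the currently observed congestion state.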
Title: DistSPECTRL: Distributing Specifications in Multi-Agent Reinforcement Learning Systems (Purdue University: Joe Eappen)
Introduction: Although significant progress has been made in specifying and learning objectives for general cyber-physical systems, applying such methods to distributed multi-agent systems still poses major challenges. These include the need to (a) craft specification primitives that allow the expression and interplay of both local and global objectives, (b) tame the explosion in state and action spaces to enable effective learning, and (c) minimize the coordination frequency and the set of participants involved in global objectives. To this end, the paper proposes a new framework for training multi-agent systems that allows a natural composition of local and global objectives. The technique learns expressive policies that let agents operate on local objectives in an uncoordinated manner while using a decentralized communication protocol to achieve global ones. The experimental results support the claim that specification-guided learning can effectively realize complex distributed planning problems for multiple agents.
Paper link: https://arxiv.org/pdf/2206.13754.pdf
Read more
Title: Applications of Reinforcement Learning in Finance -- Trading with a Double Deep Q-Network (ZHAW: Frensi Zejnullahu)
Introduction: This paper presents a Double Deep Q-Network algorithm for trading a single asset, namely the E-mini S&P 500 continuous futures contract. A proven setup serves as the basis for an environment with multiple extensions: the trading agent's capabilities are extended to include additional assets such as commodities, yielding four models in total, and environmental conditions including costs and crises are also addressed. The trading agent is first trained on a specific time period and then tested on new data, and it is compared against a long-and-hold strategy as a benchmark (the market). The differences between the various models and their in-sample versus out-of-sample performance relative to the environment are analyzed. The experiments show that the trading agent behaves appropriately and can adjust its policy to different circumstances, for example making broader use of neutral positions when transaction costs are present. Moreover, its net asset value exceeds the benchmark, and the agent outperforms the market on the test set. The DDQN algorithm thus provides preliminary insights into agent behavior in the financial domain, which can serve as a basis for further development.
Paper link: https://arxiv.org/pdf/2206.14267.pdf
Read more
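The distinguishing step of a Double DQN is how the training target is built: the online network selects the next action while the target network evaluates it, which damps the Q-value overestimation of plain DQN. A minimal sketch, with tabular dictionaries standing in for the two networks and a hypothetical short/neutral/long action set:

```python
def double_dqn_target(reward, next_state, done, online_q, target_q, gamma=0.99):
    """Double DQN target: argmax from the online net, value from the target net."""
    if done:
        return reward
    q_online = online_q[next_state]   # stand-in for online_net(next_state)
    q_target = target_q[next_state]   # stand-in for target_net(next_state)
    a_star = max(range(len(q_online)), key=lambda a: q_online[a])
    return reward + gamma * q_target[a_star]

# Hypothetical Q-values for actions [short, neutral, long] in state "s":
online = {"s": [1.0, 3.0, 2.0]}
target = {"s": [0.5, 1.0, 2.0]}
# Online net picks action 1 (value 3.0); target net evaluates it as 1.0,
# so the target is 1.0 + 0.9 * 1.0:
y = double_dqn_target(1.0, "s", False, online, target, gamma=0.9)
```

A vanilla DQN would instead have used the target net's own maximum (2.0 here), illustrating how decoupling selection from evaluation yields more conservative targets.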
Title: An optimization planning framework for allocating multiple distributed energy resources and electric vehicle charging stations in distribution networks (University of the Witwatersrand: Kayode E. Adetunji)
Introduction: Battery energy storage systems (BESS) and other passive electronic units can improve grid performance and mitigate the impact of the high variability of renewable generation; planning frameworks have therefore been developed to optimally allocate these devices in distribution networks. However, current planning mechanisms do not consider the relative impact of the different devices within the planning framework. This paper proposes a new comprehensive planning framework that allocates distributed generation (DG) devices, BESS devices, and electric vehicle charging station (EVCS) facilities in a distribution network while optimizing their technical, economic, and environmental benefits. A recomposition technique iteratively and dynamically updates the locations of the DG and BESS devices to generate more candidate solutions, and a reinforcement-learning-based algorithm is introduced to coordinate electric vehicle charging, proposing optimal EV charging locations relative to the other units. To cope with the complexity of searching the enlarged solution space, a multi-stage hybrid optimization scheme generates the optimal allocation variables, and a classification-based multi-objective framework is further developed to optimize multiple objective functions simultaneously.
Paper link: https://www.sciencedirect.com/sdfe/reader/pii/S0306261922008339/pdf
Read more
Title: Deep Reinforcement Learning for Personalized Driving Recommendations to Mitigate Aggressiveness and Riskiness: Modeling and Impact Assessment (National Technical University of Athens: Eleni G. Mantouka)
Introduction: Most driving recommendation and assistance systems, such as advanced driver assistance systems (ADAS), are usually designed around the behavior of an average driver. However, personalized driving systems that adapt to different driving styles and identify individual needs and preferences may be the key to increasing drivers' receptiveness and promoting safer driving habits. This paper develops a self-aware driving recommendation system using a deep reinforcement learning algorithm; the system generates personalized driving recommendations to improve driving safety while respecting individual driving styles and preferences. The impact of applying the recommendation system is evaluated through microscopic simulation. The results indicate that if all drivers followed the recommendations, road safety would improve significantly, with only minor changes to traffic-flow characteristics. The outputs of this paper may be useful within the framework of advanced adaptive cruise control systems, can be used to develop enhanced behavioral models, and may even inform the revision of policy measures that use driving behavior as a key controller for traffic management.
Paper link: https://www.sciencedirect.com/sdfe/reader/pii/S0968090X22002029/pdf
Read more
Title: Understanding via Exploration: Discovery of Interpretable Features With Deep Reinforcement Learning (Central South University: Jiawen Wei)
Introduction: Understanding an environment through interaction is one of the most important intellectual activities by which humans master unknown systems. Deep reinforcement learning (DRL) is well known to achieve effective control in many applications through human-like exploration and exploitation, but the opacity of deep neural networks (DNNs) often hides key control-related information that is essential for understanding the target system. This paper proposes a new online feature-selection framework, dual-world based attentive feature selection (D-AFS), to identify the contribution of each input to the overall control process. Unlike the single world used by most DRL methods, D-AFS has both the real world and a distorted virtual world; a newly introduced attention-based (AR) module realizes a dynamic mapping from the real world to the virtual one. Existing DRL algorithms need only minor modification to learn in the dual worlds. By analyzing the DRL responses in the two worlds, D-AFS can quantitatively identify the importance of each feature for control.
Paper link: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9810174
Read more
Title: The flying sidekick traveling salesman problem with stochastic travel time: A reinforcement learning approach (University of Tennessee: Zeyu Liu)
Introduction: As a new mode of urban delivery, truck-drone cooperative operations are becoming increasingly popular: the truck follows a traveling-salesman route while the drone takes off from the truck to deliver packages to nearby customers. The problem is known as the flying sidekick traveling salesman problem (FSTSP), and many algorithms have been proposed to solve it. However, few studies consider the stochasticity of travel times in the road network. This paper extends the FSTSP to stochastic travel times and formulates the problem as a Markov decision process (MDP). The model is solved with reinforcement learning (RL) algorithms, including Deep Q-Network (DQN) and Advantage Actor-Critic (A2C), to overcome the curse of dimensionality. On widely accepted, manually generated datasets, experiments show that the RL algorithms perform well against approximate optimization algorithms. On the FSTSP with stochastic travel times, the RL algorithms obtain flexible policies that make dynamic decisions according to changing traffic conditions on the road.
Paper link: https://www.sciencedirect.com/sdfe/reader/pii/S1366554522002034/pdf
Read more
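The A2C half of the paper's toolkit rests on bootstrapped n-step advantage estimates: discounted returns are rolled up backwards from a critic's bootstrap value, then the critic's state-value estimates are subtracted. A minimal framework-free sketch (a generic A2C building block, not the authors' code; GAE and the actor/critic networks are omitted):

```python
def a2c_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """n-step advantages A_t = (r_t + gamma*r_{t+1} + ... + gamma^n * V(s_n)) - V(s_t).
    `values` are the critic's estimates for the visited states;
    `bootstrap_value` is its estimate for the state after the last reward."""
    returns, R = [], bootstrap_value
    for r in reversed(rewards):       # roll discounted returns up backwards
        R = r + gamma * R
        returns.append(R)
    returns.reverse()
    return [ret - v for ret, v in zip(returns, values)]
```

The actor's log-probabilities are then weighted by these advantages, so actions that beat the critic's expectation are reinforced and the rest are discouraged; in the routing setting, that is what lets the policy respond to realized (stochastic) travel times.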
Title: Data efficient reinforcement learning and adaptive optimal perimeter control of network traffic dynamics (Hong Kong Polytechnic University: C. Chen)
Introduction: Existing data-driven and feedback traffic-control strategies do not consider the heterogeneity of real-time data measurements, and conventional reinforcement learning (RL) methods for traffic control lack data efficiency, converge slowly, and are highly vulnerable to endogenous uncertainty. This paper proposes an integral reinforcement learning (IRL) approach that learns macroscopic traffic dynamics for adaptive optimal perimeter control. The main contributions are: (a) a continuous-time control with discretized gain updates, developed to suit discrete-time sensor data; (b) an experience replay (ER) technique introduced into the IRL algorithm to reduce sampling complexity and use the available data more efficiently; (c) a "model-free" formulation that relaxes the requirement for model calibration, with the data-driven RL algorithm achieving robustness to modeling uncertainty and improving real-time performance; and (d) theoretical proofs of the convergence of the IRL-based algorithm and the stability of the controlled traffic dynamics. The optimal control law is parameterized and then approximated by a neural network (NN), which reduces the computational complexity.
Paper link: https://www.sciencedirect.com/sdfe/reader/pii/S0968090X22001929/pdf
Read more
Title: Clustering Experience Replay for the Effective Exploitation in Reinforcement Learning (University of Electronic Science and Technology of China: Min Li)
Introduction: Reinforcement learning trains an agent to make decisions from the transition experience generated by its decisions. Most reinforcement learning methods replay the explored transitions via uniform sampling, which easily neglects the most recently explored transitions. An alternative is to assign each transition a priority based on its estimation error during training and replay transitions according to priority; but this only updates the priorities of transitions replayed at the current training step, so lower-priority transitions are still ignored. This paper proposes clustering experience replay (CER) to effectively exploit the experience hidden in all explored transitions during the current training. CER clusters and replays transitions through a divide-and-conquer framework based on time division. First, it divides the whole training process into several stages. Second, at the end of each stage, it uses k-means to cluster the transitions explored in that stage. Finally, it constructs a conditional probability density function to ensure that all kinds of transitions are sufficiently replayed in the current training.
Paper link: https://www.sciencedirect.com/science/article/pii/S0031320322003569
Read more
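The stage-wise clustering step can be sketched with a tiny k-means over transition feature vectors plus cluster-balanced sampling. This is an illustrative simplification: the paper's conditional probability density over clusters is replaced here by a uniform per-cluster draw, and the feature encoding of a transition (e.g. a flattened (s, a, r, s') vector) is left abstract.

```python
import random

def kmeans_labels(points, k, iters=10, seed=0):
    """Cluster transition feature vectors with plain k-means; returns one label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    def nearest(p):
        return min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
    for _ in range(iters):
        labels = [nearest(p) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # move each center to its cluster's mean
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return [nearest(p) for p in points]

def cluster_balanced_batch(transitions, labels, k, batch_size, seed=0):
    """Draw ~batch_size/k transitions per cluster, so transition types that were
    explored only briefly are still replayed rather than drowned out."""
    rng = random.Random(seed)
    batch = []
    for c in range(k):
        pool = [t for t, l in zip(transitions, labels) if l == c]
        if pool:
            batch.extend(rng.choices(pool, k=batch_size // k))
    return batch
```

Compared with uniform replay, the per-cluster draw is what guarantees that every discovered "kind" of transition keeps contributing gradient signal.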
Title: Target localization using Multi-Agent Deep Reinforcement Learning with Proximal Policy Optimization (Concordia University: Ahmed Alagha)
Introduction: Target localization surveys an area of interest to identify a target's location from sensory data collected by sensing agents (robots, UAVs, etc.). Existing solutions rely on fusing and analyzing the collected data along predefined or data-driven survey paths, and they lack adaptability, since increasing environment complexity and dynamics demands further remodeling and supervision. This paper proposes several multi-agent deep reinforcement learning (MDRL) models for target localization in multi-agent systems. An actor-critic architecture is combined with convolutional neural networks (CNNs) and optimized with Proximal Policy Optimization (PPO). Each agent's observation is modeled as a two-dimensional heat map capturing the locations of all agents together with the sensor readings. Cooperation among agents is induced through team-based rewards; scalability in the number of agents is ensured by centralized learning with decentralized execution, and scalability in observation size is achieved through image downsampling and Gaussian filters.
Paper link: https://www.sciencedirect.com/science/article/pii/S0167739X22002266
Read more
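PPO's central mechanism is the clipped surrogate objective, which keeps each policy update close to the policy that collected the data. A one-function, per-sample sketch (sign flipped for minimization; the full algorithm adds value-function and entropy terms not shown here):

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate: -min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A),
    where ratio = pi_new(a|s) / pi_old(a|s) and A is the advantage estimate."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio)) * advantage
    return -min(ratio * advantage, clipped)
```

When the advantage is positive, the effective ratio is capped at 1 + eps, so the update cannot push the new policy too far toward an action; when it is negative, the ratio is floored at 1 - eps. This conservatism is a large part of why PPO trains stably in multi-agent settings like the one above.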
Title: Utility Theory for Sequential Decision Making (McGill University: Mehran Shakerinava | ICML 2022)
Introduction: The von Neumann-Morgenstern (VNM) utility theorem shows that, under certain axioms of rationality, decision-making reduces to maximizing the expectation of some utility function. This paper shows that memoryless preferences yield utilities in the form of a per-transition reward together with a multiplicative factor on future returns, which inspires a generalization of Markov decision processes (MDPs) with this structure in the agent's returns, called affine-reward MDPs. Recovering the cumulative sum of scalar rewards commonly used in MDPs requires a stronger constraint on preferences. That stronger constraint simplifies the utility function of goal-seeking agents to the difference of some function of the states, which the authors call a potential function. The paper's necessary and sufficient conditions are obtained by adding an axiom to the VNM rationality axioms; the result demystifies the reward hypothesis underlying the design of rational agents in reinforcement learning and suggests new research directions for those working on sequential decision-making in AI.
Paper link: https://arxiv.org/pdf/2206.13637.pdf
Read more
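In the notation suggested by the summary above, the affine-reward structure can be written roughly as follows; this is a paraphrase of the structure described, with symbol names of our choosing, not the paper's exact statement.

```latex
% Sketch: each transition (s_t, a_t, s_{t+1}) contributes a reward r_t
% and a multiplicative factor m_t applied to everything that follows.
U(s_0, a_0, s_1, a_1, \dots)
  = r_0 + m_0 \bigl( r_1 + m_1 \left( r_2 + m_2 ( \cdots ) \right) \bigr),
\qquad r_t = r(s_t, a_t, s_{t+1}), \quad m_t = m(s_t, a_t, s_{t+1}).
% Taking a constant factor m_t \equiv \gamma recovers the familiar
% discounted return \sum_t \gamma^t r_t of a standard MDP.
```

The stronger preference constraint mentioned in the summary is what collapses the general factors m_t back to the constant-discount special case.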
Title: Short-Term Plasticity Neurons Learning to Learn and Forget (Huawei & University College London: Hector Garcia Rodriguez | ICML 2022)
Introduction: Short-term plasticity (STP) is a mechanism in the synapses of the neocortex that stores decaying memories. This paper presents a new type of recurrent neural unit, the STP Neuron (STPN), whose key mechanism is that synapses have a state, propagated through time by a self-recurrent connection within the synapse. This formulation enables the plasticity to be trained by backpropagation through time, producing a form of short-term learning and forgetting. The STPN outperforms all tested alternatives, namely RNNs, LSTMs, and other models with fast weights and differentiable plasticity, as confirmed in reinforcement learning (RL) and other tasks including associative retrieval, maze exploration, Atari video games, and MuJoCo robotics. Furthermore, the paper calculates that, in neuromorphic or biological circuits, the STPN minimizes energy consumption across the models compared, because it dynamically suppresses individual synapses. Based on these results, biological STP may have been a strong evolutionary attractor that maximizes both efficiency and computational power.
Paper link: https://arxiv.org/pdf/2206.14048.pdf
Read more
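The synapse-with-state idea can be illustrated with a toy single unit: each synapse carries a fast variable F that decays every step (forgetting) and is reinforced by correlated pre/post activity (short-term learning), and the effective weight is the slow weight plus F. This is a hand-written caricature of the mechanism, not the STPN as published; the decay and gain constants and the Hebbian form of the update are illustrative assumptions.

```python
import math

def stpn_step(x, h_prev, F_prev, w, decay=0.9, gain=0.1):
    """One step of a toy short-term-plasticity unit.
    x: input vector; h_prev: previous output (self-recurrence);
    F_prev: fast per-synapse state; w: slow per-synapse weights."""
    pre = list(x) + [h_prev]                      # presynaptic activities
    eff = [wi + fi for wi, fi in zip(w, F_prev)]  # effective weight = slow + fast
    h = math.tanh(sum(p * e for p, e in zip(pre, eff)))
    # Fast state: exponential decay (forget) + Hebbian-style term (learn).
    F = [decay * fi + gain * h * p for fi, p in zip(F_prev, pre)]
    return h, F
```

Repeatedly presenting the same input strengthens F for the active synapses, so the unit's response to that input grows and then fades once the input stops: the learn-and-forget behavior the title refers to. Because F decays toward zero, inactive synapses are dynamically suppressed, which is the source of the energy argument in the paper.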
If you are doing or following research on, implementations of, or applications of reinforcement learning, you are welcome to join the "Zhiyuan Community - Reinforcement Learning - Discussion Group". There you can:
Learn cutting-edge knowledge and get your questions answered
Share your experience and show your talent
Participate in exclusive activities and meet research partners
Please scan the QR code below to join.