[Paper reading] ICML 2020: Can Autonomous Vehicles Identify, Recover From, and Adapt to Distribution Shifts?

2022-07-07 08:24:00

Column: January 6, 2022 7:18 PM
Last edited time: January 30, 2022 12:14 AM
Organization: Oxford
Status: Finished
Summary: RIP for out-of-distribution scenes: how to identify uncertainty and deal with it
Type: ICML
Year: 2020
Citations: 46

Reference: the authors' YouTube talk, which is easier to follow than the paper

1. Motivation

  • Related: out-of-distribution

    Deep neural networks are usually trained under the closed-world assumption, i.e. that the test data distribution is similar to the training data distribution. In real-world tasks this assumption does not hold, and performance drops significantly.

    Why are models brittle on OOD data?

    1. Neural network models may rely heavily on spurious cues and annotation artifacts in the training data, and OOD examples are unlikely to contain the same spurious patterns as in-distribution examples.
    2. The training data cannot cover every aspect of the distribution, so the model's ability to generalize is limited.

Abbreviations used below:
OOD = out-of-distribution, i.e. data that does not appear in the training set
RIP = robust imitative planning
AdaRIP = adaptive robust imitative planning

In other words, when a scene or input that was not seen during training shows up at run time, the variance across models is relatively large, but existing methods do not make use of this information. The paper does two main things:

  1. At test time, identify scenes that were not encountered in the training set, i.e. OOD scenes
  2. In an OOD scene, decide whether to trust the models and which one to trust, or switch to AdaRIP

Problem scenario

It has been shown many times [Sugiyama & Kawanabe, 2012; Amodei et al., 2016; Snoek et al., 2019] that when an ML model is exposed to a new environment (i.e. one that deviates from the distribution observed in the training set), its reliability drops sharply because it fails to generalize, which can lead to disastrous results.

For example: in this figure, individual models give $\mathbf y^1$ and $\mathbf y^3$ fairly good scores, but only because this scene (a large roundabout) did not appear in the training set and evaluating with a single model does not account for that. RIP, proposed in this paper, selects $\mathbf y^3$ in this scene, which is the better plan for following the path. In the figure, $\min_k$ is the smallest of the scores that $q_1, q_2, q_3$ assign to the same trajectory $y_i$.


RIP takes the disagreement between the models into account, so that overconfidence on OOD tasks does not lead to catastrophic trajectories.

There are other, more trick-like approaches, such as directly constraining the vehicle to stay within the lane lines, or perception-based and end-to-end (e2e) methods, but these are also vulnerable to spurious correlations: they can pick up non-causal features, which leads to confused actions in OOD scenes.

  • For more on non-causal features that lead to confusion in OOD scenes, see (de Haan et al., 2019).

The second half of the introduction cites existing baselines, such as LBC and R2P2, which cannot solve this out-of-distribution problem.


The conclusion of the paper has a concise version of the contributions. In short: formulate the uncertainty on out-of-distribution data, propose RIP to handle this uncertainty and make the model robust, and finally provide a benchmark with which everyone can test their own model's robustness and performance under OOD events.

  1. Epistemic uncertainty-aware planning: RIP. A simple quantification of epistemic uncertainty with deep ensembles enables detection of distribution shifts.

    Using Bayesian decision theory and robust control objectives, the paper shows how to act conservatively in unfamiliar situations, which often makes it possible to recover from distribution shifts (see Figure 1).

    Related: Monte Carlo dropout, aleatoric uncertainty, epistemic uncertainty

  2. Uncertainty-driven online adaptation: adaptive robust imitative planning (AdaRIP) uses RIP's epistemic uncertainty estimates to query expert feedback efficiently and adapt on the fly without compromising safety. AdaRIP can therefore be deployed in the real world: it can reason about what it does not know, and in those cases ask for human guidance to ensure current safety and improve future performance.

  3. Autonomous car novel-scene benchmark: a benchmark for evaluating the robustness of autonomous driving on a set of distribution-shift tasks. Evaluation metrics:

    1. Detecting OOD events, measured by the correlation between infractions and model uncertainty
    2. Recovering from distribution shifts, quantified by the percentage of successful maneuvers in novel scenes
    3. Adapting efficiently to OOD scenes, given online supervision
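The first two metrics can be sketched in a few lines. This is only an illustration: the notes do not say which correlation statistic the benchmark uses, so Pearson correlation is an assumption here, and the function names and the per-episode numbers are made up.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long sequences (assumed statistic)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in xs))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in ys))
    return cov / (std_x * std_y)

def success_rate(outcomes):
    """Percentage of episodes in which the maneuver succeeded."""
    return 100.0 * sum(outcomes) / len(outcomes)

# Made-up per-episode values: model uncertainty and number of infractions.
uncertainty = [0.1, 0.2, 0.9, 0.8]
infractions = [0, 0, 2, 1]
detection_score = pearson(uncertainty, infractions)   # metric 1
recovery_score = success_rate([True, False, True, True])  # metric 2
```

A high positive `detection_score` would mean that the model is uncertain exactly where it goes on to commit infractions, which is what a useful OOD detector should do.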

2. Method

First, a few assumptions and some notation:

  1. Expert data: $\mathcal{D}=\left\{\left(\mathbf{x}^{i}, \mathbf{y}^{i}\right)\right\}_{i=1}^{N}$, where $\mathbf x$ is a high-dimensional observation and $\mathbf y$ is a time-profiled expert trajectory, so the expert policy can be written as $\mathbf{y} \sim \pi_{\text{expert}}(\cdot \mid \mathbf{x})$

    The method uses imitation learning to approximate the unknown expert policy

  2. Inverse dynamics assumption: low-level control is done with a PID controller, so the planner only needs to produce the trajectory $\mathbf y=(s_1,\cdots,s_T)$, and the actions are output by the low-level controller: $a_{t}=\mathbb{I}\left(s_{t}, s_{t+1}\right), \forall t=1, \ldots, T-1$

  3. A global plan is assumed to be available, and ground-truth localization is assumed to be available
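To make assumption 2 concrete, here is a minimal sketch of an inverse dynamics model $\mathbb{I}(s_t, s_{t+1})$ built on a textbook PID controller. The scalar state, the gains and the `PID` / `inverse_dynamics` names are illustrative assumptions, not the controller used in the paper.

```python
class PID:
    """Textbook discrete PID controller acting on a scalar error."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def inverse_dynamics(pid, s_t, s_next):
    """a_t = I(s_t, s_{t+1}): the action that drives s_t towards s_{t+1}."""
    return pid.step(s_next - s_t)

# A planned trajectory y = (s_1, ..., s_T) yields T - 1 low-level actions.
plan = [0.0, 1.0, 3.0, 6.0]
pid = PID(kp=1.0, ki=0.0, kd=0.0, dt=0.1)  # P-only gains, chosen for readability
actions = [inverse_dynamics(pid, plan[t], plan[t + 1]) for t in range(len(plan) - 1)]
```

The point of the assumption is that the learned model never has to output throttle or steering itself; it only has to produce good waypoints.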

In short: the imitation-learning output is represented as a Gaussian distribution, so only the parameters of the distribution need to be learned; the final choice is then made through the aggregate and plan steps.

Figure 1 (from the authors' YouTube talk)

The aggregate step is the $\oplus$ operator above. Two variants are shown in the yellow box in Figure 1:

  • Worst-case model (RIP-WCM): score each plan by its worst model, then choose the plan whose worst score is highest
    Inspired by (Wald, 1939)

  • Model averaging (RIP-MA): sum the scores of all models and divide by the number of models

    Inspired by Bayesian decision theory (Barber, 2012)

  • The paper has one more variant, RIP-BCM, which the authors arrived at empirically → $\max_k \log q_k$
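The three aggregation operators can be sketched as follows. This is a minimal illustration under assumptions: `log_qs[k][i]` is taken to hold $\log q_k(\mathbf y^i \mid \mathbf x)$ for model $k$ and candidate plan $i$, the numbers are made up, and `aggregate` / `choose_plan` are hypothetical helper names.

```python
def aggregate(log_qs, mode):
    """Combine per-model log-likelihoods into one score per candidate plan."""
    per_plan = list(zip(*log_qs))  # per_plan[i] = scores of plan i under all models
    if mode == "WCM":              # worst-case model: min_k log q_k
        return [min(scores) for scores in per_plan]
    if mode == "MA":               # model averaging: mean over k of log q_k
        return [sum(scores) / len(scores) for scores in per_plan]
    if mode == "BCM":              # best-case model: max_k log q_k
        return [max(scores) for scores in per_plan]
    raise ValueError(f"unknown mode: {mode}")

def choose_plan(log_qs, mode):
    """Index of the plan with the highest aggregated score."""
    scores = aggregate(log_qs, mode)
    return max(range(len(scores)), key=scores.__getitem__)

# Three models (rows) scoring three candidate plans (columns), made-up values.
log_qs = [
    [-1.0, -2.0, -1.5],
    [-9.0, -2.5, -1.4],
    [-1.2, -2.2, -8.0],
]
```

With these numbers, plans 0 and 2 each look good to two models but very bad to the third, so RIP-WCM and RIP-MA both prefer plan 1, the one all models agree on, while RIP-BCM picks plan 0 because a single optimistic model is enough for it.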

(Figure: the aggregation operators written as formulas)

2.1 Expert data

The model gives a distribution over expert plans → usually a policy outputs an action directly, but shouldn't the values before the softmax also count as a distribution?

Bayesian Imitative Model

  • training via MLE

    Given the dataset $\mathcal{D}$, consider the posterior $p(\boldsymbol{\theta} \mid \mathcal{D})$ over the parameters of the density model $q(\mathbf y \mid \mathbf x; \boldsymbol\theta)$. That is, the model parameters are first learned from the dataset and then used as the prior that goes into the probabilistic imitative model

    $$
    \boldsymbol{\theta}_{\mathrm{MLE}}=\underset{\boldsymbol{\theta}}{\arg\max}\; \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}}\left[\log q(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta})\right] \tag{1}
    $$

  • Using the probabilistic imitative model $q(\mathbf y \mid \mathbf x; \boldsymbol\theta)$: unlike earlier work (Rhinehart et al., 2020; Chen et al., 2019), a prior distribution $p(\boldsymbol\theta)$ over the model parameters is used here instead

    Given the observation $\mathbf x$, the probability that the expert produces $\mathbf y$ is as follows → the authors say they found this empirically very effective

    $$
    \begin{aligned}
    q(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta}) &= \prod_{t=1}^{T} p\left(s_{t} \mid \mathbf{y}_{<t}, \mathbf{x}; \boldsymbol{\theta}\right) \\
    &= \prod_{t=1}^{T} \mathcal{N}\left(s_{t}; \mu\left(\mathbf{y}_{<t}, \mathbf{x}; \boldsymbol{\theta}\right), \Sigma\left(\mathbf{y}_{<t}, \mathbf{x}; \boldsymbol{\theta}\right)\right)
    \end{aligned} \tag{2}
    $$

    where $\mu(\cdot; \boldsymbol\theta)$ and $\Sigma(\cdot; \boldsymbol\theta)$ are two RNNs. Although each normal distribution is unimodal, the autoregression (i.e. future samples depend on the sequentially sampled past normal distributions) allows multi-modal distributions to be modeled
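The factorization in Eq. 2 can be sketched for scalar states as below. The `toy_mean` and `toy_std` functions are hypothetical stand-ins for the two RNNs $\mu$ and $\Sigma$ (a constant-velocity guess with fixed noise); only the structure of the sum over time steps is meant to match the formula.

```python
import math

def gaussian_log_pdf(s, mean, std):
    """log N(s; mean, std^2) for a scalar state."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (s - mean) ** 2 / (2 * std ** 2)

def trajectory_log_prob(y, x, mean_fn, std_fn):
    """log q(y | x) = sum_t log N(s_t; mu(y_<t, x), Sigma(y_<t, x)), as in Eq. 2."""
    total = 0.0
    for t in range(len(y)):
        history = y[:t]  # y_{<t}: the states generated before step t
        total += gaussian_log_pdf(y[t], mean_fn(history, x), std_fn(history, x))
    return total

def toy_mean(history, x):
    """Stand-in for the mean RNN: previous state plus a velocity read from x."""
    previous = history[-1] if history else 0.0
    return previous + x

def toy_std(history, x):
    """Stand-in for the covariance RNN: a fixed standard deviation."""
    return 0.5

on_model = trajectory_log_prob([1.0, 2.0, 3.0], 1.0, toy_mean, toy_std)
off_model = trajectory_log_prob([1.0, 3.0, 3.0], 1.0, toy_mean, toy_std)
```

A trajectory that follows the predicted means step by step (`on_model`) gets a higher log-likelihood than one that deviates (`off_model`). Maximizing this quantity over the expert dataset is the MLE objective of Eq. 1.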

The whole process

  1. Use deep imitative models as a simple approximation of the posterior $p(\boldsymbol\theta \mid \mathcal{D})$
  2. Consider an ensemble of $K$ components, using $\theta_k$ to denote the parameters of the $k$-th model $q_k$
  3. Train by maximum likelihood (see Eq. 1 and part (b) of the framework diagram)

2.2 Detecting OOD

The main idea is to measure the disagreement about each plan under the posterior $p(\boldsymbol{\theta} \mid \mathcal{D})$: the variance of $\log q(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta})$ shows how differently the models score the same trajectory

$$
u(\mathbf{y}) \triangleq \operatorname{Var}_{p(\boldsymbol{\theta} \mid \mathcal{D})}\left[\log q(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta})\right] \tag{3}
$$

Low variance indicates in-distribution; high variance indicates OOD

  • How should the threshold between high and low be determined?
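A minimal sketch of Eq. 3 with an ensemble standing in for the posterior. The threshold rule is an assumption, since the notes leave that question open: here it is simply a quantile of the uncertainties measured on in-distribution validation scenes, and `fit_threshold` and all numbers are hypothetical.

```python
from statistics import pvariance

def plan_uncertainty(log_qs_for_plan):
    """u(y): variance over the K ensemble members of log q_k(y | x)."""
    return pvariance(log_qs_for_plan)

def fit_threshold(in_dist_uncertainties, quantile=0.95):
    """Assumed rule: take a high quantile of u(y) on in-distribution validation data."""
    ordered = sorted(in_dist_uncertainties)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]

def is_ood(log_qs_for_plan, threshold):
    """Flag a plan as OOD when the ensemble disagrees more than the threshold."""
    return plan_uncertainty(log_qs_for_plan) > threshold

# Made-up validation uncertainties and two made-up plans scored by K = 3 models.
threshold = fit_threshold([0.1, 0.2, 0.5, 0.3, 0.4])
agreeing = [-2.0, -2.5, -2.2]     # models roughly agree
disagreeing = [-1.0, -9.0, -1.2]  # one model strongly disagrees
```

Here `is_ood(agreeing, threshold)` is false and `is_ood(disagreeing, threshold)` is true. Any other calibration rule could replace `fit_threshold` without touching the variance computation.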

2.3 Post Planning

Alternative planning strategies under epistemic uncertainty (the red part of the figure below)

(Figure: slide excerpted from the authors' YouTube talk)

First, formulate the problem of planning to a goal location $\mathcal G$ under epistemic uncertainty, i.e. under the posterior $p(\boldsymbol{\theta} \mid \mathcal{D})$ over model parameters, as the optimization of a generic objective (Barber, 2012), which the paper calls robust imitative planning (RIP)

$$
\begin{aligned}
\mathbf{y}_{\text{RIP}}^{\mathcal{G}} &\triangleq \underset{\mathbf{y}}{\arg\max} \overbrace{\underset{\boldsymbol{\theta} \in \text{supp}(p(\boldsymbol{\theta} \mid \mathcal{D}))}{\oplus}}^{\text{aggregation operator}} \log \underbrace{p(\mathbf{y} \mid \mathcal{G}, \mathbf{x}; \boldsymbol{\theta})}_{\text{imitation posterior}} \\
&= \underset{\mathbf{y}}{\arg\max} \underset{\boldsymbol{\theta} \in \text{supp}(p(\boldsymbol{\theta} \mid \mathcal{D}))}{\oplus} \log \underbrace{q(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta})}_{\text{imitation prior}} + \log \underbrace{p(\mathcal{G} \mid \mathbf{y})}_{\text{goal likelihood}}
\end{aligned} \tag{4}
$$

where $\oplus$ is an operator applied to the posterior $p(\boldsymbol{\theta} \mid \mathcal{D})$ (see the definitions above), and the goal likelihood is given, for example, by a Gaussian centered at the final goal location $s_T^{\mathcal{G}}$ with a pre-specified tolerance $\epsilon$: $p(\mathcal{G} \mid \mathbf{y})=\mathcal{N}\left(\mathbf{y}_{T}; \mathbf{y}_{T}^{\mathcal{G}}, \epsilon^{2} I\right)$

  • This is a bit like a Gaussian process, since there is a Gaussian distribution at every step over the whole horizon $T$

As in the plan $\mathbf y_{\text{RIP}}^{\mathcal{G}}$ of Eq. 4, what is maximized has two main parts: the imitation prior learned from expert data, and closeness to the final goal $\mathcal G$
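The objective of Eq. 4 can be sketched as a search over a finite set of candidate trajectories. That search strategy is an assumption made to keep the example short and is not necessarily how the paper optimizes the objective; the three toy models, the candidates and `rip_plan` are likewise illustrative.

```python
import math

def goal_log_likelihood(y, goal, eps):
    """log N(y_T; goal, eps^2 I) evaluated at the final state of trajectory y."""
    final = y[-1]
    dim = len(final)
    sq_dist = sum((a - b) ** 2 for a, b in zip(final, goal))
    return -0.5 * dim * math.log(2 * math.pi * eps ** 2) - sq_dist / (2 * eps ** 2)

def rip_plan(candidates, models, goal, eps, aggregate=min):
    """arg max over candidates of: aggregated imitation prior + goal likelihood."""
    def objective(y):
        prior = aggregate(model(y) for model in models)  # the (+) operator over q_k
        return prior + goal_log_likelihood(y, goal, eps)
    return max(candidates, key=objective)

def make_model(lane_center):
    """Toy log q_k(y | x): penalizes lateral offset from a model-specific lane center."""
    return lambda y: -sum((point[1] - lane_center) ** 2 for point in y)

models = [make_model(0.0), make_model(0.1), make_model(-0.1)]
straight = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]  # stays in lane, reaches the goal
swerve = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]    # leaves the lane, misses the goal
short = [(1.0, 0.0), (1.5, 0.0), (2.0, 0.0)]     # stays in lane, stops early
best = rip_plan([straight, swerve, short], models, goal=(3.0, 0.0), eps=1.0)
```

Passing `aggregate=min` gives the RIP-WCM behaviour and `aggregate=max` the RIP-BCM one. The goal term does not depend on $\boldsymbol\theta$, so it is added after aggregation.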


  • The original text describes the posterior $p(\boldsymbol{\theta} \mid \mathcal{D})$ as "our belief about the true expert model"

    Hmm, but aren't these just the model parameters trained on the dataset? Why is this a probability of the true expert model? Need to look at the code

    Does it mean how closely the corresponding model parameters reproduce the expert's plan? → Yes, that is it

As described, the deep imitative model (DIM) (Rhinehart et al., 2020) is the special case in which $\oplus$ selects a single $\theta_k$ from the posterior. The experiments partly confirm that it therefore simply fails in OOD scenes

2.4 AdaRIP

Can we do better? → Let the expert step in (the blue part of the figure)
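The control flow can be sketched as a single decision step. Everything here is a toy stand-in under stated assumptions: `plan`, `uncertainty`, `query_expert` and `adapt` are hypothetical callables, and "adaptation" is reduced to remembering the expert's answer rather than updating network parameters.

```python
def adarip_step(x, plan, uncertainty, query_expert, adapt, threshold, demos):
    """Plan as usual; if the plan is too uncertain, ask the expert and adapt online."""
    y = plan(x)
    if uncertainty(x, y) > threshold:
        y = query_expert(x)    # expert takes over for this scene
        demos.append((x, y))   # keep the new demonstration
        adapt(demos)           # update the model with it
        return y, True         # True: the expert was queried
    return y, False

# Toy stand-ins: the "model" is a lookup table of scenes it has been shown.
learned = {}

def plan(x):
    return learned.get(x, "keep_lane")

def uncertainty(x, y):
    return 0.0 if x in learned else 1.0

def query_expert(x):
    return "yield"

def adapt(demos):
    learned.update(demos)

demos = []
first = adarip_step("roundabout", plan, uncertainty, query_expert, adapt, 0.5, demos)
second = adarip_step("roundabout", plan, uncertainty, query_expert, adapt, 0.5, demos)
```

On the first visit the scene is unknown, so the expert is queried; on the second visit the adapted model handles it alone.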


3. Experimental results

  • On nuScenes: how is out-of-distribution data handled? Is the data split manually?

    In CARLA, at least, it is clear that the test scenarios were hand-picked

(Figure: results on the nuScenes dataset)


The paper proposes a benchmark, CARNOVEL

4. Conclusion

  1. Proposes RIP for detecting and recovering from distribution-shift scenes
  2. AdaRIP acts with knowledge of its own uncertainty and adapts its parameters based on online expert feedback
  3. Proposes a benchmark and baselines for this out-of-distribution problem

(Figure: excerpt from the original paper)


The authors' YouTube talk really is easier to understand than the paper, haha. The slides are very good and every step is clear. As for open questions, the authors also point out the method's shortcomings in the video (there are still quite a few, mainly around real-time operation on real vehicles):

  1. Real-time epistemic uncertainty estimators
  2. Real-time online planning
  3. Resistance to catastrophic forgetting in online adaptation → how to do incremental learning?
  • But I feel the first one, real-time estimation, should already be doable with the method in the paper, and yet the paper has no timing comparison anywhere. Why is it raised as an open question? In other words, the experiments section does not report real-time performance

I will post notes on the code later, once I have read it

  • During the group meeting, a colleague pointed out that if $q_1, q_2, q_3$ all come from the same dataset and the same training settings, the trained models should be similar, so there would not be the diversity of choices shown in Figure 1
  • Also, the theme of the paper is supposed to be adaptation... but it does not really do that... the adaptation is done by hand. In a colleague's words: it turns out to be just adding an embedding layer.

