
NeRF: Neural Radiance Fields (ECCV 2020)

2022-06-09 14:04:00 tzc_ fly


NeRF background: rendering and viewing directions

Rendering can be viewed as projecting a 3D scene onto a 2D pixel image. For the same 3D scene, different viewpoints produce different rendered results.

When a camera or eye fixed at some position observes a 3D scene, a point in the scene can be described by its absolute position $(x, y, z)$. To obtain a sensible rendering (the image the camera forms at that position), we also need the relationship between that point and the camera ray, for example the viewing direction $(\theta, \varphi)$.
[Figure: a fixed camera, its ray (solid black), and a point G]
For example, in the figure above the eye (or camera) is fixed, the ray is the solid black line, and the relationship between point G and the camera ray is $(\theta, \phi)$.


The rendered result of a point in the scene depends on the position from which the camera views the scene, and on the relationship between that point and the camera ray.


Abstract

We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene with a fully connected deep network whose input is a single continuous 5D coordinate (spatial location $(x, y, z)$ and viewing direction $(\theta, \varphi)$) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classical volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize the scene representation is a set of images with known camera extrinsics and intrinsics. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we encourage readers to watch our supplementary video for convincing comparisons.

1. Introduction

[Figure 1]

  • Figure 1: We present a method that optimizes a continuous 5D neural radiance field representation (volume density and view-dependent color at any continuous location) from a set of input images. We use volume rendering techniques to accumulate samples of this scene representation along rays in order to render the scene from any viewpoint. Here we visualize the set of 100 input views of the synthetic Drums scene captured at random positions on a surrounding hemisphere, and show two novel views rendered from the optimized NeRF representation.

In this work, we address the long-standing problem of view synthesis in a new way: we directly optimize the parameters of a continuous 5D scene representation to minimize the error between the rendered images and the input images.

We represent a static scene as a continuous 5D function that outputs, for every point $(x, y, z)$ in space and every viewing direction $(\theta, \varphi)$, the radiance (i.e., color) emitted in that direction, together with a volume density at each point (which acts like a differential opacity controlling how much radiance is accumulated by a ray passing through $(x, y, z)$). Our method optimizes a deep fully connected neural network without any convolutional layers (an MLP) to represent this function, regressing from a single 5D coordinate $(x, y, z, \theta, \varphi)$ to a volume density and a view-dependent RGB color.

To render a NeRF (neural radiance field) from a particular viewpoint, we:

  • 1. March camera rays through the scene to generate a sampled set of 3D points;
  • 2. Use these points and their corresponding 2D viewing directions as input to the neural network to produce an output set of colors and densities;
  • 3. Use classical volume rendering techniques to accumulate (composite) these colors and densities into a 2D image.

Because this process is naturally differentiable, we can use gradient descent to optimize the model by minimizing the error between each observed image in the input set and the corresponding view rendered from our representation. Minimizing this error across multiple views encourages the network to predict a coherent model of the scene, assigning high volume densities and accurate colors to the locations that contain the true scene content. Figure 2 illustrates the overall pipeline; a minimal sketch of the ray-generation and point-sampling step is given below.
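
The following is a minimal sketch of step 1, assuming a simple pinhole camera with focal length `focal` and a 4×4 camera-to-world pose `c2w` (the convention that the camera looks along its $-z$ axis is also an assumption of this sketch). The names `get_rays` and `sample_points` are illustrative, not taken from the paper or any released code.

```python
import torch

def get_rays(H, W, focal, c2w):
    """Per-pixel ray origins and directions in world coordinates (pinhole camera)."""
    j, i = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    # Ray directions in the camera frame; the camera looks along -z.
    dirs = torch.stack([(i - 0.5 * W) / focal,
                        -(j - 0.5 * H) / focal,
                        -torch.ones_like(i)], dim=-1)
    rays_d = dirs @ c2w[:3, :3].T             # rotate into world coordinates
    rays_o = c2w[:3, 3].expand(rays_d.shape)  # all rays start at the camera origin
    return rays_o, rays_d

def sample_points(rays_o, rays_d, t_near, t_far, n_samples):
    """Stratified sampling: one random depth inside each of n_samples even bins."""
    edges = torch.linspace(t_near, t_far, n_samples + 1)
    lower, upper = edges[:-1], edges[1:]
    t = lower + (upper - lower) * torch.rand(*rays_o.shape[:-1], n_samples)
    pts = rays_o[..., None, :] + t[..., :, None] * rays_d[..., None, :]
    return pts, t                              # pts: (..., n_samples, 3)
```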

[Figure 2]

  • Figure 2: Overview of the NeRF scene representation and differentiable rendering procedure. We sample 5D coordinates (location and viewing direction; the black dots in panel (a)) along camera rays, feed them into $F_{\Theta}$ to produce a color and volume density for each sample (panel (b)), and use volume rendering techniques to composite these values into a 2D image (panel (c)). This rendering function is differentiable, so we can optimize the scene representation by minimizing the error between the synthesized image and the ground-truth observed image (panel (d)).

We find that the basic implementation of optimizing a neural radiance field representation does not converge to a sufficiently high-resolution representation for complex scenes and is inefficient in the number of samples required per camera ray. We address these issues by transforming the input 5D coordinates with a positional encoding, which enables the MLP to represent higher-frequency functions, and by proposing a hierarchical sampling procedure that reduces the number of queries required to adequately sample this high-frequency scene representation (i.e., fewer network evaluations are needed per ray).

Our method inherits the advantages of volumetric representations: both can represent complex real-world geometry and appearance, and both are well suited to gradient-based optimization using projected images. Crucially, our method overcomes the prohibitive storage cost of discretized voxel grids when modeling complex scenes at high resolution. In summary, our technical contributions are:

  • An approach for representing continuous scenes with complex geometry and materials as 5D neural radiance fields, parameterized as basic MLP networks.
  • A differentiable rendering procedure based on classical volume rendering techniques, which we use to optimize the scene representation from standard RGB images. This includes a hierarchical sampling strategy that allocates more of the MLP's capacity to regions of space with visible scene content.
  • A positional encoding that maps each input 5D coordinate into a higher-dimensional space, which enables us to successfully optimize neural radiance fields to represent high-frequency scene content.

We demonstrate that our neural radiance field method quantitatively and qualitatively outperforms state-of-the-art view synthesis methods, including works that fit neural 3D representations to a scene and works that train deep convolutional networks to predict sampled volumetric representations. To our knowledge, this paper is the first to present a continuous neural scene representation that can render high-resolution photorealistic novel views of real scenes from naturally captured RGB images.

2. Related Work

A promising recent direction in computer vision is encoding scenes with an MLP that maps directly from a 3D spatial location to an implicit representation of the scene's shape. However, these methods have so far been unable to reproduce real scenes with complex geometry at the same fidelity as techniques that represent scenes with discrete representations such as triangle meshes or voxel grids. In this section, we review these two lines of work and contrast them with our approach, which enhances the capability of neural scene representations to produce state-of-the-art results for rendering complex real scenes.

2.1. Neural 3D shape representations

Recent work has investigated optimizing deep networks that map $xyz$ coordinates to signed distance functions as implicit shape representations. However, these models are limited by their requirement of access to ground-truth 3D geometry, typically obtained from synthetic datasets such as ShapeNet (i.e., they require expensive 3D supervision). Subsequent work relaxes this requirement for ground-truth 3D shapes by formulating differentiable rendering functions that allow neural implicit shape representations to be optimized using 2D images alone.

Although these techniques can represent geometry, they have so far been limited to simple shapes with low geometric complexity, and their renderings are overly smooth. We show that the alternative strategy of optimizing networks to encode 5D radiance fields (3D volumes with 2D view-dependent appearance) can represent higher-resolution geometry and appearance, enabling photorealistic novel views of complex scenes to be rendered.

2.2. View synthesis and image-based rendering

Given a dense sampling of views, photorealistic novel views can be reconstructed with simple light field sample interpolation techniques. For novel view synthesis with sparser view sampling, the computer vision community has made significant progress by predicting traditional geometry and appearance representations from observed images. However, existing strategies require a template mesh with fixed topology to be provided as an initialization before optimization, which is typically unavailable for unconstrained real-world scenes.

Another class of methods uses volumetric representations to synthesize high-quality photorealistic views from a set of input RGB images. Volumetric approaches can realistically represent complex shapes and materials, are well suited to gradient-based optimization, and tend to produce fewer visually distracting artifacts than mesh-based methods. Early volumetric methods used observed images to directly color voxel grids; more recently, several methods train deep networks on large datasets of many scenes. Although these volumetric techniques have achieved impressive results for novel view synthesis, their ability to scale to higher-resolution images is fundamentally limited by poor time and space complexity due to their discrete sampling: rendering higher-resolution images requires a finer sampling of 3D space. We sidestep this problem by encoding a continuous volume within the parameters of a deep fully connected neural network, which not only produces higher-quality renderings than prior volumetric methods but also requires only a fraction of the storage cost of those sampled volumetric representations.

3. Neural Radiance Field Scene Representation

We represent a continuous scene as a 5D vector-valued function whose input is a 3D location $\textbf{x}=(x,y,z)$ and a 2D viewing direction $(\theta,\phi)$, and whose output is an emitted color $\textbf{c}=(r,g,b)$ and a volume density $\sigma$. In practice, we express the viewing direction as a 3D unit vector $\textbf{d}$. We approximate this continuous 5D scene representation with an MLP network $F_{\Theta}:(\textbf{x},\textbf{d})\rightarrow(\textbf{c},\sigma)$ and optimize its weights $\Theta$ to map each input 5D coordinate to its corresponding volume density and directional emitted color.

We encourage the learned scene representation to be multiview-consistent by restricting the network to predict the volume density $\sigma$ as a function of only the location $\textbf{x}$, while allowing the RGB color $\textbf{c}$ to be predicted as a function of both location and viewing direction. To accomplish this, the MLP $F_{\Theta}$ first processes the input 3D coordinate $\textbf{x}$ with 8 fully connected layers and outputs $\sigma$ and a 256-dimensional feature vector. This feature vector is then concatenated with the viewing direction of the camera ray and passed to one additional fully connected layer that outputs the view-dependent RGB color.
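
To make this architecture concrete, below is a minimal PyTorch sketch of such an MLP. It follows the widths given in the text (8 fully connected layers of width 256, a 256-dimensional feature vector, and one additional branch producing the view-dependent RGB); the input sizes assume the positional encodings of Sec. 5.1 ($2 \cdot L \cdot 3$ with $L=10$ for positions and $L=4$ for directions), and the skip connection used in the authors' released architecture is omitted, so treat this as an illustration rather than the exact network.

```python
import torch
import torch.nn as nn

class NeRFMLP(nn.Module):
    def __init__(self, pos_dim=60, dir_dim=24, width=256):
        super().__init__()
        layers, in_dim = [], pos_dim
        for _ in range(8):                      # 8 fully connected layers on x
            layers += [nn.Linear(in_dim, width), nn.ReLU()]
            in_dim = width
        self.trunk = nn.Sequential(*layers)
        self.sigma_head = nn.Linear(width, 1)   # volume density (view-independent)
        self.feature = nn.Linear(width, width)  # 256-d feature vector
        self.rgb_head = nn.Sequential(          # view-dependent color branch
            nn.Linear(width + dir_dim, width // 2), nn.ReLU(),
            nn.Linear(width // 2, 3), nn.Sigmoid())

    def forward(self, x_enc, d_enc):
        h = self.trunk(x_enc)
        sigma = torch.relu(self.sigma_head(h))  # keep density non-negative
        feat = self.feature(h)
        rgb = self.rgb_head(torch.cat([feat, d_enc], dim=-1))
        return rgb, sigma
```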

See Figure 3 for an example of how our method uses the input viewing direction to represent non-Lambertian effects. As shown in Figure 4, a model trained without view dependence (with only $\textbf{x}$ as input) has difficulty representing specular reflections.

[Figure 3]

  • Figure 3: Visualization of view-dependent emitted radiance. Our neural radiance field representation outputs RGB color as a 5D function of both spatial location $\textbf{x}$ and viewing direction $\textbf{d}$. Here we visualize example directional color distributions for two spatial locations in our neural representation of the Ship scene. In (a) and (b), we show the appearance of two fixed 3D points from two different camera positions: one on the side of the ship (orange inset) and one on the surface of the water (blue inset). Our method predicts the changing specular appearance of these two 3D points, and (c) shows how this behavior generalizes continuously across the whole hemisphere of viewing directions.

[Figure 4]

  • Figure 4: Here we visualize how our full model benefits from representing view-dependent emitted radiance and from passing the input coordinates through a high-frequency positional encoding. Removing view dependence prevents the model from recreating the specular reflection on the bulldozer tread. Removing the positional encoding drastically decreases the model's ability to represent high-frequency geometry and texture (i.e., fine detail), resulting in an oversmoothed appearance.

4. Volume Rendering with Radiance Fields

Our 5D neural radiance field represents a scene as the volume density and directional emitted radiance at any point in space. We render the color of any ray passing through the scene using principles from classical volume rendering. The volume density $\sigma(\textbf{x})$ can be interpreted as the probability of a ray terminating at location $\textbf{x}$. The expected color $C(\textbf{r})$ of a camera ray $\textbf{r}(t)=\textbf{o}+t\textbf{d}$ with near and far bounds $t_n$ and $t_f$ is:

$$C(\textbf{r})=\int_{t_n}^{t_f} T(t)\,\sigma(\textbf{r}(t))\,\textbf{c}(\textbf{r}(t),\textbf{d})\,dt,\quad \text{where } T(t)=\exp\!\left(-\int_{t_n}^{t}\sigma(\textbf{r}(s))\,ds\right)$$

The function $T(t)$ denotes the accumulated transmittance along the ray from $t_n$ to $t$, i.e., the probability that the ray travels from $t_n$ to $t$ without hitting any other particle. This attenuation term $T$ prevents content behind an object from contributing excessively to the rendering.

We numerically estimate this continuous integral: we partition $[t_n, t_f]$ into $N$ evenly spaced bins and draw one sample uniformly at random from within each bin (stratified sampling). A sketch of the resulting compositing step is given below.
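
In the paper, this quadrature amounts to alpha compositing: sample $i$ contributes its color weighted by its opacity $1-\exp(-\sigma_i\delta_i)$ (with $\delta_i$ the distance to the next sample) and by the transmittance accumulated in front of it. Below is a sketch of that compositing step with illustrative names and shapes; the per-sample weights it returns are also what the hierarchical sampling of Sec. 5.2 reuses.

```python
import torch

def composite(rgb, sigma, t_vals):
    """rgb: (..., N, 3), sigma: (..., N, 1), t_vals: (..., N) sample depths."""
    delta = t_vals[..., 1:] - t_vals[..., :-1]                  # distances between samples
    delta = torch.cat([delta, torch.full_like(delta[..., :1], 1e10)], dim=-1)
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * delta)         # per-sample opacity
    # Transmittance T_i: probability the ray reaches sample i unoccluded.
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)
    trans = torch.cat([torch.ones_like(trans[..., :1]), trans[..., :-1]], dim=-1)
    weights = alpha * trans                                     # contribution of each sample
    color = (weights[..., None] * rgb).sum(dim=-2)              # expected color C(r)
    return color, weights
```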

5. Optimizing a Neural Radiance Field

The previous sections described the core components needed to model a scene as a neural radiance field and render novel views from this representation. However, we observe that these components alone are not sufficient to achieve state-of-the-art quality. We introduce two improvements to enable the representation of high-resolution complex scenes. The first is a positional encoding of the input coordinates that helps the MLP represent high-frequency functions, and the second is a hierarchical sampling procedure that allows us to efficiently sample this high-frequency representation.

5.1. Positional encoding

Although neural networks are universal function approximators, we found that having the network $F_{\Theta}$ operate directly on the $xyz\theta\phi$ input coordinates results in renderings that perform poorly at representing high-frequency variation in color and geometry. We therefore map the inputs to a higher-dimensional space using high-frequency functions before passing them to the network, which enables better fitting of data containing high-frequency variation.


An aside: recall the dendrite network, whose good performance may stem from using the Hadamard product to forcibly encode the input at high frequency, allowing it to indirectly fit high-frequency functions.

Of course, the dendrite network contains no nonlinear mapping, whereas the high-frequency functions here do approximate a nonlinearity, and a learnable one at that.


We leverage these findings in the context of neural scene representations, reformulating $F_{\Theta}$ as a composition of two functions $F_{\Theta}=F_{\Theta}'\circ\gamma$, where $\gamma$ is a mapping from $\mathbb{R}$ into the higher-dimensional space $\mathbb{R}^{2L}$ and $F_{\Theta}'$ is still simply an MLP. Formally, the positional encoding function is:

$$\gamma(p)=\left(\sin(2^{0}\pi p),\cos(2^{0}\pi p),\ldots,\sin(2^{L-1}\pi p),\cos(2^{L-1}\pi p)\right)$$

The function $\gamma$ is applied separately to each of the three coordinate values of $\textbf{x}$ (which are normalized to lie in $[-1,1]$) and to the viewing direction $\textbf{d}$. We set $L=10$ for $\gamma(\textbf{x})$ and $L=4$ for $\gamma(\textbf{d})$.
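
A minimal sketch of $\gamma$, applied component-wise to inputs assumed to be normalized to $[-1, 1]$; the function name is illustrative.

```python
import torch

def positional_encoding(p, L):
    """p: (..., D) coordinates in [-1, 1]; returns (..., 2 * L * D) encoded features."""
    freqs = (2.0 ** torch.arange(L, dtype=p.dtype, device=p.device)) * torch.pi
    angles = p[..., None] * freqs                     # (..., D, L)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)

# Example: positions with L = 10 give 60-d features, directions with L = 4 give 24-d.
x_enc = positional_encoding(torch.rand(1024, 3) * 2 - 1, L=10)   # shape (1024, 60)
```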

A similar mapping is used in the popular Transformer architecture, where it is referred to as positional encoding. However, Transformers use it for a different goal: providing the discrete positions of tokens in a sequence as input to an architecture that contains no notion of order. In contrast, we use these functions to map continuous input coordinates into a higher-dimensional space so that our MLP can more easily approximate higher-frequency functions.

5.2. Hierarchical volume sampling

Our rendering strategy of densely evaluating the neural radiance field network at $N$ query points along each camera ray is inefficient: free space and occluded regions that do not contribute to the rendered image are still sampled repeatedly. We draw inspiration from early work on volume rendering and propose a hierarchical representation that increases rendering efficiency by allocating samples proportionally to their expected effect on the final rendering.

Instead of using a single network to represent the scene, we simultaneously optimize two networks: one "coarse" and one "fine". We first sample a set of $N_c$ locations using basic stratified sampling and evaluate the "coarse" network at these locations. Given the output of this "coarse" network, we then produce a more informed sampling of points along each ray, biased towards the relevant parts of the volume. To do this, we use the composited coarse color $\widehat{C}_{c}(\textbf{r})$ to identify regions of high volume density along the ray, sample $N_f$ additional locations there as a second set of samples, evaluate the "fine" network at the union of the first and second sets of samples, and compute the final rendered color $\widehat{C}_{f}(\textbf{r})$ of the ray. A sketch of this resampling step follows.
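
The sketch below shows one way to implement that resampling, assuming the weights produced along each ray by the coarse compositing step (one per interval between adjacent coarse samples) are normalized into a piecewise-constant PDF from which $N_f$ new depths are drawn by inverse transform sampling; the name and exact shapes are illustrative.

```python
import torch

def sample_pdf(bins, weights, n_fine):
    """bins: (..., N_c) coarse sample depths; weights: (..., N_c - 1) interval weights."""
    pdf = weights / (weights.sum(dim=-1, keepdim=True) + 1e-8)
    cdf = torch.cumsum(pdf, dim=-1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], dim=-1)    # (..., N_c)
    u = torch.rand(*cdf.shape[:-1], n_fine)                           # uniform draws
    idx = torch.searchsorted(cdf, u, right=True).clamp(1, cdf.shape[-1] - 1)
    cdf_lo, cdf_hi = torch.gather(cdf, -1, idx - 1), torch.gather(cdf, -1, idx)
    bin_lo, bin_hi = torch.gather(bins, -1, idx - 1), torch.gather(bins, -1, idx)
    frac = (u - cdf_lo) / (cdf_hi - cdf_lo + 1e-8)                    # position inside the bin
    return bin_lo + frac * (bin_hi - bin_lo)                          # (..., n_fine) new depths
```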

5.3. Implementation details

We optimize a separate neural continuous volume representation network for each scene. This requires only a dataset of captured RGB images of the scene, the corresponding camera extrinsics and intrinsics, and the scene bounds. At each optimization iteration, we randomly sample a batch of camera rays from the set of all pixels in the dataset, and then follow the hierarchical sampling of Sec. 5.2 to query $N_c$ samples from the coarse network and $N_c+N_f$ samples from the fine network. We then use the volume rendering procedure to render the color of each ray from both sets of samples. Our loss is simply the total squared error between the rendered and true pixel colors, for both the coarse and fine renderings:

$$L=\sum_{\textbf{r}\in R}\left[\left\lVert\widehat{C}_{c}(\textbf{r})-C(\textbf{r})\right\rVert_{2}^{2}+\left\lVert\widehat{C}_{f}(\textbf{r})-C(\textbf{r})\right\rVert_{2}^{2}\right]$$

where $R$ is the set of rays in the batch, and $C(\textbf{r})$, $\widehat{C}_{c}(\textbf{r})$, $\widehat{C}_{f}(\textbf{r})$ are the ground-truth, coarse-predicted, and fine-predicted RGB colors for ray $\textbf{r}$, respectively.
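
Below is a minimal sketch of this loss and one optimization step. `render_rays` is a hypothetical stand-in for the full pipeline (ray sampling, MLP queries, compositing) sketched in the previous sections, and the optimizer settings are illustrative choices rather than values stated in this summary.

```python
import torch

def nerf_loss(c_coarse, c_fine, c_gt):
    """Squared color error of coarse and fine renderings vs. ground truth, averaged over the ray batch."""
    return (((c_coarse - c_gt) ** 2).sum(-1) + ((c_fine - c_gt) ** 2).sum(-1)).mean()

# optimizer = torch.optim.Adam(list(coarse_mlp.parameters()) +
#                              list(fine_mlp.parameters()), lr=5e-4)
# for rays_o, rays_d, c_gt in ray_batches:        # random batches of camera rays
#     c_coarse, c_fine = render_rays(rays_o, rays_d, coarse_mlp, fine_mlp)
#     loss = nerf_loss(c_coarse, c_fine, c_gt)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```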

6. Results

6.1. Data sets

Synthetic renderings of objects
We first show experimental results on two datasets of synthetic renderings of objects ("Diffuse Synthetic 360°" and "Realistic Synthetic 360°"). The DeepVoxels dataset contains four Lambertian objects with simple geometry; each object is rendered at 512×512 pixels from viewpoints sampled on the upper hemisphere (479 views as input and 1000 for testing). We additionally generate our own dataset containing path-traced images of eight objects that exhibit complicated geometry and realistic non-Lambertian materials. Six are rendered from viewpoints sampled on the upper hemisphere and two from viewpoints sampled on the full sphere. We render 100 views of each scene as input and 200 for testing, all at 800×800 pixels.

Real images of complex scenes
We show results on complex real-world scenes captured with roughly forward-facing images ("Real Forward-Facing"). This dataset consists of eight scenes captured with a handheld cellphone, each with 20 to 62 images, of which 1/8 are held out as the test set. All images are 1008×756 pixels.

6.2. Experimental results

[Figure 5]

  • Figure 5: Comparisons on test-set views of scenes from our synthetic dataset.

[Figure 6]

  • Figure 6: Comparisons on test-set views of real-world scenes.

Copyright notice
This article was written by [tzc_ fly]. Please include a link to the original when reposting:
https://yzsam.com/2022/160/202206091244472159.html