
HDR image reconstruction from a single exposure using deep CNNs: reading notes

2022-07-06 22:14:00 【Cassia tora】


The paper was published in ACM Transactions on Graphics (TOG) in 2017.

1 Abstract

Problem:
When a low dynamic range (LDR) device captures a high dynamic range (HDR) scene, the image is prone to overexposure. Overexposed regions lose texture detail, which poses challenges for image viewing and for computer vision tasks.

Current state:
Most existing HDR image reconstruction methods require a set of LDR images taken at different exposures as input.

Method of this paper:
The paper addresses the problem of estimating the missing information in the saturated regions of an image, so that a high-quality HDR image can be reconstructed from a single-exposure LDR image.
(1) The LDR input image is transformed by an encoder network into a compact feature representation of the spatial context of the image.
(2) The encoded image is fed to an HDR decoder network, which reconstructs the HDR image.
The network is equipped with skip connections that transfer data between the LDR encoder and HDR decoder domains, so that high-resolution image details can be fully used in the reconstruction.

2 HDR Reconstruction Model

2.1 Problem formulation and constraints

The final HDR reconstructed pixel $\hat{H}_{i,c}$ is computed by pixel-wise blending with the blend value $\alpha_i$:
$$\hat{H}_{i,c} = (1 - \alpha_i)\, f^{-1}(D_{i,c}) + \alpha_i \exp(\hat{y}_{i,c})$$
$i$: spatial index
$c$: color channel
$D_{i,c}$: input LDR image pixel
$\hat{y}_{i,c}$: CNN output (in the log domain)
$f^{-1}$: inverse camera curve, which transforms the input to the linear domain
The blend value is a linear ramp that starts at pixel values equal to the threshold $\tau$ and ends at the maximum pixel value, so the input image is left unchanged in unsaturated regions:
$$\alpha_i = \frac{\max\left(0,\ \max_c(D_{i,c}) - \tau\right)}{1 - \tau}$$
The paper uses $\tau = 0.95$, with the input defined in the range $[0, 1]$.
The linear blending prevents banding artifacts between predicted highlights and their surroundings ($\alpha$ is also used to define the loss function in training, see Section 2.4).
The components of the blending are illustrated in the figure below. Because the prediction focuses on reconstruction around the saturated regions, artifacts may appear in other image areas of the raw prediction (panel (b)):
[Figure: components of the blending operation]
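The blending described in this section can be sketched per pixel as follows. This is only a minimal illustration in plain Python, not the authors' code: the function names are made up for this note, and the camera linearization $f^{-1}(D) = D^{\gamma}$ with $\gamma = 2$ is an assumption borrowed from the skip-connection discussion in Section 2.3.

```python
import math


def blend_value(ldr_pixel, tau=0.95):
    """Linear ramp alpha for one pixel; ldr_pixel is an (r, g, b) tuple in [0, 1]."""
    return max(0.0, max(ldr_pixel) - tau) / (1.0 - tau)


def inverse_camera_curve(d, gamma=2.0):
    """Assumed linearization f^{-1}(D) = D**gamma (illustrative choice only)."""
    return d ** gamma


def reconstruct_pixel(ldr_pixel, log_prediction, tau=0.95):
    """Blend the linearized input with exp(CNN output), channel by channel."""
    alpha = blend_value(ldr_pixel, tau)
    return tuple(
        (1.0 - alpha) * inverse_camera_curve(d) + alpha * math.exp(y)
        for d, y in zip(ldr_pixel, log_prediction)
    )
```

For a pixel whose brightest channel is below $\tau$, `blend_value` returns 0 and the network prediction is ignored entirely; for a fully saturated pixel it returns 1 and only the prediction is used.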

2.2 Hybrid dynamic range autoencoder

The complete autoencoder architecture is shown in the figure:
[Figure: autoencoder architecture]

LDR encoder: applies convolutions and max pooling to the input LDR image, producing a low-dimensional latent image representation of size $W / 32 \times H / 32 \times 512$ ($W$ and $H$ are the image width and height).
HDR decoder: upsamples with $4 \times 4$ deconvolutions initialized to perform bilinear upsampling, joins each upsampled result with the corresponding encoder layer through a skip connection (to better restore image details), and convolves the fused result. Repeating these steps finally reconstructs the full-resolution HDR image.
(1) Because the goal is to reconstruct images larger than those used in training, the latent representation is not a fully connected layer but a low-resolution multi-channel image. Such a fully convolutional network (FCN) can predict at any resolution that is a multiple of the autoencoder's downscaling factor.
(2) The encoder operates directly on the LDR input image, while the decoder is responsible for producing HDR data, so the decoder works in the log domain. This is achieved through the loss function, which compares the network output with the logarithm of the ground-truth HDR image.
(3) All layers of the network use the ReLU activation function, and a batch normalization layer follows each decoder layer.
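The resolution constraint in point (1) can be made concrete with a small sketch. The helper names are hypothetical, and rounding the input size up to the next multiple of 32 is just one possible way to satisfy the constraint, not something the notes attribute to the paper.

```python
def latent_shape(width, height, factor=32, channels=512):
    """Shape of the latent image for an input of width x height pixels."""
    if width % factor or height % factor:
        raise ValueError("input size must be a multiple of the downscaling factor")
    return (width // factor, height // factor, channels)


def round_up_to_multiple(size, factor=32):
    """Smallest multiple of factor that is >= size (hypothetical helper)."""
    return -(-size // factor) * factor
```

For example, a 320×320 training crop maps to a 10×10×512 latent image, and an input that is 1000 pixels wide would have to be brought to 1024 pixels before it fits the network.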

2.3 Domain transformation and skip-connections

Successive convolution and pooling of the input image discards much of the high-resolution information available in the early layers. The decoder could use this information to reconstruct high-frequency details in the saturated regions, so skip connections are introduced to transfer data between high-level and low-level features in the encoder and decoder.
The autoencoder uses skip connections to pass each encoder layer to the corresponding decoder layer. Because the encoder and decoder process different types of data (see Section 2.2), each connection includes a domain transformation, described by an inverse camera curve followed by a logarithm, that maps LDR display values to a logarithmic HDR representation. The paper uses the gamma function $f^{-1}(x) = x^{\gamma}$ with $\gamma = 2$ for the linearization in the skip connections.
The two layers are concatenated along the feature dimension, that is, two $W \times H \times K$ layers are joined into one $W \times H \times 2K$ layer. The decoder then linearly combines these features, which is equivalent to a $1 \times 1$ convolution that reduces the $2K$ features to $K$. The complete LDR-to-HDR skip connection is defined as:
$$\tilde{h}_i^D = \sigma\left(\left[h_i^D,\ \log\left(f^{-1}(h_i^E) + \epsilon\right)\right] W + b\right)$$
$h_i^E, h_i^D$: slices across all feature channels $k \in \{1, \dots, K\}$ of the encoder and decoder layer tensors $y^E, y^D \in \mathbb{R}^{W \times H \times K}$
$\tilde{h}_i^D$: decoder feature vector with the information from the skip-connected vector $h_i^E$ fused in
$b$: bias of the feature fusion
$\sigma$: activation function; the paper uses ReLU
(A small constant $\epsilon$ is used in the domain transformation to avoid taking the logarithm of zero.)
Given $K$ features, $h^E$ and $h^D$ are $1 \times K$ vectors and $W$ is a $2K \times K$ weight matrix that maps the $2K$ concatenated features to $K$ dimensions. It is initialized to perform an addition of the encoder and decoder features by setting the weights to
$$W = \begin{bmatrix} I_K \\ I_K \end{bmatrix}, \qquad b = 0,$$
where $I_K$ is the $K \times K$ identity matrix.
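A minimal sketch of the fusion at a single spatial position, under stated assumptions: features are plain Python lists treated as $1 \times K$ row vectors, the value $\epsilon = 1/255$ is an assumed default (the notes give no value), and the function names are invented for this illustration.

```python
import math


def identity_pair_weights(k):
    """2K x K weight matrix made of two stacked K x K identity matrices."""
    eye = [[1.0 if r == c else 0.0 for c in range(k)] for r in range(k)]
    return eye + [row[:] for row in eye]


def fuse_skip(h_dec, h_enc, weights, bias, gamma=2.0, eps=1.0 / 255.0):
    """Fuse decoder features with domain-transformed encoder features."""
    # Domain transformation: linearize with x**gamma, then move to the log domain.
    transformed = [math.log(x ** gamma + eps) for x in h_enc]
    concat = list(h_dec) + transformed  # 1 x 2K row vector
    k = len(h_dec)
    fused = []
    for j in range(k):
        s = bias[j] + sum(concat[r] * weights[r][j] for r in range(2 * k))
        fused.append(max(0.0, s))  # ReLU activation
    return fused
```

With `identity_pair_weights(k)` and a zero bias, each output feature is simply the ReLU of the decoder feature plus the transformed encoder feature, which is the "addition" initialization described above.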
Adding skip connections gives a better reconstruction of image texture details, as shown in the figure:

2.4 HDR loss function

Direct loss $L(\hat{y}, H)$:
In this system the HDR decoder is designed to operate in the log domain. Therefore, given the predicted log HDR image $\hat{y}$ and the linear ground-truth image $H$, the direct loss is formulated on the log HDR values:
$$L(\hat{y}, H) = \frac{1}{3N} \sum_{i,c} \left| \alpha_i \left( \hat{y}_{i,c} - \log(H_{i,c} + \epsilon) \right) \right|^2$$
$N$: number of pixels
$H_{i,c}$: linear ground-truth pixel value, $H_{i,c} \in \mathbb{R}^+$
$\epsilon$: small constant that removes the singularity at zero pixel values
The Weber-Fechner law implies a logarithmic relationship between physical luminance and perceived brightness, so a loss formulated in the log domain distributes perceived errors roughly uniformly over the whole luminance range.
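As a rough illustration, the direct loss can be written over a flat list of pixels. This sketch assumes that the blend value $\alpha_i$ from Section 2.1 weights each pixel and that $\epsilon = 1/255$; both the default value and the function name are assumptions made for this note.

```python
import math


def direct_loss(log_pred, hdr_gt, alpha, eps=1.0 / 255.0):
    """Mean squared log-domain error, weighted per pixel by alpha.

    log_pred: list of (r, g, b) network outputs in the log domain
    hdr_gt:   list of (r, g, b) linear ground-truth values
    alpha:    list of blend values, one per pixel
    """
    n = len(log_pred)
    total = 0.0
    for y_hat, h, a in zip(log_pred, hdr_gt, alpha):
        for y_c, h_c in zip(y_hat, h):
            total += (a * (y_c - math.log(h_c + eps))) ** 2
    return total / (3 * n)
```

Pixels with $\alpha_i = 0$ contribute nothing, so only the saturated regions drive the training signal.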

I/R loss $L_{IR}(\hat{y}, H)$:
It is meaningful to treat the illuminance and reflectance components separately, so the paper proposes a second loss function that does so. The illuminance component $I$ describes global variations and is responsible for the high dynamic range; the reflectance $R$ stores information about details and colors and has a low dynamic range, with $H_{i,c} = I_i R_{i,c}$. The log illuminance is approximated by applying a Gaussian low-pass filter $G_\sigma$ to the log luminance $L^{\hat{y}}$, and the log reflectance is approximated as the difference between the predicted log HDR image $\hat{y}$ and the log illuminance:
$$\log I_i^{\hat{y}} = \left( G_\sigma * L^{\hat{y}} \right)_i, \qquad \log R_{i,c}^{\hat{y}} = \hat{y}_{i,c} - \log I_i^{\hat{y}}$$
$L^{\hat{y}}$: logarithm of a linear combination of the color channels, $L_i^{\hat{y}} = \log\left( \sum_c w_c \exp(\hat{y}_{i,c}) \right)$, where $w = \{0.213, 0.715, 0.072\}$.
The standard deviation of the Gaussian filter is set to $\sigma = 2$.
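The loss function built from $I$ and $R$ is defined as:
$$L_{IR}(\hat{y}, H) = \frac{\lambda}{N} \sum_i \left| \alpha_i \left( \log I_i^{\hat{y}} - \log I_i^{y} \right) \right|^2 + \frac{1 - \lambda}{3N} \sum_{i,c} \left| \alpha_i \left( \log R_{i,c}^{\hat{y}} - \log R_{i,c}^{y} \right) \right|^2$$
$y$: $y = \log(H + \epsilon)$
$\lambda$: balancing parameter that weights the importance of illuminance against reflectance.
The paper uses $\lambda = 0.5$.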
Example predictions optimized with different values of $\lambda$ are shown in the figure:
[Figure: predictions optimized with different values of $\lambda$]
With the I/R loss, predictions tend to show fewer artifacts in large saturated regions, as shown in the figure. One possible explanation is that the Gaussian low-pass filter in the loss function has a regularizing effect, since it makes the loss at a pixel depend on its neighborhood:
[Figure: direct loss versus I/R loss in large saturated regions]
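The I/R decomposition and loss described in this section can be sketched in plain Python. Several details here are assumptions rather than facts from the paper: the Gaussian kernel is truncated at $3\sigma$, image borders are handled by edge replication, $\epsilon = 1/255$, and images are nested lists of `(r, g, b)` tuples. The function names are invented for this illustration.

```python
import math

WEIGHTS = (0.213, 0.715, 0.072)  # luminance weights w_c from the notes


def gaussian_kernel(sigma=2.0):
    """Normalized 1-D Gaussian kernel, truncated at 3 * sigma (assumption)."""
    radius = int(3 * sigma)
    k = [math.exp(-(x * x) / (2 * sigma * sigma)) for x in range(-radius, radius + 1)]
    total = sum(k)
    return [v / total for v in k]


def blur(img, kernel):
    """Separable blur of a 2-D list, replicating edge values (assumption)."""
    r = len(kernel) // 2

    def blur_rows(m):
        w = len(m[0])
        return [[sum(kv * row[min(max(x + d - r, 0), w - 1)]
                     for d, kv in enumerate(kernel))
                 for x in range(w)] for row in m]

    def transpose(m):
        return [list(col) for col in zip(*m)]

    return transpose(blur_rows(transpose(blur_rows(img))))


def decompose(log_img, sigma=2.0):
    """Split a log-domain RGB image into log illuminance and log reflectance."""
    lum = [[math.log(sum(w * math.exp(c) for w, c in zip(WEIGHTS, px)))
            for px in row] for row in log_img]
    log_i = blur(lum, gaussian_kernel(sigma))
    log_r = [[tuple(c - li for c in px) for px, li in zip(row, irow)]
             for row, irow in zip(log_img, log_i)]
    return log_i, log_r


def ir_loss(log_pred, hdr_gt, alpha, lam=0.5, eps=1.0 / 255.0, sigma=2.0):
    """Weighted sum of illuminance and reflectance errors in the log domain."""
    y = [[tuple(math.log(c + eps) for c in px) for px in row] for row in hdr_gt]
    i_hat, r_hat = decompose(log_pred, sigma)
    i_gt, r_gt = decompose(y, sigma)
    n = len(y) * len(y[0])
    loss_i = loss_r = 0.0
    for row in range(len(y)):
        for col in range(len(y[0])):
            a = alpha[row][col]
            loss_i += (a * (i_hat[row][col] - i_gt[row][col])) ** 2
            loss_r += sum((a * (p - q)) ** 2
                          for p, q in zip(r_hat[row][col], r_gt[row][col]))
    return lam / n * loss_i + (1 - lam) / (3 * n) * loss_r
```

One way to see the split: if a prediction differs from the ground truth only by a constant offset in the log domain (a global exposure error), the whole error lands in the illuminance term and the reflectance term stays at zero.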

3 HDR Image Dataset

The following figure shows the average histograms of two typical LDR datasets and of an HDR dataset of 125K images. The LDR data consist of about 2.5M images from Places and about 200K images from Flickr; the HDR data are images captured from the scenes of the HDR dataset.
[Figure: average histograms of the LDR and HDR datasets]
The LDR histograms show a relatively uniform distribution of pixel values, except for a distinct peak near the maximum value, which represents information lost to saturation. In the HDR histogram the pixels are not saturated and are instead represented by an exponentially decaying long tail.

Virtual camera:
Multiple random regions of each scene are captured with a randomly selected camera calibration. The regions are image crops of random size and position, followed by random flipping and resampling to 320×320 pixels. The camera calibration includes parameters for exposure, camera curve, white balance and noise level. This provides an augmented set of LDR images and corresponding HDR images, used as training input and ground truth.
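A heavily simplified sketch of such a virtual camera follows. It covers only the random crop and the exposure, clipping, camera-curve and quantization steps; white balance and noise are omitted. The square crop, the minimum crop fraction, the gamma camera curve and the 8-bit quantization are all assumptions made for illustration, and the function names are hypothetical.

```python
import random


def random_crop_box(width, height, rng, min_frac=0.2):
    """Pick a square crop of random size and position: (left, top, side)."""
    short = min(width, height)
    side = rng.randint(max(1, int(min_frac * short)), short)
    left = rng.randint(0, width - side)
    top = rng.randint(0, height - side)
    return left, top, side


def virtual_camera_pixel(hdr_value, exposure, gamma=2.0, levels=255):
    """Expose, clip to [0, 1], apply an assumed gamma camera curve, quantize."""
    clipped = min(max(hdr_value * exposure, 0.0), 1.0)
    return round(clipped ** (1.0 / gamma) * levels) / levels
```

Passing a seeded generator such as `random.Random(0)` as `rng` makes the crops reproducible. Any scene value that exceeds 1 after exposure is clipped, which is exactly the information the network is later trained to restore.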

4 Training

Different strategies are used to initialize the weights in different parts of the network.
(1) Because the encoder uses the convolutional layers of the VGG16 network, it can be initialized with weights pre-trained for large-scale image classification on the Places database.
(2) The decoder deconvolutions are initialized to perform bilinear upsampling, and the fusions of the skip-connection layers are initialized to perform feature addition.
(3) For the convolutions at the latent image representation (right side of the network architecture figure) and at the final feature reduction (top left of the figure), the paper uses Xavier initialization.
(4) The I/R loss function is minimized with the Adam optimizer at a learning rate of $5 \times 10^{-5}$. Training runs for 800K back-propagation steps in total with a batch size of 8, which takes about 6 days on an NVIDIA Titan X GPU.

4.1 Pre-training on simulated HDR data

Because existing HDR data are limited, the authors apply transfer learning by pre-training the entire network on a large simulated HDR dataset. They select a subset of the images in the Places database, requiring that an image contain no saturated regions. Given the set of all Places images $\mathbb{P}$, this subset $\mathbb{S} \subset \mathbb{P}$ is defined as
$$\mathbb{S} = \{ D \mid D \in \mathbb{P},\ p_D(255) < \xi \}$$
$p_D$: image histogram
$\xi$: the paper uses $\xi = 50/256^2$; that is, an image is used in the training set if fewer than 50 pixels (0.076% of the $256^2$ pixels of an image) have the maximum value.
Comparing the green dashed line with the orange solid line in the figure below shows that the average histogram over the subset $\mathbb{S}$ lacks the saturated-pixel peak of the original set $\mathbb{P}$:
[Figure: average histograms of $\mathbb{P}$ and $\mathbb{S}$]
A simulated HDR training dataset is created by linearizing each image $D \in \mathbb{S}$ and boosting its exposure, $H = s f^{-1}(D)$.
The simulated HDR dataset is prepared in the same way as in Section 3, but at a resolution of 224×224 pixels and without resampling. The CNN is trained with the Adam optimizer at a learning rate of $2 \times 10^{-5}$ for a total of 3.2M steps with a batch size of 4.
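The subset selection and the exposure boost from this section can be sketched as below. Two things are assumptions: the histogram test is applied to a flat list of 8-bit values (the notes do not say whether it is computed per channel or on luminance), and the linearization uses $f^{-1}(D) = D^{\gamma}$ with $\gamma = 2$. The function names are hypothetical.

```python
def is_unsaturated(pixels, xi=50 / 256 ** 2, max_value=255):
    """True if the fraction of pixels equal to max_value is below xi."""
    pixels = list(pixels)
    saturated = sum(1 for p in pixels if p == max_value)
    return saturated / len(pixels) < xi


def simulate_hdr_pixel(ldr_value, scale, gamma=2.0):
    """H = s * f^{-1}(D), with an assumed gamma linearization."""
    return scale * ldr_value ** gamma
```

With the default $\xi$, a 256×256 image with 49 pixels at value 255 is kept and one with 50 such pixels is rejected. Boosting the exposure with a scale $s > 1$ pushes bright but unclipped values above 1, so a later virtual-camera capture saturates them while the ground truth remains known.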
Pre-training on this synthetic data leads to a significant improvement in performance. As the figure below shows, small highlights that are otherwise sometimes underestimated are recovered better, and fewer artifacts are introduced in larger saturated regions:
[Figure: reconstructions with and without pre-training]

5 Results

5.1 Test errors

The following table shows the effect of different training strategies on the different error measures:
[Table: test errors for different training strategies]
(1) The CNN without skip connections already significantly reduces the MSE of the input, but adding skip connections reduces the error by a further 24% and produces images with markedly improved detail.
(2) Comparing the direct loss with the I/R loss, the I/R loss shows lower error in terms of both the I/R and the direct MSE, a reduction of 5.8%.
(3) The best training performance is achieved with pre-training and the I/R loss; compared with no pre-training, the error is reduced by 10.7%.

5.2 Comparisons to ground truth

The following figure compares reconstructed images with the ground truth. The input is the LDR image obtained by passing the HDR image through the virtual camera.
[Figure: reconstructions compared with ground truth]
To visualize the information reconstructed by the CNN, the following figure shows the learned residual of the image, $\hat{r} = \max(0, \hat{H} - 1)$ (top left), and the ground-truth residual, $r = \max(0, H - 1)$ (top right). The predictions for regions with complex lighting are convincing (bottom left), but for a very intense spotlight the brightness is underestimated (bottom right).
[Figure: learned and ground-truth residuals]
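The residual used for this visualization is straightforward to compute. The sketch below assumes linear HDR values scaled so that the saturation point is 1, as the formula implies; the function name is made up for this note.

```python
def highlight_residual(hdr_pixels):
    """Part of each linear HDR value that lies above the saturation point 1."""
    return [max(0.0, h - 1.0) for h in hdr_pixels]
```

Applying it to both the prediction $\hat{H}$ and the ground truth $H$ isolates what was lost in the LDR input, since everything at or below 1 maps to zero.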

5.3 Reconstruction with real-world cameras

The paper's HDR reconstruction model is applied to LDR images from real cameras; the results are shown in the following figure:
[Figure: reconstructions from real camera images]
To further explore the reconstruction of everyday images, the following figure shows a set of iPhone images taken in a variety of situations:

5.4 Changing clipping point

The following figure shows predictions for different virtual-camera exposure times. The proportion of saturated pixels is given below each image, and the images are scaled to have the same exposure after clipping. More detail is recovered in the reconstructions from shorter-exposure images.
[Figure: predictions for different clipping points]

5.5 Comparison to iTMOs

The following figure compares the paper's HDR reconstruction with three existing inverse tone mapping operators (iTMOs):
[Figure: comparison with iTMOs]

6 Discussion

6.1 Limitations

The following figure shows examples of difficult scenes that the method struggles to reconstruct:
(1) The first row contains a large region that is saturated in all color channels, so its structure and details cannot be inferred.
(2) The second row shows that when a region of the image has extreme intensity, the method underestimates it.
[Figure: failure cases]