From the data perspective, image denoising via learning-based methods can be modeled as a data mapping from the noisy image to the clean image. The learnability of data mapping depends on the complexity of noise distribution, the volume of paired data, and the quality of labeled data. Accordingly, we develop a learnability enhancement strategy for low-light raw image denoising by reforming paired real data according to noise modeling.
Our main contributions are summarized as follows:
Underdeveloped data quality is also one of the culprits behind the fragile learnability of the data mapping between paired real data.
The underdevelopment is reflected in four data flaws: spatial misalignment, intensity misalignment, noisy ground truth, and insufficient diversity, which lead to incorrect data mapping, biased data mapping, poor convergence, and overfitted denoising models, respectively.
Existing denoising datasets suffer significantly from at least one of these flaws. As a result, existing datasets can hardly meet the needs of low-light denoising from a data perspective.
Our motivation is to develop the image acquisition protocol and build a high-quality dataset for low-light raw image denoising from a data perspective.
Image Acquisition Protocol
[MetaInfo]: We build a high-quality dataset using the Redmi K30 smartphone with the IMX686 sensor. The Low-light Raw Image Denoising~(LRID) dataset contains 138 scenes, including 82 indoor and 56 outdoor scenes, with a total of 5754 images.
[Setup]: We first capture 25 long-exposure images at ISO-100 and immediately capture several groups of short-exposure images at ISO-6400. Finally, a pair of long-exposure images, one before and one after the smartphone's original ISP, is captured for real-world low-light image enhancement. We use a program to remotely control the smartphone, and the interval between captures is very short (about 0.01s per frame), so misalignment between short-exposure frames is negligible.
[Indoor Scenes]: The indoor scenes are captured in enclosed spaces with various color temperatures and illumination setups. There are five groups of short-exposure images for each scene, and the exposure time ratios of long- and short-exposure images are 64, 128, 256, 512, and 1024, respectively. The total exposure time of the long-exposure images is about 25s.
[Outdoor Scenes]: The outdoor scenes are captured at midnight in calm wind (below 0.5m/s). There are three groups of short-exposure images for each scene, and the exposure time ratios of the long- and short-exposure images are 64, 128, and 256, respectively. The total exposure time of the long-exposure images is about 64s.
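To make the exposure schedule concrete, the sketch below computes the short-exposure times implied by the ratios, assuming the ratio is defined between a single long-exposure frame and a single short-exposure frame (roughly 1s per long-exposure frame indoors and 2.56s outdoors); `short_exposures` is a hypothetical helper, not part of our acquisition code.

```python
def short_exposures(long_exposure_s, ratios):
    """Return the short-exposure time (in seconds) for each long/short exposure ratio."""
    return [long_exposure_s / r for r in ratios]

# Indoor scenes: 25 long-exposure frames totalling about 25 s, i.e. roughly 1 s per frame.
print(short_exposures(1.0, [64, 128, 256, 512, 1024]))
# Outdoor scenes: about 64 s in total, i.e. roughly 2.56 s per frame.
print(short_exposures(2.56, [64, 128, 256]))
```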
Our image acquisition protocol is superior because it addresses the four data flaws that affect learnability.
The maximum total exposure time and the minimum exposure time are limited by spatial alignment (Section 4.1.1) and intensity alignment (Section 4.1.2), respectively. The capture setups of the long-exposure and short-exposure images are designed for clean ground truth (Section 4.1.3) and sufficient diversity (Section 4.1.4), respectively.
Ground Truth Estimation
Our ground truth estimation pipeline. We first preprocess each input raw image and then fuse the multiple frames to produce the output.
Results of the confidence-based multi-frame fusion. The blue region represents the confidence masks, the green region represents the fusion results without alignment, and the yellow region represents the fusion results with alignment. The final result of our ground truth estimation is ``Ours'' with alignment.
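As a minimal sketch of the confidence-based fusion described above (the alignment function and per-pixel confidence masks are assumed to be given; the actual preprocessing and confidence estimation follow our pipeline), the fused ground truth can be computed as a confidence-weighted average:

```python
import numpy as np

def fuse_frames(frames, confidences, align_fn=None, eps=1e-8):
    """Fuse frames as a per-pixel confidence-weighted average, optionally after alignment."""
    acc = np.zeros_like(frames[0], dtype=np.float64)
    weight = np.zeros_like(frames[0], dtype=np.float64)
    for frame, conf in zip(frames, confidences):
        if align_fn is not None:
            frame = align_fn(frame)   # align each frame to the reference before fusion
        acc += conf * frame
        weight += conf
    return acc / (weight + eps)       # avoid division by zero where confidence is all zero
```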
Comparison
Quantitative results (PSNR/SSIM) of different methods on the ELD dataset, SID dataset, and our LRID dataset. The red color indicates the best results and the blue color indicates the second-best results.
Denoising models trained with synthetic data are unable to completely remove complicated real noise.
P-G is far from the real noise model, resulting in limited performance.
ELD considers more noise sources but still deviates from the real noise model, resulting in color bias and residual noise.
Although SFRN samples real read noise, its patch-wise method cannot inherently address the mapping dilemma caused by dark shading, resulting in residual FPN.
The paired real data, despite containing real noise, is so fragile in learnability that the denoising model cannot learn the precise and accurate data mapping, resulting in blurry results and color bias.
By applying the learnability enhancement strategy to the paired real data, the denoising performance is significantly improved in both quantitative results and visual quality. Our method produces clean denoising results with the clearest textures and the most accurate colors.
Our methods demonstrate superior denoising performance on our dataset, with the clearest textures and the most accurate colors. This performance is consistent with our results on public datasets, indicating the high generalizability of our methods.
On our dataset, paired real data generally outperforms the best synthetic data (SFRN), which indicates that the data quality of synthetic data is inferior to that of paired real data on our dataset. This result differs from the findings of other noise modeling studies on public datasets. We attribute this discrepancy to the fact that our image acquisition protocol has addressed various data flaws. When data is well-annotated (i.e., the ground truth is clean and well-aligned) and sufficient in quantity, the learnability of data mapping is well-guaranteed. Under these conditions, the data quality of paired real data should surpass that of synthetic data. The observation indicates the superiority of our dataset.
Ablation Study
Ablation study of different learnability enhancement modules on the ELD dataset, SID dataset, and LRID dataset. ``*'' indicates that the module uses the implementation from the preliminary version.
Representative visual comparison of different data schemes. ``*'' indicates that the module uses the implementation from the preliminary version. Our full learnability enhancement strategy (Paired + SNA + DSC) produces more accurate colors and clearer details compared with the other baselines.
Overall, the best performance is achieved using the complete learnability enhancement strategy. SNA focuses on promoting the precision of the data mapping, while DSC focuses on promoting its accuracy. The combination of the two modules yields the best performance in both quantitative results and visual quality. Benefiting from the development of SNA and DSC, our learnability enhancement strategy surpasses the state of the art established by our preliminary work.
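As a conceptual illustration only (assuming SNA augments a paired sample with additional synthetic shot noise and DSC subtracts a per-ISO calibrated dark shading frame from the noisy input; `shot_noise_augment` and `dark_shading` are hypothetical placeholders, and the exact formulations follow the paper), combining the two modules at training time can be sketched as:

```python
def prepare_training_pair(noisy, clean, iso, ratio, dark_shading, shot_noise_augment):
    """Apply DSC then SNA to one paired raw sample before it is fed to the denoiser."""
    noisy = noisy - dark_shading(iso)                             # DSC: remove temporally stable noise
    noisy, clean = shot_noise_augment(noisy, clean, iso, ratio)   # SNA: augment the paired data
    return noisy, clean
```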
Our ablation studies of noise diversity and scene diversity show that data quality improves as the number of noisy images per scene and the percentage of scenes increase. The gain in data quality gradually saturates as the number of noisy images approaches the number used in our dataset, indicating that our dataset has sufficient noise diversity and scene diversity.
Extension of DSC on Noise Modeling
The quantitative comparison of noise models with or without DSC on various datasets.
Dark shading, the comprehensive modeling of temporally stable noise, is also an extension of physics-based noise modeling.
The extended application of DSC to noise modeling can significantly improve the performance of existing noise modeling methods.
In general, if the original noise model does not adequately account for temporally stable noise, extending it with DSC significantly improves its quantitative results.
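As a conceptual sketch (the shot/read noise samplers and the per-ISO dark shading lookup are hypothetical placeholders rather than any specific noise model), the extension amounts to adding the calibrated temporally stable component on top of the randomly sampled noise:

```python
def synthesize_noisy(clean, iso, dark_shading, sample_shot_noise, sample_read_noise):
    """Add signal-dependent and signal-independent noise, plus the calibrated dark shading."""
    noisy = clean + sample_shot_noise(clean, iso) + sample_read_noise(clean.shape, iso)
    return noisy + dark_shading(iso)  # the temporally stable component that many noise models omit
```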
It is worth highlighting that SFRN requires High-Bit Recovery~(HBR) to work; however, HBR and DSC conflict in implementation.
To address this conflict between DSC and HBR, we propose a novel approximation algorithm.
This approximation algorithm can effectively avoid extra errors in the quantization process while maintaining the original computational complexity.
The noise modeling method corrected by DSC yields a more realistic noise model, leading to higher denoising performance. DSC brings significant improvements to noise modeling methods in quantitative results on various datasets.
Compared with the corresponding noise models without DSC, DSC yields denoised images with fewer artifacts and more accurate colors. Our extensive experiments demonstrate the potential for widespread use of our methods.
Generalizability
Dark shading calibrated on different cameras leads to slightly different denoising performance; however, the quantitative results are close to each other and remain significantly higher than those of previous works. This comparison demonstrates the high consistency of dark shading calibrated on different cameras with the same sensor, indicating that our DSC is feasible under the above configuration.
According to our observations, dark shading with noticeable patterns widely exists in mainstream sensors. The sensors widely used in smartphones and surveillance cameras also contain noticeable dark shading, which indicates that our DSC is indispensable.
1. Overheating makes DSC less effective, whereas the neural network is robust to dark shading differences within the normal operating temperature range.
2. To prevent the dark shading parameters from becoming inappropriate when the sensor switches circuits, dark shading should be calibrated separately for each circuit configuration.
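For reference, the following is a minimal calibration sketch under the common assumption that dark shading can be estimated as the temporal mean of many dark (lens-covered) frames captured at the target setting; the number of frames and any ISO-dependent decomposition used in practice are omitted.

```python
import numpy as np

def calibrate_dark_shading(dark_frames):
    """Average many dark frames so temporally random noise cancels and the stable pattern remains."""
    return np.mean(np.stack(dark_frames, axis=0), axis=0)
```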