Text in the Dark: Extremely Low-Light Text Image Enhancement

1Universiti Malaya, Kuala Lumpur, Malaysia
2Chalmers University of Technology, Gothenburg, Sweden
3The University of Adelaide, Adelaide, Australia
4National Tsing Hua University, Hsinchu, Taiwan
*Equal Contribution

Abstract

Text extraction in extremely low-light images is challenging. Although existing low-light image enhancement methods can enhance images as preprocessing before text extraction, they do not focus on scene text. Further research is also hindered by the lack of extremely low-light text datasets. Thus, we propose a novel extremely low-light image enhancement framework with an edge-aware attention module to focus on scene text regions. Our method is trained with text detection and edge reconstruction losses to emphasize low-level scene text features. Additionally, we present a Supervised Deep Curve Estimation model to synthesize extremely low-light images based on the public ICDAR15 (IC15) dataset. We also labeled texts in the extremely low-light See In the Dark (SID) and ordinary LOw-Light (LOL) datasets to benchmark extremely low-light scene text tasks. Extensive experiments show that our model outperforms state-of-the-art methods on all datasets.

Text in the Dark Dataset

To address the lack of low-light scene text datasets, this work proposes a new large-scale low-light OCR dataset called Text in the Dark. The images in the Text in the Dark dataset are sourced from two commonly used low-light datasets: the extremely low-light See in the Dark (SID) dataset and the ordinary Low-Light (LOL) dataset. Specifically, the SID dataset has two subsets: SID-Sony, captured with a Sony α7S II, and SID-Fuji, captured with a Fujifilm X-T2. In short, the Text in the Dark dataset consists of three subsets, namely SID-Sony-Text, SID-Fuji-Text, and LOL-Text. We show an annotated image from each subset in the following figures:

SID-Sony-Text.
SID-Fuji-Text.
LOL-Text.

We included 878/810 short-exposure images and 211/176 long-exposure images at resolutions of 4240x2832/6000x4000 from SID-Sony and SID-Fuji, respectively. The LOL dataset provides low/normal-light image pairs captured in real scenes by controlling exposure time and ISO; its training and test sets contain 485 and 15 images, respectively, at a resolution of 600x400. We closely annotated text instances in the SID and LOL datasets following the common ICDAR15 annotation standard (a minimal parsing sketch is given after the statistics table below). Statistics of the Text in the Dark dataset are as follows:

Detailed statistics of the Text in the Dark dataset.
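Since the annotations follow the ICDAR15 convention, each ground-truth file can be parsed with a few lines of Python. The snippet below is a minimal sketch, assuming the standard IC15 line format of eight corner coordinates followed by a transcription, with "###" marking don't-care regions; it is illustrative rather than the official loading code.

# Minimal sketch of parsing an ICDAR15-style annotation file (illustrative only).
# Each line is assumed to be "x1,y1,x2,y2,x3,y3,x4,y4,transcription",
# with "###" marking don't-care regions, as in the standard IC15 ground truth.
def parse_icdar15_annotation(path):
    boxes = []
    with open(path, encoding="utf-8-sig") as f:  # IC15 files often carry a BOM
        for line in f:
            line = line.strip()
            if not line:
                continue
            parts = line.split(",")
            coords = list(map(int, parts[:8]))        # four corner points
            transcription = ",".join(parts[8:])       # the text itself may contain commas
            polygon = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
            boxes.append((polygon, transcription, transcription == "###"))
    return boxes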

Extremely Low-Light Text Image Enhancement

Our novel Extremely Low-Light Text image enhancement model consists of a U-Net that accommodates extremely low-light images and edge maps through two independent encoders. During training, instead of channel attention, the encoded edges guide the spatial attention sub-module of the proposed Edge-Aware Attention (Edge-Att) module to attend to edge pixels related to text representations. Besides the image enhancement losses, our model incorporates text detection and edge reconstruction losses into the training process (a hedged sketch of this loss combination is given after the architecture figure below). This integration guides the model's attention towards text-related features and regions, improving the analysis of textual content in the enhanced images. As a pre-processing step, we also introduce a novel augmentation technique, Text-CP, to increase the presence of non-overlapping and unique text instances in training images, thereby promoting comprehensive learning of text.

Illustration of the architecture of the proposed Extremely Low-Light Text Image Enhancement framework.
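To make the training objective concrete, the following is a minimal sketch of how the three loss terms could be combined. The L1 formulation of the enhancement and edge reconstruction terms, the stand-in det_loss_fn for the text detection loss, and the lambda weights are all illustrative assumptions rather than the exact configuration used in the paper.

import torch.nn.functional as F

def total_loss(enhanced, target, pred_edges, target_edges,
               det_loss_fn, lambda_det=1.0, lambda_edge=1.0):
    # Image enhancement term: distance between enhanced and ground-truth images
    # (assumed L1 here for illustration).
    l_enh = F.l1_loss(enhanced, target)
    # Edge reconstruction term: distance between predicted and target edge maps.
    l_edge = F.l1_loss(pred_edges, target_edges)
    # Text detection term: supplied by a detector-based loss function
    # (a placeholder for however the detection supervision is computed).
    l_det = det_loss_fn(enhanced, target)
    # The lambda weights are placeholders, not the values used in the paper.
    return l_enh + lambda_det * l_det + lambda_edge * l_edge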

The figure below highlights two core modules of our framework: the edge decoder and Edge-Att. These two modules enable the framework to attend to rich image and edge features simultaneously, which leads to a significant H-Mean improvement (a hedged sketch of such an edge-guided attention block is given after the figure). Subfigure (a): visual representation of our edge decoder, where A and B represent the outputs of the corresponding convolution blocks in the architecture figure above and S denotes the scaling of the image. Subfigure (b): illustration of the proposed Edge-Aware Attention (Edge-Att) module.

Illustration of edge decoder and Edge-Att modules.
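As a concrete illustration of the edge-guided attention idea, the block below computes a single-channel spatial attention map from the encoded edge features and uses it to reweight the image features. This is a minimal sketch with illustrative layer sizes (a CBAM-style spatial attention driven by edge features), not the paper's exact Edge-Att design.

import torch
import torch.nn as nn

class EdgeGuidedSpatialAttention(nn.Module):
    # Sketch of an edge-guided spatial attention block: the attention map is
    # derived from the edge features so that pixels on text-related edges are
    # emphasized in the image features. Layer sizes are illustrative.
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, image_feat, edge_feat):
        # image_feat and edge_feat are assumed to share the same spatial size.
        avg_pool = edge_feat.mean(dim=1, keepdim=True)        # channel-wise average
        max_pool = edge_feat.max(dim=1, keepdim=True).values  # channel-wise maximum
        attn = torch.sigmoid(self.conv(torch.cat([avg_pool, max_pool], dim=1)))
        return image_feat * attn  # reweight image features towards edge/text pixels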

We also propose a novel data augmentation method, Text-Aware Copy Paste (Text-CP). It considers each text box's location and size by leveraging uniform and Gaussian distributions derived from the dataset, as shown below (a hedged placement sketch follows the illustration):

Illustration of Text-CP module.
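The snippet below sketches the placement step that such a text-aware copy-paste could use: the pasted box size is drawn from a Gaussian fitted to the dataset's text-box statistics, its position is drawn uniformly over the image, and candidates overlapping existing text boxes are rejected. The function names and the size_mu/size_sigma parameters are hypothetical placeholders, not the paper's exact procedure.

import random

def sample_paste_box(existing_boxes, img_w, img_h, size_mu, size_sigma, max_tries=50):
    # size_mu/size_sigma: assumed (width, height) mean and std derived from the dataset.
    for _ in range(max_tries):
        w = max(1, int(random.gauss(size_mu[0], size_sigma[0])))
        h = max(1, int(random.gauss(size_mu[1], size_sigma[1])))
        x = random.randint(0, max(0, img_w - w))   # uniform position over the image
        y = random.randint(0, max(0, img_h - h))
        candidate = (x, y, x + w, y + h)
        if all(not _overlaps(candidate, box) for box in existing_boxes):
            return candidate                       # non-overlapping placement found
    return None                                    # give up after max_tries attempts

def _overlaps(a, b):
    # Axis-aligned boxes (x1, y1, x2, y2) overlap unless one is fully to the side of the other.
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])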

Experiment Results

All low-light image enhancement methods are trained and tested on the datasets described above. They are then evaluated in terms of intensity metrics (PSNR, SSIM), perceptual similarity (LPIPS), and text detection (H-Mean); a hedged metric-computation sketch is given after the table. For the SID-Sony-Text, SID-Fuji-Text, and LOL-Text datasets, which are annotated with text bounding boxes only, we used well-known and commonly used scene text detectors (CRAFT and PAN) to analyze the enhanced images. The table below shows the quantitative results of PSNR, SSIM, LPIPS, and text detection H-Mean for low-light image enhancement methods on the SID-Sony-Text, SID-Fuji-Text, and LOL-Text datasets. Please note that TRAD, ZSL, UL, and SL stand for traditional methods, zero-shot learning, unsupervised learning, and supervised learning, respectively. Scores in bold are the best.

Quantitative result table.
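For reference, the snippet below sketches how the reported image-quality metrics can be computed, assuming the scikit-image and lpips packages; detection precision and recall come from running CRAFT or PAN under the standard IC15 evaluation protocol and are not reproduced here.

import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # perceptual similarity network

def evaluate_pair(enhanced, target):
    # enhanced/target: HxWx3 uint8 arrays (enhanced output and ground-truth image).
    psnr = peak_signal_noise_ratio(target, enhanced, data_range=255)
    ssim = structural_similarity(target, enhanced, channel_axis=-1, data_range=255)
    # LPIPS expects NCHW float tensors scaled to [-1, 1].
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_fn(to_tensor(enhanced), to_tensor(target)).item()
    return psnr, ssim, lp

def h_mean(precision, recall):
    # Detection H-Mean is the harmonic mean of the detector's precision and recall.
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0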

In the following figure, we show a comparison with state-of-the-art methods on the SID-Sony-Text dataset. The figure is arranged as follows: in each column, the first row displays the enhanced image with blue boxes marking regions of interest, and the second row displays the zoomed-in regions overlaid with red text detection boxes from CRAFT. Column (a) shows the low-light input image, columns (b) to (o) show enhancement results from all related methods, and the last column shows the ground truth image.

Qualitative result figure.

We show that while GAN-enhanced images tend to be less noisy, their text regions are blurry, making text detection challenging. Moreover, our model achieves the highest PSNR and SSIM scores on both the SID-Sony-Text and SID-Fuji-Text datasets, showing that our enhanced images are closest in quality to the ground truth images. In short, better text detection is achieved on our enhanced images through improved overall image quality and the preservation of fine details within text regions.

BibTeX

If you wish to cite the ICPR 2022 Extremely Low-Light Image Enhancement with Scene Text Restoration paper:

@inproceedings{icpr2022_ellie,
  author={Hsu, Po-Hao and Lin, Che-Tsung and Ng, Chun Chet and Long Kew, Jie and Tan, Mei Yih and Lai, Shang-Hong and Chan, Chee Seng and Zach, Christopher},
  booktitle={2022 26th International Conference on Pattern Recognition (ICPR)},
  title={Extremely Low-Light Image Enhancement with Scene Text Restoration},
  year={2022},
  pages={317-323}
}

If you wish to cite the latest version of the ICText dataset and AGCL:

Our paper is currently under review. We will update this section when it is published.