Extremely low-light text images pose significant challenges for scene text detection. Existing methods enhance these images using low-light image enhancement techniques before text detection. However, they fail to address the importance of low-level features, which are essential for optimal performance in downstream scene text tasks. Further research is also limited by the scarcity of extremely low-light text datasets. To address these limitations, we propose a novel, text-aware extremely low-light image enhancement framework. Our approach first integrates a Text-Aware Copy-Paste (Text-CP) augmentation method as a preprocessing step, followed by a dual-encoder-decoder architecture enhanced with Edge-Aware attention modules. We also introduce text detection and edge reconstruction losses to train the model to generate images with higher text visibility. Additionally, we propose a Supervised Deep Curve Estimation (Supervised-DCE) model for synthesizing extremely low-light images, allowing training on publicly available scene text datasets such as IC15. To further advance this domain, we annotated texts in the extremely low-light See In the Dark (SID) and ordinary LOw-Light (LOL) datasets. The proposed framework is rigorously tested against various traditional and deep learning-based methods on the newly labeled SID-Sony-Text, SID-Fuji-Text, LOL-Text, and synthetic extremely low-light IC15 datasets. Our extensive experiments demonstrate notable improvements in both image enhancement and scene text tasks, showcasing the model's efficacy in text detection under extremely low-light conditions.
To address the lack of low-light scene text datasets, this work proposes a new large-scale low-light OCR dataset called Text in the Dark. Its images are sourced from two commonly used low-light datasets: the extremely low-light See In the Dark (SID) dataset and the ordinary LOw-Light (LOL) dataset. Specifically, the SID dataset has two subsets: SID-Sony, captured with a Sony α7S II, and SID-Fuji, captured with a Fujifilm X-T2. In short, the Text in the Dark dataset consists of three subsets, namely SID-Sony-Text, SID-Fuji-Text, and LOL-Text. We show annotated images from each subset in the following figures:
Our novel image enhancement model is a U-Net that takes extremely low-light images and their edge maps through two independent encoders. During training, instead of channel attention, the encoded edges guide the spatial attention sub-module of the proposed Edge-Aware Attention (Edge-Att) to attend to edge pixels related to text representations. Besides the image enhancement losses, our model incorporates text detection and edge reconstruction losses into the training process (a rough sketch of such a combined objective is given below). This integration guides the model's attention towards text-related features and regions, improving the analysis of textual content in the enhanced images. As a pre-processing step, we introduce a novel augmentation technique called Text-Aware Copy-Paste (Text-CP) that increases the number of non-overlapping and unique text instances in training images, thereby promoting comprehensive learning of text representations.
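The following is a minimal sketch of how such a combined objective could be wired up, not the exact formulation used in the paper: the L1 reconstruction terms, the assumption that the text detection loss is supplied as a precomputed scalar from a detector run on the enhanced image, and the weights lambda_edge and lambda_text are illustrative placeholders.

```python
import torch.nn.functional as F

def total_loss(enhanced, target, pred_edge, target_edge,
               text_det_loss, lambda_edge=0.1, lambda_text=0.1):
    """Weighted sum of image enhancement, edge reconstruction, and text
    detection losses (illustrative placeholder terms and weights)."""
    # Image enhancement loss: pixel-wise reconstruction against the ground truth.
    l_enh = F.l1_loss(enhanced, target)
    # Edge reconstruction loss: edge decoder output vs. the ground-truth edge map.
    l_edge = F.l1_loss(pred_edge, target_edge)
    # Text detection loss: assumed to be computed externally by a text detector
    # applied to the enhanced image and passed in as a scalar tensor.
    return l_enh + lambda_edge * l_edge + lambda_text * text_det_loss
```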
The figure below highlights two core modules of our framework: the edge decoder and Edge-Att. These two modules enable the framework to attend to rich image and edge features simultaneously, which leads to a significant H-Mean improvement. Subfigure (a): visual representation of our edge decoder, where A and B denote the outputs of the corresponding convolution blocks in Figure 2 and S denotes the scaling of the image. Subfigure (b): illustration of the proposed Edge-Aware Attention module.
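To make subfigure (b) concrete, below is a minimal sketch of an edge-guided spatial attention block under our own assumptions: the channel-wise max/average pooling, the 7x7 convolution, and the sigmoid gating follow a generic spatial attention design and are not necessarily the exact Edge-Att layout.

```python
import torch
import torch.nn as nn

class EdgeAwareAttention(nn.Module):
    """Sketch of edge-guided spatial attention: the attention map is derived
    from encoded edge features (instead of a channel-attention branch) and is
    used to reweight the image features."""

    def __init__(self, kernel_size=7):
        super().__init__()
        # 2 input channels: channel-wise max- and average-pooled edge features.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, img_feat, edge_feat):
        # Pool the edge features along the channel dimension.
        max_pool, _ = edge_feat.max(dim=1, keepdim=True)
        avg_pool = edge_feat.mean(dim=1, keepdim=True)
        # Spatial attention map in [0, 1], guided purely by the edge branch.
        attn = torch.sigmoid(self.conv(torch.cat([max_pool, avg_pool], dim=1)))
        # Emphasise image-feature pixels that coincide with (text) edges.
        return img_feat * attn
```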
We also propose a novel data augmentation method, Text-CP. It places pasted text instances by sampling each text box's location and size from uniform and Gaussian distributions derived from the dataset, as shown below:
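As an illustration only, the sketch below assumes paste locations are drawn uniformly over the image, patch sizes are drawn from a Gaussian fitted to the dataset's box statistics, and overlapping placements are rejected; the helper names (text_cp, size_stats, _overlap) and the rejection loop are hypothetical, not the released Text-CP implementation.

```python
import cv2
import numpy as np

def _overlap(a, b):
    """Axis-aligned intersection test for (x1, y1, x2, y2) boxes."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def text_cp(image, boxes, text_patches, size_stats, n_paste=3, max_tries=20, rng=None):
    """Paste cropped text patches at locations sampled uniformly over the image,
    with sizes sampled from a Gaussian (mean/std in size_stats) derived from the
    dataset's text boxes. Placements overlapping existing boxes are rejected."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    out, out_boxes = image.copy(), list(boxes)
    for _ in range(n_paste):
        patch = text_patches[rng.integers(len(text_patches))]
        # Sample the pasted box size from a Gaussian fitted to the dataset.
        pw = int(np.clip(rng.normal(size_stats["w_mean"], size_stats["w_std"]), 8, w // 2))
        ph = int(np.clip(rng.normal(size_stats["h_mean"], size_stats["h_std"]), 8, h // 2))
        for _ in range(max_tries):
            # Sample the top-left corner from a uniform distribution.
            x, y = int(rng.integers(0, w - pw)), int(rng.integers(0, h - ph))
            new_box = (x, y, x + pw, y + ph)
            if not any(_overlap(new_box, b) for b in out_boxes):
                out[y:y + ph, x:x + pw] = cv2.resize(patch, (pw, ph))
                out_boxes.append(new_box)
                break
    return out, out_boxes
```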
All low-light image enhancement methods are trained and tested on the datasets detailed in Section 5, and are then evaluated in terms of intensity metrics (PSNR, SSIM), perceptual similarity (LPIPS), and text detection (H-Mean). For the SID-Sony-Text, SID-Fuji-Text, and LOL-Text datasets, which are annotated with text bounding boxes only, we used the well-known and commonly used scene text detectors CRAFT and PAN to analyze the enhanced images. The table below shows the quantitative PSNR, SSIM, LPIPS, and text detection H-Mean results of low-light image enhancement methods on the SID-Sony-Text, SID-Fuji-Text, and LOL-Text datasets. Please note that TRAD, ZSL, UL, and SL stand for traditional methods, zero-shot learning, unsupervised learning, and supervised learning, respectively. Scores in bold are the best among all methods.
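For reference, the image-quality metrics can be computed with scikit-image and the lpips package as in the sketch below; H-Mean is the harmonic mean of a detector's precision and recall, which in our setting would come from the CRAFT or PAN evaluation protocol. The helper names and the uint8 input assumption are ours.

```python
import lpips  # pip install lpips
import torch
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # perceptual similarity network

def evaluate_pair(enhanced, target):
    """PSNR, SSIM, and LPIPS for one enhanced / ground-truth pair of
    HxWx3 uint8 arrays; LPIPS expects NCHW tensors scaled to [-1, 1]."""
    psnr = peak_signal_noise_ratio(target, enhanced, data_range=255)
    ssim = structural_similarity(target, enhanced, channel_axis=-1, data_range=255)
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1).float().unsqueeze(0) / 127.5 - 1.0
    with torch.no_grad():
        lp = lpips_fn(to_tensor(enhanced), to_tensor(target)).item()
    return psnr, ssim, lp

def h_mean(precision, recall):
    """Text detection H-Mean: harmonic mean of detector precision and recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```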
The following figure compares our method with state-of-the-art methods on the SID-Sony-Text dataset. It is arranged as follows: in each column, the first row displays the enhanced image with blue boxes marking regions of interest, and the second row displays the zoomed-in regions overlaid with red text detection boxes from CRAFT. Column (a) displays the low-light image, columns (b) to (o) show the enhancement results of all related methods, and the last column displays the ground truth image.
We observe that while GAN-enhanced images tend to be less noisy, their text regions are blurry, making text detection challenging. Moreover, our model achieves the highest PSNR and SSIM scores on both the SID-Sony-Text and SID-Fuji-Text datasets, showing that our enhanced images are the closest in quality to the ground truth images. In short, better text detection is achieved on our enhanced images through improved overall image quality and the preservation of fine details within text regions.
If you wish to cite the ICPR 2022 Extremely Low-Light Image Enhancement with Scene Text Restoration paper:
@inproceedings{icpr2022_ellie,
  author={Hsu, Po-Hao and Lin, Che-Tsung and Ng, Chun Chet and Kew, Jie Long and Tan, Mei Yih and Lai, Shang-Hong and Chan, Chee Seng and Zach, Christopher},
  booktitle={2022 26th International Conference on Pattern Recognition (ICPR)},
  title={Extremely Low-Light Image Enhancement with Scene Text Restoration},
  year={2022},
  pages={317-323}}
If you wish to cite the latest version of the Text in the Dark dataset and our proposed ELITE framework:
@article{LIN2025117222,
  title = {Text in the dark: Extremely low-light text image enhancement},
  journal = {Signal Processing: Image Communication},
  volume = {130},
  pages = {117222},
  year = {2025},
  issn = {0923-5965},
  doi = {https://doi.org/10.1016/j.image.2024.117222},
  url = {https://www.sciencedirect.com/science/article/pii/S0923596524001231},
  author = {Che-Tsung Lin and Chun Chet Ng and Zhi Qin Tan and Wan Jun Nah and Xinyu Wang and Jie Long Kew and Pohao Hsu and Shang Hong Lai and Chee Seng Chan and Christopher Zach}}