What is it about?
Coloring black-and-white videos is challenging because a scene can often be plausibly colored in many different ways. One approach for image colorization is to guide the model with text captions; however, writing captions by hand does not scale to video. Our work, called RAGCol, addresses this with retrieval-augmented generation (RAG): it automatically generates text captions for the video, enriches them with external knowledge, and uses the result to ground the colorization in real-world facts. We test the method on a range of videos, where it outperforms the previous best method.
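The flow can be sketched in a few lines. This is an illustrative toy pipeline, not the RAGCol implementation: the captioner, retriever, and colorizer below are hypothetical stubs standing in for the real models.

```python
# Toy sketch of a RAG-style colorization pipeline (hypothetical stubs, not the
# actual RAGCol models).

def generate_caption(frame):
    # Stand-in for an automatic captioning model run on the grayscale frame.
    return "a steam locomotive at a station"

def retrieve_knowledge(caption, knowledge_base):
    # Retrieve external facts whose keywords appear in the caption.
    return [fact for key, fact in knowledge_base.items() if key in caption]

def colorize(frame, caption, facts):
    # Stand-in for the colorizer, conditioned on the knowledge-enriched caption.
    prompt = "; ".join([caption] + facts)
    return {"frame": frame, "conditioning": prompt}

knowledge_base = {"locomotive": "steam locomotives of this era were typically black"}

frame = "grayscale_frame_0"
caption = generate_caption(frame)
facts = retrieve_knowledge(caption, knowledge_base)
result = colorize(frame, caption, facts)
print(result["conditioning"])
```

In the real system each stub would be a learned model; the point is the flow: caption the frame, enrich the caption with retrieved knowledge, and condition the colorization on the result.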
Why is it important?
Colorization allows users to feel more connected to the past, but only if done correctly. Current colorizers mostly rely on neural networks and are prone to inaccurate, implausible colorizations. This work mitigates that issue by grounding the colorization in external knowledge. Beyond colorization, the approach has broader potential in other domains to make artificial intelligence more accurate, trustworthy and robust.
Perspectives
As someone deeply interested in history, this work excites me because it offers a new and improved method for restoring archival material. Enhancing the quality and accuracy of historical video colorization will enable wider dissemination and, in turn, a stronger connection between people and their culture and history. This is particularly relevant for material from a time that may not receive as much attention as it deserves.
Rory Ward
National University of Ireland - Galway
Read the Original
This page is a summary of: RAGCol: RAG-Based Automatic Video Colorization Through Text Caption Generation and Knowledge Enrichment, March 2025, ACM (Association for Computing Machinery),
DOI: 10.1145/3672608.3707748.
Resources
FRCol: Face Recognition Based Speaker Video Colorization
Automatic video colorization has recently gained attention for its ability to adapt old movies for today’s modern entertainment industry. However, there is a significant challenge: limiting unnatural color hallucination. Generative artificial intelligence often produces erroneous results, which in colorization manifest as unnatural colorizations. In this work, we propose to ground our automatic video colorization system in relevant exemplars retrieved from a face database using facial recognition technology. The retrieved exemplar guides the colorization of the latent-diffusion-based speaker video colorizer. We dub our system FRCol. We focus on speakers because humans have evolved to pay particular attention to certain regions of a scene, faces chief among them. We improve the previous state-of-the-art (SOTA) DeOldify by an average of 13% on the standard metrics of PSNR, SSIM, FID, and FVD on the Grid and Lombard Grid datasets. Our user study also consolidates these results: FRCol was preferred to contemporary colorizers 81% of the time.
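The retrieval step described above can be sketched as a nearest-neighbour search over face embeddings. The embeddings, database, and file names below are made-up placeholders for illustration, not FRCol's actual models or data.

```python
# Hypothetical sketch of exemplar retrieval via face recognition: the face in
# the black-and-white frame is matched against a database of color exemplars by
# cosine similarity of face embeddings, and the best match guides colorization.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_exemplar(query_embedding, database):
    # database maps speaker_id -> (face embedding, color exemplar image path)
    best_id = max(database, key=lambda k: cosine(query_embedding, database[k][0]))
    return database[best_id][1]

db = {
    "speaker_a": ([1.0, 0.1, 0.0], "exemplar_a.png"),
    "speaker_b": ([0.0, 0.9, 0.4], "exemplar_b.png"),
}
print(retrieve_exemplar([0.95, 0.2, 0.05], db))  # closest to speaker_a
```

In practice the embeddings would come from a trained face-recognition network; the retrieved color exemplar then conditions the diffusion-based colorizer.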
ControlCol: Controllability in Automatic Speaker Video Colorization
Adding color to black-and-white speaker videos automatically is a highly desirable technique. It is an artistic process that requires interactivity with humans for the best results. Many existing automatic video colorization systems provide little opportunity for the user to guide the colorization process. In this work, we introduce a novel automatic speaker video colorization system which provides controllability to the user while also maintaining high colorization quality relative to state-of-the-art techniques. We name this system ControlCol. ControlCol performs 3.5% better than the previous state-of-the-art DeOldify on the Grid and Lombard Grid datasets when PSNR, SSIM, FID and FVD are used as metrics. This result is also supported by our human evaluation, where in a head-to-head comparison, ControlCol is preferred 90% of the time to DeOldify.
LatentColorization: Latent Diffusion-Based Speaker Video Colorization
While current research predominantly focuses on image-based colorization, the domain of video-based colorization remains relatively unexplored. Many existing video colorization techniques operate frame-by-frame, often overlooking the critical aspect of temporal coherence between successive frames. This approach can result in inconsistencies across frames, leading to undesirable effects like flickering or abrupt color transitions between frames. To address these challenges, we combine the generative capabilities of a fine-tuned latent diffusion model with an autoregressive conditioning mechanism to ensure temporal consistency in automatic speaker video colorization. We demonstrate strong improvements on established quality metrics compared to existing methods, namely, PSNR, SSIM, FID, FVD, NIQE and BRISQUE. Specifically, we achieve an 18% improvement in performance when FVD is employed as the evaluation metric. Furthermore, we performed a subjective study, where users preferred LatentColorization to the existing state-of-the-art DeOldify 80% of the time. Our dataset combines conventional datasets and videos from television/movies. A short demonstration of our results can be seen in some example videos available at https://youtu.be/vDbzsZdFuxM.
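The autoregressive conditioning idea can be illustrated with a toy loop: each frame is colorized conditioned on the previously colorized frame, so color decisions propagate forward in time. The stub below is a hypothetical stand-in for the latent diffusion colorizer, not the actual model.

```python
# Sketch of autoregressive conditioning for temporal consistency (hypothetical
# stub in place of the latent diffusion colorizer).

def colorize_frame(gray_frame, previous_color_frame):
    # Stand-in: a real model would generate colors conditioned on the previous
    # colorized frame, keeping colors consistent across time.
    return f"color({gray_frame} | prev={previous_color_frame})"

def colorize_video(gray_frames):
    colored, prev = [], None
    for frame in gray_frames:
        prev = colorize_frame(frame, prev)  # condition on the last output
        colored.append(prev)
    return colored

print(colorize_video(["f0", "f1"]))
```

Because each output depends on the one before it, abrupt color changes between neighbouring frames (flicker) are discouraged by construction.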
Knowledge-Guided Colorization: Overview, Prospects and Challenges
Automatic image colorization is notorious for being an ill-posed problem, i.e., multiple plausible colorizations exist for any given black-and-white image. Current approaches to this task revolve around deep neural network-based systems, which do not incorporate knowledge into their colorizations. We present Knowledge-Guided Colorization as a possible solution to the above-mentioned problems. Knowledge-Guided Colorization combines a deep learning-based colorization system and a knowledge graph to inform its colorizations. This is the first time these two techniques have been combined for colorization. The prospects of knowledge-guided colorization are promising, with various potential application scenarios. However, several associated challenges are also highlighted in this research.
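A knowledge graph can be consulted as a set of (subject, relation, object) triples that constrain the colorizer's palette. The triples and helper below are made-up placeholders sketching the idea, not the paper's actual graph or system.

```python
# Minimal sketch of consulting a knowledge graph to constrain colorization
# (hypothetical triples and helper names).
triples = [
    ("grass", "has_typical_color", "green"),
    ("sky", "has_typical_color", "blue"),
    ("taxi_nyc", "has_typical_color", "yellow"),
]

def typical_color(entity, graph):
    # Look up the entity's typical color; return None if the graph is silent,
    # in which case the neural colorizer's own prediction would be used.
    for s, r, o in graph:
        if s == entity and r == "has_typical_color":
            return o
    return None

for entity in ["grass", "sky", "car"]:
    print(entity, "->", typical_color(entity, triples))
```

The design point is the fallback: the graph supplies hard facts where it has them, and the learned model fills in everything else.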
Towards Temporal Stability in Automatic Video Colourisation
Much research has been carried out into the automatic restoration of archival images. This research ranges from colourisation, to damage restoration, and super-resolution. Conversely, video restoration has remained largely unexplored. Most efforts to date have involved extending a concept from image restoration to video, in a frame-by-frame manner. These methods result in poor temporal consistency between frames, which manifests itself as temporal instability or flicker. The purpose of this work is to improve upon this limitation by employing a hybrid approach of deep-learning and exemplar-based colourisation, informing the current frame's colourisation with its neighbouring frames' colourisations and thereby alleviating inter-frame discrepancies. This paper has two main contributions. Firstly, a novel end-to-end automatic video colourisation technique with enhanced flicker reduction capabilities is proposed. Secondly, six automatic exemplar acquisition algorithms are compared. The combination of these algorithms and techniques allows for an 8.5% increase in non-referenced image quality over the previous state of the art.