Image Captioning with External Knowledge

Author: Sofia Nikiforova
LOT Number: 641
ISBN: 978-94-6093-426-1
Pages: 147
Year: 2023
1st promotor: Prof. dr. Y.S. Vinter Seggev
€31.00
Download this book as a free Open Access full-text PDF

In modern automatic image captioning, generating straightforward visual descriptions of images is largely a solved problem. One of the biggest remaining challenges is incorporating information that cannot be inferred from the image alone: its context and related real-world knowledge. In this dissertation, we tackle this challenge by developing a new method of enriching an otherwise standard captioning pipeline with contextually relevant image-external knowledge.

Our method starts by identifying the subset of data from external sources that is relevant to a given image. The retrieved data is integrated into the caption generation process, aiming to influence the resulting caption and extend it beyond a purely visual description. Based on this general method, we develop three neural image captioning models. The first model addresses the specific problem of generating references to the geographic context of the image. The second model expands to broad encyclopedic knowledge about the depicted geographic entities. Finally, the third model generalizes beyond the geographic domain and applies our method to diverse images from newspaper articles. The evaluation of the models shows that our method is effective for producing contextualized and informative captions with factually accurate references to relevant external knowledge.
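As a rough illustration of the retrieve-then-integrate pipeline the abstract describes, the sketch below selects external knowledge items by geographic proximity to an image's geotag and splices the best match into a caption. This is not the dissertation's actual implementation: the class and function names, the distance-based relevance filter, and the template-based generation stub (standing in for a neural decoder conditioned on the retrieved items) are all hypothetical simplifications.

```python
import math
from dataclasses import dataclass

@dataclass
class KnowledgeItem:
    name: str     # e.g. a geographic entity from an external source
    facts: str    # encyclopedic text attached to the entity
    lat: float
    lon: float

def retrieve_relevant(items, img_lat, img_lon, radius_km=1.0):
    """Step 1: identify the subset of external data relevant to the image.
    Relevance is approximated here by proximity to the image's geotag."""
    def approx_km(a_lat, a_lon, b_lat, b_lon):
        # Crude equirectangular distance; adequate at city scale.
        dx = (a_lon - b_lon) * 111.32 * math.cos(math.radians((a_lat + b_lat) / 2))
        dy = (a_lat - b_lat) * 111.32
        return math.hypot(dx, dy)
    return [it for it in items
            if approx_km(it.lat, it.lon, img_lat, img_lon) <= radius_km]

def generate_caption(visual_description, knowledge):
    """Step 2: integrate retrieved knowledge into generation. A real model
    would condition a neural decoder on these items; this stub just
    splices the first retrieved entity into a template caption."""
    if not knowledge:
        return visual_description
    top = knowledge[0]
    return f"{visual_description} near the {top.name}. {top.facts}"

# Hypothetical usage with a one-item knowledge base:
kb = [KnowledgeItem("Westerkerk", "The church was completed in 1631.",
                    52.3744, 4.8840)]
print(generate_caption("A canal boat passing a church",
                       retrieve_relevant(kb, 52.3743, 4.8852)))
```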
