Academic literature on the topic 'Explainable Image Captioning (XIC)'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Explainable Image Captioning (XIC).'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Journal articles on the topic "Explainable Image Captioning (XIC)":

1

Han, Seung-Ho, Min-Su Kwon, and Ho-Jin Choi. "EXplainable AI (XAI) approach to image captioning." Journal of Engineering 2020, no. 13 (July 1, 2020): 589–94. http://dx.doi.org/10.1049/joe.2019.1217.

2

Fei, Zhengcong, Mingyuan Fan, Li Zhu, Junshi Huang, Xiaoming Wei, and Xiaolin Wei. "Uncertainty-Aware Image Captioning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 1 (June 26, 2023): 614–22. http://dx.doi.org/10.1609/aaai.v37i1.25137.

Abstract:
It is widely believed that the higher the uncertainty of a word in the caption, the more inter-correlated context information is required to determine it. However, current image captioning methods usually consider the generation of all words in a sentence sequentially and equally. In this paper, we propose an uncertainty-aware image captioning framework, which iteratively inserts discontinuous candidate words in parallel between existing words, proceeding from easy to difficult until convergence. We hypothesize that high-uncertainty words in a sentence need more prior information to make a correct decision and should be produced at a later stage. The resulting non-autoregressive hierarchy makes the caption generation explainable and intuitive. Specifically, we utilize an image-conditioned bag-of-word model to measure the word uncertainty and apply a dynamic programming algorithm to construct the training pairs. During inference, we devise an uncertainty-adaptive parallel beam search technique that yields an empirically logarithmic time complexity. Extensive experiments on the MS COCO benchmark reveal that our approach outperforms the strong baseline and related methods in both captioning quality and decoding speed.
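To make the insertion ordering concrete: a minimal sketch, assuming a toy image-conditioned bag-of-words model that assigns each candidate caption word a small predictive distribution (all values below are invented for illustration and are not the authors' implementation), could rank words by entropy and insert the least uncertain ones first.

```python
import math

def entropy(probs):
    """Shannon entropy of a word's predictive distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical image-conditioned bag-of-words outputs: for each candidate caption
# word, a two-way distribution over "appears in this image" vs. "does not".
word_distributions = {
    "a":        [0.97, 0.03],
    "dog":      [0.90, 0.10],
    "runs":     [0.60, 0.40],
    "joyfully": [0.52, 0.48],
}

# Schedule insertion from easy (low uncertainty) to difficult (high uncertainty),
# mirroring the easy-to-difficult iterative insertion described in the abstract.
schedule = sorted(word_distributions, key=lambda w: entropy(word_distributions[w]))
for stage, word in enumerate(schedule, start=1):
    u = entropy(word_distributions[word])
    print(f"stage {stage}: insert '{word}' (uncertainty = {u:.3f})")
```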
3

Liu, Haixia, and Tim Brailsford. "Reproducing “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”." Journal of Physics: Conference Series 2589, no. 1 (September 1, 2023): 012012. http://dx.doi.org/10.1088/1742-6596/2589/1/012012.

Abstract:
This paper replicates the experiment presented in the work of Xu et al. [1] and examines errors in the generated captions. The analysis of the identified errors aims to provide deeper insight into their underlying causes. The study also encompasses subsequent experiments investigating the feasibility of rectifying these errors via a post-processing stage. Image recognition and object detection models, as well as a computational language probability model, were explored. The findings presented in this paper aim to contribute towards the overarching objective of Explainable Artificial Intelligence (XAI), thereby providing potential pathways to improve image captioning.
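One way to picture the post-processing stage mentioned here is a consistency check between caption content and detector output. The toy sketch below (the caption nouns, detected labels, and `flag_unsupported_nouns` helper are all hypothetical, not the authors' code) flags caption nouns that no detection supports.

```python
# Hypothetical detector output and caption nouns; a real pipeline would take these
# from an object detection model and a parser run on the generated caption.
detected_labels = {"person", "surfboard", "wave"}
caption_nouns = ["man", "surfboard", "dog"]
synonyms = {"man": "person", "woman": "person"}   # minimal lexical normalisation

def flag_unsupported_nouns(nouns, labels, synonyms):
    """Return caption nouns with no matching detection, as candidates for correction."""
    return [n for n in nouns if synonyms.get(n, n) not in labels]

print(flag_unsupported_nouns(caption_nouns, detected_labels, synonyms))  # ['dog']
```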
4

Biswas, Rajarshi, Michael Barz, and Daniel Sonntag. "Towards Explanatory Interactive Image Captioning Using Top-Down and Bottom-Up Features, Beam Search and Re-ranking." KI - Künstliche Intelligenz 34, no. 4 (July 8, 2020): 571–84. http://dx.doi.org/10.1007/s13218-020-00679-2.

Abstract:
Image captioning is a challenging multimodal task. Significant improvements have been obtained with deep learning. Yet, captions generated by humans are still considered better, which makes it an interesting application for interactive machine learning and explainable artificial intelligence methods. In this work, we aim to improve the performance and explainability of the state-of-the-art method Show, Attend and Tell by augmenting its attention mechanism using additional bottom-up features. We compute visual attention on the joint embedding space formed by the union of high-level features and the low-level features obtained from the object-specific salient regions of the input image. We embed the content of bounding boxes from a pre-trained Mask R-CNN model. This delivers state-of-the-art performance while providing explanatory features. Further, we discuss how interactive model improvement can be realized through re-ranking caption candidates using beam search decoders and explanatory features. We show that interactive re-ranking of beam search candidates has the potential to outperform the state of the art in image captioning.
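The re-ranking step can be read as rescoring beam candidates by how well they agree with explanatory features such as detected object labels. A rough sketch under that reading (the candidate captions, log-probabilities, overlap heuristic, and mixing weight are illustrative assumptions, not the paper's method):

```python
# Illustrative re-ranking of beam search candidates against explanatory features
# (here: object labels in the style of a Mask R-CNN detector). All values are toy inputs.
candidates = [
    ("a man riding a horse", -2.1),            # (caption, decoder log-probability)
    ("a person riding a brown horse", -2.4),
    ("a man riding a bike", -2.0),
]
detected_labels = {"person", "horse"}
normalise = {"man": "person", "woman": "person"}   # minimal lexical normalisation

def explanatory_score(caption, labels):
    """Fraction of detected labels that the caption mentions (a simple agreement proxy)."""
    words = {normalise.get(w, w) for w in caption.split()}
    return sum(label in words for label in labels) / len(labels)

# Combine decoder confidence with the explanatory score; the weight 2.0 is arbitrary.
reranked = sorted(candidates,
                  key=lambda c: c[1] + 2.0 * explanatory_score(c[0], detected_labels),
                  reverse=True)
print(reranked[0][0])   # the candidate best supported by the detections
```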
5

Ghosh, Swarnendu, Teresa Gonçalves, and Nibaran Das. "Im2Graph: A Weakly Supervised Approach for Generating Holistic Scene Graphs from Regional Dependencies." Future Internet 15, no. 2 (February 10, 2023): 70. http://dx.doi.org/10.3390/fi15020070.

Abstract:
Conceptual representations of images involving descriptions of entities and their relations are often represented using scene graphs. Such scene graphs can express relational concepts by using sets of triplets ⟨subject—predicate—object⟩. Instead of building dedicated models for scene graph generation, our model tends to extract the latent relational information implicitly encoded in image captioning models. We explored dependency parsing to build grammatically sound parse trees from captions. We used detection algorithms for the region propositions to generate dense region-based concept graphs. These were optimally combined using the approximate sub-graph isomorphism to create holistic concept graphs for images. The major advantages of this approach are threefold. Firstly, the proposed graph generation module is completely rule-based and, hence, adheres to the principles of explainable artificial intelligence. Secondly, graph generation can be used as plug-and-play along with any region proposition and caption generation framework. Finally, our results showed that we could generate rich concept graphs without explicit graph-based supervision.
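The caption-to-graph step rests on dependency parsing of captions into ⟨subject, predicate, object⟩ triplets. A minimal sketch with spaCy (a stand-in parser; the extraction rule below is deliberately simplified and is not the paper's pipeline):

```python
import spacy

# Requires a small English model, e.g.: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def caption_to_triplets(caption):
    """Extract simple <subject, predicate, object> triplets from a caption's parse tree."""
    doc = nlp(caption)
    triplets = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            for subj in subjects:
                for obj in objects:
                    triplets.append((subj.text, token.lemma_, obj.text))
    return triplets

print(caption_to_triplets("A man rides a horse on the beach"))
# Expected, roughly: [('man', 'ride', 'horse')]
```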
6

Naresh, Naresh, Gunikhan .., and V. Balaji. "AMR-XAI-DWT: Age-Related Macular Regenerated Classification using X-AI with Dual Tree CWT." Fusion: Practice and Applications 15, no. 2 (2024): 17–35. http://dx.doi.org/10.54216/fpa.150202.

Abstract:
Age-related macular degeneration (AMD) is the leading cause of permanent vision loss, and drusen is an early clinical sign in the progression of AMD. Early detection is key since that's when treatment is most effective. The eyes of someone with AMD need to be checked often. Ophthalmologists may detect illness by looking at a color picture of the fundus taken using a fundus camera. Ophthalmologists need a system to help them diagnose illness since the global elderly population is growing rapidly and there are not enough specialists to go around. Since drusen vary in size, form, degree of convergence, and texture, it is challenging to detect and locate them in a color retinal picture. Therefore, it is difficult to develop a Modified Continual Learning (MCL) classifier for identifying drusen. To begin, we use X-AI (Explainable Artificial Intelligence) in tandem with one of the Dual Tree Complex Wavelet Transform models to create captions summarizing the symptoms of the retinal pictures throughout all of the different stages of diabetic retinopathy. An Adaptive Neuro Fuzzy Inference System (ANFIS) is constructed using all nine of the pre-trained modules. The nine image caption models are evaluated using a variety of metrics to determine their relative strengths and weaknesses. After compiling the data and comparing it to many existing models, the best photo captioning model is selected. A graphical user interface was also made available for rapid analysis and data screening in bulk. The results demonstrated the system's potential to aid ophthalmologists in the early detection of ARMD symptoms and the severity level in a shorter amount of time.
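The model-selection step (scoring several caption models and keeping the best) can be illustrated with a sentence-level BLEU comparison; the model names, captions, and single metric below are invented stand-ins for the multi-metric evaluation the paper describes:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Invented reference caption and candidate outputs from hypothetical captioning modules;
# the real system compares nine pre-trained models with several metrics, not BLEU alone.
reference = [["drusen", "deposits", "visible", "near", "the", "macula"]]
candidates = {
    "model_a": ["drusen", "visible", "near", "the", "macula"],
    "model_b": ["healthy", "retina", "with", "no", "lesions"],
    "model_c": ["drusen", "deposits", "near", "macula"],
}

smooth = SmoothingFunction().method1
scores = {name: sentence_bleu(reference, cand, smoothing_function=smooth)
          for name, cand in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))   # the best-scoring captioning model
```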
7

Yong, Gunwoo, Meiyin Liu, and SangHyun Lee. "Explainable Image Captioning to Identify Ergonomic Problems and Solutions for Construction Workers." Journal of Computing in Civil Engineering 38, no. 4 (July 2024). http://dx.doi.org/10.1061/jccee5.cpeng-5744.

8

Pan, Yingwei, Yehao Li, Ting Yao, and Tao Mei. "Bottom-up and Top-down Object Inference Networks for Image Captioning." ACM Transactions on Multimedia Computing, Communications, and Applications, January 19, 2023. http://dx.doi.org/10.1145/3580366.

Abstract:
The bottom-up and top-down attention mechanism has revolutionized image captioning techniques by enabling object-level attention for multi-step reasoning over all the detected objects. However, when humans describe an image, they often apply their own subjective experience to focus on only a few salient objects that are worthy of mention, rather than all objects in the image. The focused objects are further allocated in linguistic order, yielding the “object sequence of interest” to compose an enriched description. In this work, we present Bottom-up and Top-down Object inference Networks (BTO-Net), which novelly exploits the object sequence of interest as top-down signals to guide image captioning. Technically, conditioned on the bottom-up signals (all detected objects), an LSTM-based object inference module is first learnt to produce the object sequence of interest, which acts as the top-down prior to mimic the subjective experience of humans. Next, both the bottom-up and top-down signals are dynamically integrated via an attention mechanism for sentence generation. Furthermore, to prevent the cacophony of intermixed cross-modal signals, a contrastive learning-based objective is used to restrict the interaction between bottom-up and top-down signals, which leads to reliable and explainable cross-modal reasoning. Our BTO-Net obtains competitive performance on the COCO benchmark, in particular 134.1% CIDEr on the COCO Karpathy test split. Source code is available at https://github.com/YehLi/BTO-Net.
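The dynamic integration of bottom-up signals (all detected objects) and top-down signals (the object sequence of interest) via attention can be sketched in a few lines; the dimensions, the single-query attention, and the fixed 0.5/0.5 fusion below are simplifications and do not reproduce the BTO-Net architecture:

```python
import torch
import torch.nn.functional as F

# Toy dimensions; the real BTO-Net configuration will differ.
d_model, n_bottom_up, n_top_down = 16, 5, 3

bottom_up = torch.randn(n_bottom_up, d_model)   # features of all detected objects
top_down = torch.randn(n_top_down, d_model)     # features of the "object sequence of interest"
query = torch.randn(1, d_model)                 # current decoder hidden state (one step)

def attend(query, keys):
    """Scaled dot-product attention of a single query over a set of key/value features."""
    scores = query @ keys.t() / keys.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ keys

# Attend separately over the two signal streams, then fuse them before sentence
# generation; the fixed 0.5/0.5 mix stands in for a learned gating mechanism.
context = 0.5 * attend(query, bottom_up) + 0.5 * attend(query, top_down)
print(context.shape)   # torch.Size([1, 16])
```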
9

Ilinykh, Nikolai, and Simon Dobnik. "What Does a Language-And-Vision Transformer See: The Impact of Semantic Information on Visual Representations." Frontiers in Artificial Intelligence 4 (December 3, 2021). http://dx.doi.org/10.3389/frai.2021.767971.

Abstract:
Neural networks have proven to be very successful in automatically capturing the composition of language and different structures across a range of multi-modal tasks. Thus, an important question to investigate is how neural networks learn and organise such structures. Numerous studies have examined the knowledge captured by language models (LSTMs, transformers) and vision architectures (CNNs, vision transformers) for respective uni-modal tasks. However, very few have explored what structures are acquired by multi-modal transformers where linguistic and visual features are combined. It is critical to understand the representations learned by each modality, their respective interplay, and the task’s effect on these representations in large-scale architectures. In this paper, we take a multi-modal transformer trained for image captioning and examine the structure of the self-attention patterns extracted from the visual stream. Our results indicate that the information about different relations between objects in the visual stream is hierarchical and varies from local to a global object-level understanding of the image. In particular, while visual representations in the first layers encode the knowledge of relations between semantically similar object detections, often constituting neighbouring objects, deeper layers expand their attention across more distant objects and learn global relations between them. We also show that globally attended objects in deeper layers can be linked with entities described in image descriptions, indicating a critical finding: the indirect effect of language on visual representations. In addition, we highlight how object-based input representations affect the structure of learned visual knowledge and guide the model towards more accurate image descriptions. A parallel question that we investigate is whether the insights from cognitive science echo the structure of representations that the current neural architecture learns. The proposed analysis of the inner workings of multi-modal transformers can be used to better understand and improve on such problems as pre-training of large-scale multi-modal architectures, multi-modal information fusion and probing of attention weights. In general, we contribute to explainable multi-modal natural language processing and to the currently shallow understanding of how the input representations and the structure of the multi-modal transformer affect visual representations.
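The local-to-global progression reported here can be approximated by measuring how spatially far apart the attended regions are at each layer. A hedged sketch over randomly generated attention maps (the tensor shapes, the 6x6 grid layout, and the distance measure are assumptions, not the paper's protocol):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in self-attention maps from the visual stream: (layers, heads, regions, regions).
# Real maps would come from a trained language-and-vision transformer.
n_layers, n_heads, n_regions = 6, 4, 36
attn = rng.random((n_layers, n_heads, n_regions, n_regions))
attn /= attn.sum(axis=-1, keepdims=True)          # normalise each attention row

# Place the 36 regions on a 6x6 grid and precompute pairwise spatial distances.
coords = np.stack(np.divmod(np.arange(n_regions), 6), axis=1).astype(float)
dists = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)

# Mean attention-weighted distance per layer: larger values mean attention reaches
# further across the image, i.e. more "global" relations between objects.
per_layer = (attn * dists).sum(axis=-1).mean(axis=(1, 2))
for layer, d in enumerate(per_layer):
    print(f"layer {layer}: mean attended distance = {d:.2f}")
```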

Dissertations / Theses on the topic "Explainable Image Captioning (XIC)":

1

Elguendouze, Sofiane. "Explainable Artificial Intelligence approaches for Image Captioning." Electronic Thesis or Diss., Orléans, 2024. http://www.theses.fr/2024ORLE1003.

Abstract:
The rapid advancement of image captioning models, driven by the integration of deep learning techniques that combine image and text modalities, has resulted in increasingly complex systems. However, these models often operate as black boxes, lacking the ability to provide transparent explanations for their decisions. This thesis addresses the explainability of image captioning systems based on Encoder-Attention-Decoder architectures, through four aspects. First, it explores the concept of the latent space, marking a departure from traditional approaches relying on the original representation space. Second, it introduces the notion of decisiveness, leading to the formulation of a new definition for the concept of component influence/decisiveness in the context of explainable image captioning, as well as a perturbation-based approach to capturing decisiveness. The third aspect aims to elucidate the factors influencing explanation quality, in particular the scope of explanation methods. Accordingly, latent-based variants of well-established explanation methods such as LRP and LIME have been developed, along with the introduction of a latent-centered evaluation approach called Latent Ablation. The fourth aspect of this work involves investigating what we call saliency and the representation of certain visual concepts, such as object quantity, at different levels of the captioning architecture.
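The perturbation-based capture of decisiveness described in the thesis can be pictured as injecting noise into individual latent components and measuring how much a caption-level score changes. The sketch below is a rough reading of that idea with placeholder encoder output and a stub scoring function; it is not the thesis implementation (the latent shapes, noise scale, and `caption_score` surrogate are all assumptions).

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder latent representation of an image (e.g. encoder output before attention).
latent = rng.normal(size=(8, 16))          # 8 latent components, 16-dimensional each

def caption_score(latent):
    """Stub for decoder confidence in its caption given latent features (toy surrogate)."""
    return float(np.tanh(latent.sum()))

baseline = caption_score(latent)

# Decisiveness of each latent component: score change when that component is perturbed.
decisiveness = []
for i in range(latent.shape[0]):
    perturbed = latent.copy()
    perturbed[i] += rng.normal(scale=1.0, size=latent.shape[1])   # Gaussian perturbation
    decisiveness.append(abs(caption_score(perturbed) - baseline))

most_decisive = int(np.argmax(decisiveness))
print(f"most decisive latent component: {most_decisive}")
```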

Book chapters on the topic "Explainable Image Captioning (XIC)":

1

Beddiar, Romaissa, and Mourad Oussalah. "Explainability in medical image captioning." In Explainable Deep Learning AI, 239–61. Elsevier, 2023. http://dx.doi.org/10.1016/b978-0-32-396098-4.00018-1.


Conference papers on the topic "Explainable Image Captioning (XIC)":

1

Tseng, Ching-Shan, Ying-Jia Lin, and Hung-Yu Kao. "Relation-Aware Image Captioning for Explainable Visual Question Answering." In 2022 International Conference on Technologies and Applications of Artificial Intelligence (TAAI). IEEE, 2022. http://dx.doi.org/10.1109/taai57707.2022.00035.

2

Elguendouze, Sofiane, Marcilio C. P. de Souto, Adel Hafiane, and Anais Halftermeyer. "Towards Explainable Deep Learning for Image Captioning through Representation Space Perturbation." In 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, 2022. http://dx.doi.org/10.1109/ijcnn55064.2022.9892275.

