A selection of scholarly literature on the topic "Image Captioning (IC)"

Author: Grafiati

Published: July 7, 2024

Browse lists of current articles, books, dissertations, conference abstracts, and other scholarly sources on the topic "Image Captioning (IC)".

Contents

Journal articles
Dissertations
Conference papers

Journal articles on the topic "Image Captioning (IC)":

1

Li, Jingyu, Zhendong Mao, Hao Li, Weidong Chen, and Yongdong Zhang. "Exploring Visual Relationships via Transformer-based Graphs for Enhanced Image Captioning." ACM Transactions on Multimedia Computing, Communications, and Applications, December 25, 2023. http://dx.doi.org/10.1145/3638558.

Abstract:

Image captioning (IC), bringing vision to language, has drawn extensive attention. A crucial aspect of IC is the accurate depiction of visual relations among image objects. Visual relations encompass two primary facets: content relations and structural relations. Content relations, which comprise geometric position content (i.e., distances and sizes) and semantic interaction content (i.e., actions and possessives), unveil the mutual correlations between objects. In contrast, structural relations pertain to the topological connectivity of object regions. Existing Transformer-based methods typically resort to geometric positions to enhance the visual relations, yet using only the shallow geometric content cannot precisely cover actional content correlations and structural connection relations. In this paper, we adopt a comprehensive perspective to examine the correlations between objects, incorporating both content relations (i.e., geometric and semantic relations) and structural relations, with the aim of generating plausible captions. To achieve this, firstly, we construct a geometric graph from bounding box features and a semantic graph from the scene graph parser to model the content relations. Innovatively, we construct a topology graph that amalgamates the sparsity characteristics of the geometric and semantic graphs, enabling the representation of image structural relations. Secondly, we propose a novel unified approach to enrich image relation representations by integrating semantic, geometric, and structural relations into self-attention. Finally, in the language decoding stage, we further leverage the semantic relation as prior knowledge to generate accurate words. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of our model, with improvements of CIDEr from 128.6% to 136.6%. The code has been released at https://github.com/CrossmodalGroup/ER-SAN/tree/main/VG-Cap.

2

Yu, Mengying, and Aixin Sun. "Dataset versus reality: Understanding model performance from the perspective of information need." Journal of the Association for Information Science and Technology, August 18, 2023. http://dx.doi.org/10.1002/asi.24825.

Abstract:

Deep learning technologies have brought us many models that outperform human beings on a few benchmarks. An interesting question is: can these models well solve real‐world problems with similar settings (e.g., identical input/output) to the benchmark datasets? We argue that a model is trained to answer the same information need in a similar context (e.g., the information available), for which the training dataset is created. The trained model may be used to solve real‐world problems for a similar information need in a similar context. However, information need is independent of the format of dataset input/output. Although some datasets may share high structural similarities, they may represent different research tasks aiming for answering different information needs. Examples are question–answer pairs for the question answering (QA) task, and image‐caption pairs for the image captioning (IC) task. In this paper, we use the QA task and IC task as two case studies and compare their widely used benchmark datasets. From the perspective of information need in the context of information retrieval, we show the differences in the dataset creation processes and the differences in morphosyntactic properties between datasets. The differences in these datasets can be attributed to the different information needs and contexts of the specific research tasks. We encourage all researchers to consider the information need perspective of a research task when selecting the appropriate datasets to train a model. Likewise, while creating a dataset, researchers may also incorporate the information need perspective as a factor to determine the degree to which the dataset accurately reflects the real‐world problem or the research task they intend to tackle.

Dissertations on the topic "Image Captioning (IC)":

1

Elguendouze, Sofiane. "Explainable Artificial Intelligence approaches for Image Captioning." Electronic Thesis or Diss., Orléans, 2024. http://www.theses.fr/2024ORLE1003.

Abstract:

L'évolution rapide des modèles de sous-titrage d'images, impulsée par l'intégration de techniques d'apprentissage profond combinant les modalités image et texte, a conduit à des systèmes de plus en plus complexes. Cependant, ces modèles fonctionnent souvent comme des boîtes noires, incapables de fournir des explications transparentes de leurs décisions. Cette thèse aborde l'explicabilité des systèmes de sous-titrage d'images basés sur des architectures Encodeur-Attention-Décodeur, et ce à travers quatre aspects. Premièrement, elle explore le concept d'espace latent, s'éloignant ainsi des approches traditionnelles basées sur l'espace de représentation originel. Deuxièmement, elle présente la notion de caractère décisif, conduisant à la formulation d'une nouvelle définition pour le concept d'influence/décisivité des composants dans le contexte de sous-titrage d'images explicable, ainsi qu'une approche par perturbation pour la capture du caractère décisif. Le troisième aspect vise à élucider les facteurs influençant la qualité des explications, en mettant l'accent sur la portée des méthodes d'explication. En conséquence, des variantes basées sur l'espace latent de méthodes d'explication bien établies telles que LRP et LIME ont été développées, ainsi que la proposition d'une approche d'évaluation centrée sur l'espace latent, connue sous le nom d'Ablation Latente. Le quatrième aspect de ce travail consiste à examiner ce que nous appelons la saillance et la représentation de certains concepts visuels, tels que la quantité d'objets, à différents niveaux de l'architecture de sous-titrage
The rapid advancement of image captioning models, driven by the integration of deep learning techniques that combine image and text modalities, has resulted in increasingly complex systems. However, these models often operate as black boxes, lacking the ability to provide transparent explanations for their decisions. This thesis addresses the explainability of image captioning systems based on Encoder-Attention-Decoder architectures, through four aspects. First, it explores the concept of the latent space, marking a departure from traditional approaches relying on the original representation space. Second, it introduces the notion of decisiveness, leading to the formulation of a new definition for the concept of component influence/decisiveness in the context of explainable image captioning, as well as a perturbation-based approach to capturing decisiveness. The third aspect aims to elucidate the factors influencing explanation quality, in particular the scope of explanation methods. Accordingly, latent-based variants of well-established explanation methods such as LRP and LIME have been developed, along with the introduction of a latent-centered evaluation approach called Latent Ablation. The fourth aspect of this work involves investigating what we call saliency and the representation of certain visual concepts, such as object quantity, at different levels of the captioning architecture

Conference papers on the topic "Image Captioning (IC)":

1

Guo, Qilin, Yajing Xu, and Sheng Gao. "Recorrect Net: Visual Guidance for Image Captioning." In 2021 7th IEEE International Conference on Network Intelligence and Digital Content (IC-NIDC). IEEE, 2021. http://dx.doi.org/10.1109/ic-nidc54101.2021.9660494.

2

Li, Jingyu, Zhendong Mao, Shancheng Fang, and Hao Li. "ER-SAN: Enhanced-Adaptive Relation Self-Attention Network for Image Captioning." In Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22). California: International Joint Conferences on Artificial Intelligence Organization, 2022. http://dx.doi.org/10.24963/ijcai.2022/151.

Abstract:

Image captioning (IC), bringing vision to language, has drawn extensive attention. Precisely describing visual relations between image objects is a key challenge in IC. We argue that the visual relations, that is, geometric positions (i.e., distance and size) and semantic interactions (i.e., actions and possessives), indicate the mutual correlations between objects. Existing Transformer-based methods typically resort to geometric positions to enhance the representation of visual relations, yet using only the shallow geometric content cannot precisely cover the complex and actional correlations. In this paper, we propose to enhance the correlations between objects from a comprehensive view that jointly considers explicit semantic and geometric relations, generating plausible captions with accurate relationship predictions. Specifically, we propose a novel Enhanced-Adaptive Relation Self-Attention Network (ER-SAN). We design the direction-sensitive semantic-enhanced attention, which considers content objects to semantic relations and semantic relations to content objects attention to learn explicit semantic-aware relations. Further, we devise an adaptive re-weight relation module that determines how much semantic and geometric attention should be activated to each relation feature. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of our ER-SAN, with improvements of CIDEr from 128.6% to 135.3%, achieving state-of-the-art performance. Codes will be released at https://github.com/CrossmodalGroup/ER-SAN.