
Journal articles on the topic 'Visual grounding of text'



Consult the top 50 journal articles for your research on the topic 'Visual grounding of text.'


You can also download the full text of each academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles across a wide variety of disciplines and organise your bibliography correctly.

1

Wang, Chao, Wei Luo, Jia-Rui Zhu, Ying-Chun Xia, Jin He, and Li-Chuan Gu. "End-to-end Visual Grounding Based on Query Text Guidance and Multi-stage Reasoning." 電腦學刊 35, no. 1 (February 2024): 83–95. http://dx.doi.org/10.53106/199115992024023501006.

Full text
Abstract:
Visual grounding locates target objects or areas in the image based on natural language expression. Most current methods extract visual features and text embeddings independently, and then carry out complex fusion reasoning to locate target objects mentioned in the query text. However, such independently extracted visual features often contain many features that are irrelevant to the query text or misleading, thus affecting the subsequent multimodal fusion module and deteriorating target localization. This study introduces a combined network model based on the transformer architecture, which realizes more accurate visual grounding by using query text to guide visual feature generation and multi-stage fusion reasoning. Specifically, the visual feature generation module reduces the interference of irrelevant features and generates visual features related to the query text through the guidance of query text features. The multi-stage fused reasoning module uses the relevant visual features obtained by the visual feature generation module and the query text embeddings for multi-stage interactive reasoning, further infers the correlation between the target image and the query text, and thereby achieves accurate localization of the object described by the query text. The effectiveness of the proposed model is experimentally verified on five public datasets, and the model outperforms state-of-the-art methods. It achieves improvements of 1.04%, 2.23%, 1.00% and 2.51% over the previous state-of-the-art methods in terms of top-1 accuracy on TestA and TestB of the RefCOCO and RefCOCO+ datasets, respectively.
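The idea of letting the query text guide visual feature generation can be pictured with a minimal cross-attention sketch in PyTorch. This is an illustration of text-guided feature filtering in general, not the authors' exact module; the layer sizes, sigmoid gating, and residual connection are assumptions.

```python
import torch
import torch.nn as nn

class TextGuidedVisualFilter(nn.Module):
    """Minimal sketch: visual tokens attend to query-text tokens, and the attended
    summary gates the visual features so that regions unrelated to the query are
    suppressed before multimodal fusion. Dimensions and gating are illustrative."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, N_v, dim), text_tokens: (B, N_t, dim)
        attended, _ = self.cross_attn(query=visual_tokens, key=text_tokens, value=text_tokens)
        gated = visual_tokens * self.gate(attended)  # keep query-relevant features
        return self.norm(visual_tokens + gated)      # residual connection

# usage sketch:
# filt = TextGuidedVisualFilter()
# out = filt(torch.randn(2, 400, 256), torch.randn(2, 20, 256))  # -> (2, 400, 256)
```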
APA, Harvard, Vancouver, ISO, and other styles
2

Regneri, Michaela, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. "Grounding Action Descriptions in Videos." Transactions of the Association for Computational Linguistics 1 (December 2013): 25–36. http://dx.doi.org/10.1162/tacl_a_00207.

Full text
Abstract:
Recent work has shown that the integration of visual information into text-based models can substantially improve model predictions, but so far only visual information extracted from static images has been used. In this paper, we consider the problem of grounding sentences describing actions in visual information extracted from videos. We present a general purpose corpus that aligns high quality videos with multiple natural language descriptions of the actions portrayed in the videos, together with an annotation of how similar the action descriptions are to each other. Experimental results demonstrate that a text-based model of similarity between actions improves substantially when combined with visual information from videos depicting the described actions.
APA, Harvard, Vancouver, ISO, and other styles
3

Zhan, Yang, Yuan Yuan, and Zhitong Xiong. "Mono3DVG: 3D Visual Grounding in Monocular Images." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 7 (March 24, 2024): 6988–96. http://dx.doi.org/10.1609/aaai.v38i7.28525.

Full text
Abstract:
We introduce a novel task of 3D visual grounding in monocular RGB images using language descriptions with both appearance and geometry information. Specifically, we build a large-scale dataset, Mono3DRefer, which contains 3D object targets with their corresponding geometric text descriptions, generated by ChatGPT and refined manually. To foster this task, we propose Mono3DVG-TR, an end-to-end transformer-based network, which takes advantage of both the appearance and geometry information in text embeddings for multi-modal learning and 3D object localization. Depth predictor is designed to explicitly learn geometry features. The dual text-guided adapter is proposed to refine multiscale visual and geometry features of the referred object. Based on depth-text-visual stacking attention, the decoder fuses object-level geometric cues and visual appearance into a learnable query. Comprehensive benchmarks and some insightful analyses are provided for Mono3DVG. Extensive comparisons and ablation studies show that our method significantly outperforms all baselines. The dataset and code will be released.
APA, Harvard, Vancouver, ISO, and other styles
4

Zhang, Qianjun, and Jin Yuan. "Semantic-Aligned Cross-Modal Visual Grounding Network with Transformers." Applied Sciences 13, no. 9 (May 4, 2023): 5649. http://dx.doi.org/10.3390/app13095649.

Full text
Abstract:
Multi-modal deep learning methods have achieved great improvements in visual grounding; their objective is to localize text-specified objects in images. Most of the existing methods can localize and classify objects with significant appearance differences but suffer from the misclassification problem for extremely similar objects, due to inadequate exploration of multi-modal features. To address this problem, we propose a novel semantic-aligned cross-modal visual grounding network with transformers (SAC-VGNet). SAC-VGNet integrates visual and textual features with semantic alignment to highlight important feature cues for capturing tiny differences between similar objects. Technically, SAC-VGNet incorporates a multi-modal fusion module to effectively fuse visual and textual descriptions. It also introduces contrastive learning to align linguistic and visual features on the text-to-pixel level, enabling the capture of subtle differences between objects. The overall architecture is end-to-end without the need for extra parameter settings. To evaluate our approach, we manually annotate text descriptions for images in two fine-grained visual grounding datasets. The experimental results demonstrate that SAC-VGNet significantly improves performance in fine-grained visual grounding.
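As a rough illustration of the text-to-pixel alignment idea described above (not SAC-VGNet's actual loss), the sketch below scores every pixel embedding against the sentence embedding and supervises the scores with the target mask; the temperature, tensor shapes, and the binary cross-entropy form are assumptions.

```python
import torch
import torch.nn.functional as F

def text_to_pixel_alignment_loss(pixel_feats, text_feat, target_mask, temperature=0.07):
    """pixel_feats: (B, C, H, W) per-pixel embeddings; text_feat: (B, C) sentence embedding;
    target_mask: (B, H, W) binary mask of the referred object.
    Pixels inside the mask are pulled toward the text embedding, the rest are pushed away."""
    pix = F.normalize(pixel_feats.flatten(2), dim=1)   # (B, C, H*W)
    txt = F.normalize(text_feat, dim=1).unsqueeze(-1)  # (B, C, 1)
    logits = (pix * txt).sum(dim=1) / temperature      # (B, H*W) cosine similarities
    labels = target_mask.flatten(1).float()            # (B, H*W)
    return F.binary_cross_entropy_with_logits(logits, labels)
```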
APA, Harvard, Vancouver, ISO, and other styles
5

Shen, Haozhan, Tiancheng Zhao, Mingwei Zhu, and Jianwei Yin. "GroundVLP: Harnessing Zero-Shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 5 (March 24, 2024): 4766–75. http://dx.doi.org/10.1609/aaai.v38i5.28278.

Full text
Abstract:
Visual grounding, a crucial vision-language task involving the understanding of the visual context based on the query expression, necessitates the model to capture the interactions between objects, as well as various spatial and attribute information. However, the annotation data of visual grounding task is limited due to its time-consuming and labor-intensive annotation process, resulting in the trained models being constrained from generalizing its capability to a broader domain. To address this challenge, we propose GroundVLP, a simple yet effective zero-shot method that harnesses visual grounding ability from the existing models trained from image-text pairs and pure object detection data, both of which are more conveniently obtainable and offer a broader domain compared to visual grounding annotation data. GroundVLP proposes a fusion mechanism that combines the heatmap from GradCAM and the object proposals of open-vocabulary detectors. We demonstrate that the proposed method significantly outperforms other zero-shot methods on RefCOCO/+/g datasets, surpassing prior zero-shot state-of-the-art by approximately 28% on the test split of RefCOCO and RefCOCO+. Furthermore, GroundVLP performs comparably to or even better than some non-VLP-based supervised models on the Flickr30k entities dataset. Our code is available at https://github.com/om-ai-lab/GroundVLP.
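The fusion step can be pictured with a short sketch: given a GradCAM-style relevance heatmap for the query and candidate boxes from an open-vocabulary detector, each box is scored by the relevance mass it covers. The square-root area normalization below is an illustrative choice, not necessarily the paper's exact scoring rule.

```python
import numpy as np

def pick_box_with_heatmap(heatmap, boxes):
    """heatmap: (H, W) query-conditioned relevance map; boxes: iterable of (x1, y1, x2, y2).
    Returns the proposal whose interior accumulates the most relevance, with a mild
    area normalization so that huge boxes do not win by default."""
    best_box, best_score = None, float("-inf")
    for (x1, y1, x2, y2) in boxes:
        region = heatmap[int(y1):int(y2), int(x1):int(x2)]
        if region.size == 0:
            continue
        score = region.sum() / np.sqrt(region.size)
        if score > best_score:
            best_box, best_score = (x1, y1, x2, y2), score
    return best_box, best_score
```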
APA, Harvard, Vancouver, ISO, and other styles
6

Liu, Shilong, Shijia Huang, Feng Li, Hao Zhang, Yaoyuan Liang, Hang Su, Jun Zhu, and Lei Zhang. "DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 2 (June 26, 2023): 1728–36. http://dx.doi.org/10.1609/aaai.v37i2.25261.

Full text
Abstract:
In this paper, we study the problem of visual grounding by considering both phrase extraction and grounding (PEG). In contrast to the previous phrase-known-at-test setting, PEG requires a model to extract phrases from text and locate objects from image simultaneously, which is a more practical setting in real applications. As phrase extraction can be regarded as a 1D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text for object prediction and phrase mask prediction. Each pair of dual queries are designed to have shared positional parts but different content parts. Such a design effectively alleviates the difficulty of modality alignment between image and text (in contrast to a single query design) and empowers Transformer decoder to leverage phrase mask-guided attention to improve the performance. To evaluate the performance of PEG, we also propose a new metric CMAP (cross-modal average precision), analogous to the AP metric in object detection. The new metric overcomes the ambiguity of Recall@1 in many-box-to-one-phrase cases in phrase grounding. As a result, our PEG pre-trained DQ-DETR establishes new state-of-the-art results on all visual grounding benchmarks with a ResNet-101 backbone. For example, it achieves 91.04% and 83.51% in terms of recall rate on RefCOCO testA and testB with a ResNet-101 backbone.
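A minimal sketch of the dual-query construction described above: each query pair shares a positional part but keeps separate content parts for the object branch and the phrase-mask branch. The embedding sizes are assumptions, and the decoder that consumes these queries is omitted.

```python
import torch
import torch.nn as nn

class DualQueries(nn.Module):
    """Sketch of paired queries with a shared positional part and distinct content parts."""

    def __init__(self, num_queries: int = 100, dim: int = 256):
        super().__init__()
        self.pos = nn.Embedding(num_queries, dim)             # shared positional part
        self.obj_content = nn.Embedding(num_queries, dim)     # content part for box prediction
        self.phrase_content = nn.Embedding(num_queries, dim)  # content part for phrase masks

    def forward(self, batch_size: int):
        pos = self.pos.weight.unsqueeze(0).expand(batch_size, -1, -1)
        object_queries = pos + self.obj_content.weight.unsqueeze(0).expand(batch_size, -1, -1)
        phrase_queries = pos + self.phrase_content.weight.unsqueeze(0).expand(batch_size, -1, -1)
        return object_queries, phrase_queries  # probe image features and text features respectively
```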
APA, Harvard, Vancouver, ISO, and other styles
7

Cheng, Zesen, Kehan Li, Peng Jin, Siheng Li, Xiangyang Ji, Li Yuan, Chang Liu, and Jie Chen. "Parallel Vertex Diffusion for Unified Visual Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 2 (March 24, 2024): 1326–34. http://dx.doi.org/10.1609/aaai.v38i2.27896.

Full text
Abstract:
Unified visual grounding (UVG) capitalizes on a wealth of task-related knowledge across various grounding tasks via one-shot training, which curtails retraining costs and task-specific architecture design efforts. Vertex generation-based UVG methods achieve this versatility by unified modeling object box and contour prediction and provide a text-powered interface to vast related multi-modal tasks, e.g., visual question answering and captioning. However, these methods typically generate vertexes sequentially through autoregression, which is prone to be trapped in error accumulation and heavy computation, especially for high-dimension sequence generation in complex scenarios. In this paper, we develop Parallel Vertex Diffusion (PVD) based on the parallelizability of diffusion models to accurately and efficiently generate vertexes in a parallel and scalable manner. Since the coordinates fluctuate greatly, it typically encounters slow convergence when training diffusion models without geometry constraints. Therefore, we consummate our PVD by two critical components, i.e., center anchor mechanism and angle summation loss, which serve to normalize coordinates and adopt a differentiable geometry descriptor from the point-in-polygon problem of computational geometry to constrain the overall difference of prediction and label vertexes. These innovative designs empower our PVD to demonstrate its superiority with state-of-the-art performance across various grounding tasks.
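The angle summation idea comes from the classical point-in-polygon test: the signed angles that a polygon's edges subtend at an interior point sum to 2π, and to 0 at an exterior point. The sketch below computes that descriptor differentiably; how PVD turns it into a loss between predicted and label vertexes is not reproduced here.

```python
import torch

def angle_summation(point, vertices):
    """point: (2,) query location; vertices: (N, 2) ordered polygon vertices.
    Returns the sum of signed angles subtended at `point`: ~2*pi inside, ~0 outside."""
    d = vertices - point                        # vectors from the point to each vertex
    d_next = torch.roll(d, shifts=-1, dims=0)   # vectors to the next vertex (wraps around)
    cross = d[:, 0] * d_next[:, 1] - d[:, 1] * d_next[:, 0]
    dot = (d * d_next).sum(dim=1)
    return torch.atan2(cross, dot).sum()        # signed angle per edge, summed

square = torch.tensor([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
print(angle_summation(torch.tensor([0.5, 0.5]), square))  # ~6.283 (inside)
print(angle_summation(torch.tensor([2.0, 2.0]), square))  # ~0.0 (outside)
```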
APA, Harvard, Vancouver, ISO, and other styles
8

Feng, Steven Y., Kevin Lu, Zhuofu Tao, Malihe Alikhani, Teruko Mitamura, Eduard Hovy, and Varun Gangal. "Retrieve, Caption, Generate: Visual Grounding for Enhancing Commonsense in Text Generation Models." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 10618–26. http://dx.doi.org/10.1609/aaai.v36i10.21306.

Full text
Abstract:
We investigate the use of multimodal information contained in images as an effective method for enhancing the commonsense of Transformer models for text generation. We perform experiments using BART and T5 on concept-to-text generation, specifically the task of generative commonsense reasoning, or CommonGen. We call our approach VisCTG: Visually Grounded Concept-to-Text Generation. VisCTG involves captioning images representing appropriate everyday scenarios, and using these captions to enrich and steer the generation process. Comprehensive evaluation and analysis demonstrate that VisCTG noticeably improves model performance while successfully addressing several issues of the baseline generations, including poor commonsense, fluency, and specificity.
APA, Harvard, Vancouver, ISO, and other styles
9

Jia, Meihuizi, Lei Shen, Xin Shen, Lejian Liao, Meng Chen, Xiaodong He, Zhendong Chen, and Jiaqi Li. "MNER-QG: An End-to-End MRC Framework for Multimodal Named Entity Recognition with Query Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 7 (June 26, 2023): 8032–40. http://dx.doi.org/10.1609/aaai.v37i7.25971.

Full text
Abstract:
Multimodal named entity recognition (MNER) is a critical step in information extraction, which aims to detect entity spans and classify them to corresponding entity types given a sentence-image pair. Existing methods either (1) obtain named entities with coarse-grained visual clues from attention mechanisms, or (2) first detect fine-grained visual regions with toolkits and then recognize named entities. However, they suffer from improper alignment between entity types and visual regions or error propagation in the two-stage manner, which finally imports irrelevant visual information into texts. In this paper, we propose a novel end-to-end framework named MNER-QG that can simultaneously perform MRC-based multimodal named entity recognition and query grounding. Specifically, with the assistance of queries, MNER-QG can provide prior knowledge of entity types and visual regions, and further enhance representations of both text and image. To conduct the query grounding task, we provide manual annotations and weak supervisions that are obtained via training a highly flexible visual grounding model with transfer learning. We conduct extensive experiments on two public MNER datasets, Twitter2015 and Twitter2017. Experimental results show that MNER-QG outperforms the current state-of-the-art models on the MNER task, and also improves the query grounding performance.
APA, Harvard, Vancouver, ISO, and other styles
10

Shi, Zhan, Yilin Shen, Hongxia Jin, and Xiaodan Zhu. "Improving Zero-Shot Phrase Grounding via Reasoning on External Knowledge and Spatial Relations." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 2 (June 28, 2022): 2253–61. http://dx.doi.org/10.1609/aaai.v36i2.20123.

Full text
Abstract:
Phrase grounding is a multi-modal problem that localizes a particular noun phrase in an image referred to by a text query. In the challenging zero-shot phrase grounding setting, the existing state-of-the-art grounding models have limited capacity in handling the unseen phrases. Humans, however, can ground novel types of objects in images with little effort, significantly benefiting from reasoning with commonsense. In this paper, we design a novel phrase grounding architecture that builds multi-modal knowledge graphs using external knowledge and then performs graph reasoning and spatial relation reasoning to localize the referred noun phrases. We perform extensive experiments on different zero-shot grounding splits sub-sampled from the Flickr30K Entities and Visual Genome datasets, demonstrating that the proposed framework is orthogonal to backbone image encoders and outperforms the baselines by 2–3% in accuracy, resulting in a significant improvement under the standard evaluation metrics.

APA, Harvard, Vancouver, ISO, and other styles
11

Geng, Wenjia, Yong Liu, Lei Chen, Sujia Wang, Jie Zhou, and Yansong Tang. "Learning Multi-Scale Video-Text Correspondence for Weakly Supervised Temporal Article Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 3 (March 24, 2024): 1896–904. http://dx.doi.org/10.1609/aaai.v38i3.27959.

Full text
Abstract:
Weakly Supervised temporal Article Grounding (WSAG) is a challenging and practical task in video understanding. Specifically, given a video and a relevant article, whose sentences are at different semantic scales, WSAG aims to localize corresponding video segments for all “groundable” sentences. Compared to other grounding tasks, e.g., localizing one target segment with respect to a given sentence query, WSAG confronts an essential obstacle rooted in the intricate multi-scale information inherent within both textual and visual modalities. Existing methods overlook the modeling and alignment of such structured information present in multi-scale video segments and hierarchical textual content. To this end, we propose a Multi-Scale Video-Text Correspondence Learning (MVTCL) framework, which enhances the grounding performance in complex scenes by modeling multi-scale semantic correspondence both within and between modalities. Specifically, MVTCL initially aggregates video content spanning distinct temporal scales and leverages hierarchical textual relationships in both temporal and semantic dimensions via a semantic calibration module. Then multi-scale contrastive learning module is introduced to generate more discriminative representations by selecting typical contexts and performing inter-video contrastive learning. Through the multi-scale semantic calibration architecture and supervision design, our method achieves new state-of-the-art performance on existing WSAG benchmarks.
APA, Harvard, Vancouver, ISO, and other styles
12

Bu, Yuqi, Jiayuan Xie, Liuwu Li, Qiong Liu, and Yi Cai. "Bridging the Gap between Expression and Scene Text for Referring Expression Comprehension (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 11 (June 28, 2022): 12921–22. http://dx.doi.org/10.1609/aaai.v36i11.21597.

Full text
Abstract:
Referring expression comprehension aims at grounding the object in an image referred to by the expression. Scene text that serves as an identifier has a natural advantage in referring to objects. However, existing methods only consider the text in the expression, but ignore the text in the image, leading to a mismatch. In this paper, we propose a novel model that can recognize the scene text. We assign the extracted scene text to its corresponding visual region and ground the target object guided by expression. Experimental results on two benchmarks demonstrate the effectiveness of our model.
APA, Harvard, Vancouver, ISO, and other styles
13

Wang, Haowei, Jiayi Ji, Yiyi Zhou, Yongjian Wu, and Xiaoshuai Sun. "Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 2 (June 26, 2023): 2528–36. http://dx.doi.org/10.1609/aaai.v37i2.25350.

Full text
Abstract:
Panoptic Narrative Grounding (PNG) is an emerging cross-modal grounding task, which locates the target regions of an image corresponding to the text description. Existing approaches for PNG are mainly based on a two-stage paradigm, which is computationally expensive. In this paper, we propose a one-stage network for real-time PNG, termed End-to-End Panoptic Narrative Grounding network (EPNG), which directly generates masks for referents. Specifically, we propose two innovative designs, i.e., Locality-Perceptive Attention (LPA) and a bidirectional Semantic Alignment Loss (SAL), to properly handle the many-to-many relationship between textual expressions and visual objects. LPA embeds the local spatial priors into attention modeling, i.e., a pixel may belong to multiple masks at different scales, thereby improving segmentation. To help understand the complex semantic relationships, SAL proposes a bidirectional contrastive objective to regularize the semantic consistency inter modalities. Extensive experiments on the PNG benchmark dataset demonstrate the effectiveness and efficiency of our method. Compared to the single-stage baseline, our method achieves a significant improvement of up to 9.4% accuracy. More importantly, our EPNG is 10 times faster than the two-stage model. Meanwhile, the generalization ability of EPNG is also validated by zero-shot experiments on other grounding tasks. The source codes and trained models for all our experiments are publicly available at https://github.com/Mr-Neko/EPNG.git.
APA, Harvard, Vancouver, ISO, and other styles
14

Ayyubi, Hammad, Christopher Thomas, Lovish Chum, Rahul Lokesh, Long Chen, Yulei Niu, Xudong Lin, et al. "Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 16 (March 24, 2024): 17664–72. http://dx.doi.org/10.1609/aaai.v38i16.29718.

Full text
Abstract:
Events describe happenings in our world that are of importance. Naturally, understanding events mentioned in multimedia content and how they are related forms an important way of comprehending our world. Existing literature can infer if events across textual and visual (video) domains are identical (via grounding) and thus, on the same semantic level. However, grounding fails to capture the intricate cross-event relations that exist due to the same events being referred to on many semantic levels. For example, the abstract event of "war" manifests at a lower semantic level through subevents "tanks firing" (in video) and airplane "shot" (in text), leading to a hierarchical, multimodal relationship between the events. In this paper, we propose the task of extracting event hierarchies from multimodal (video and text) data to capture how the same event manifests itself in different modalities at different semantic levels. This reveals the structure of events and is critical to understanding them. To support research on this task, we introduce the Multimodal Hierarchical Events (MultiHiEve) dataset. Unlike prior video-language datasets, MultiHiEve is composed of news video-article pairs, which makes it rich in event hierarchies. We densely annotate a part of the dataset to construct the test benchmark. We show the limitations of state-of-the-art unimodal and multimodal baselines on this task. Further, we address these limitations via a new weakly supervised model, leveraging only unannotated video-article pairs from MultiHiEve. We perform a thorough evaluation of our proposed method which demonstrates improved performance on this task and highlight opportunities for future research. Data: https://github.com/hayyubi/multihieve
APA, Harvard, Vancouver, ISO, and other styles
15

Bruni, E., N. K. Tran, and M. Baroni. "Multimodal Distributional Semantics." Journal of Artificial Intelligence Research 49 (January 23, 2014): 1–47. http://dx.doi.org/10.1613/jair.4135.

Full text
Abstract:
Distributional semantic models derive computational representations of word meaning from the patterns of co-occurrence of words in text. Such models have been a success story of computational linguistics, being able to provide reliable estimates of semantic relatedness for the many semantic tasks requiring them. However, distributional models extract meaning information exclusively from text, which is an extremely impoverished basis compared to the rich perceptual sources that ground human semantic knowledge. We address the lack of perceptual grounding of distributional models by exploiting computer vision techniques that automatically identify discrete “visual words” in images, so that the distributional representation of a word can be extended to also encompass its co-occurrence with the visual words of images it is associated with. We propose a flexible architecture to integrate text- and image-based distributional information, and we show in a set of empirical tests that our integrated model is superior to the purely text-based approach, and it provides somewhat complementary semantic information with respect to the latter.
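A minimal sketch of the fusion the abstract describes: a text-based co-occurrence vector and a bag-of-visual-words vector for the same word are normalized separately and concatenated with a mixing weight. This weighted-concatenation scheme is one simple instantiation, not necessarily the paper's exact model, which also explores other integration and dimensionality-reduction choices.

```python
import numpy as np

def multimodal_vector(text_cooc, visual_word_counts, alpha=0.5):
    """text_cooc: co-occurrence vector of a word from a text corpus;
    visual_word_counts: counts of discrete "visual words" from images associated with it.
    Each channel is L2-normalized, weighted, and concatenated into one representation."""
    t = np.asarray(text_cooc, dtype=float)
    v = np.asarray(visual_word_counts, dtype=float)
    t /= (np.linalg.norm(t) + 1e-12)
    v /= (np.linalg.norm(v) + 1e-12)
    return np.concatenate([alpha * t, (1.0 - alpha) * v])

def relatedness(vec_a, vec_b):
    """Cosine similarity between two multimodal word vectors."""
    return float(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b) + 1e-12))
```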
APA, Harvard, Vancouver, ISO, and other styles
16

Li, Mingxiao, Zehao Wang, Tinne Tuytelaars, and Marie-Francine Moens. "Layout-Aware Dreamer for Embodied Visual Referring Expression Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 1 (June 26, 2023): 1386–95. http://dx.doi.org/10.1609/aaai.v37i1.25223.

Full text
Abstract:
In this work, we study the problem of Embodied Referring Expression Grounding, where an agent needs to navigate in a previously unseen environment and localize a remote object described by a concise high-level natural language instruction. When facing such a situation, a human tends to imagine what the destination may look like and to explore the environment based on prior knowledge of the environmental layout, such as the fact that a bathroom is more likely to be found near a bedroom than a kitchen. We have designed an autonomous agent called Layout-aware Dreamer (LAD), including two novel modules, that is, the Layout Learner and the Goal Dreamer to mimic this cognitive decision process. The Layout Learner learns to infer the room category distribution of neighboring unexplored areas along the path for coarse layout estimation, which effectively introduces layout common sense of room-to-room transitions to our agent. To learn an effective exploration of the environment, the Goal Dreamer imagines the destination beforehand. Our agent achieves new state-of-the-art performance on the public leaderboard of REVERIE dataset in challenging unseen test environments with improvement on navigation success rate (SR) by 4.02% and remote grounding success (RGS) by 3.43% compared to the previous state of the art. The code is released at https://github.com/zehao-wang/LAD.
APA, Harvard, Vancouver, ISO, and other styles
17

Lei, Yang, Peizhi Zhao, Pijian Li, Yi Cai, and Qingbao Huang. "Linking People across Text and Images Based on Social Relation Reasoning." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 1 (June 26, 2023): 1260–68. http://dx.doi.org/10.1609/aaai.v37i1.25209.

Full text
Abstract:
As a sub-task of visual grounding, linking people across text and images aims to localize target people in images with corresponding sentences. Existing approaches tend to capture superficial features of people (e.g., dress and location) that suffer from the incompleteness information across text and images. We observe that humans are adept at exploring social relations to assist identifying people. Therefore, we propose a Social Relation Reasoning (SRR) model to address the aforementioned issues. Firstly, we design a Social Relation Extraction (SRE) module to extract social relations between people in the input sentence. Specially, the SRE module based on zero-shot learning is able to extract social relations even though they are not defined in the existing datasets. A Reasoning based Cross-modal Matching (RCM) module is further used to generate matching matrices by reasoning on the social relations and visual features. Experimental results show that the accuracy of our proposed SRR model outperforms the state-of-the-art models on the challenging datasets Who's Waldo and FL: MSRE, by more than 5% and 7%, respectively. Our source code is available at https://github.com/VILAN-Lab/SRR.
APA, Harvard, Vancouver, ISO, and other styles
18

Reddy, Revant Gangi, Xilin Rui, Manling Li, Xudong Lin, Haoyang Wen, Jaemin Cho, Lifu Huang, et al. "MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 11200–11208. http://dx.doi.org/10.1609/aaai.v36i10.21370.

Full text
Abstract:
Recently, there has been an increasing interest in building question answering (QA) models that reason across multiple modalities, such as text and images. However, QA using images is often limited to just picking the answer from a pre-defined set of options. In addition, images in the real world, especially in news, have objects that are co-referential to the text, with complementary information from both modalities. In this paper, we present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text. Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question. In addition, we introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task. We evaluate both pipeline-based and end-to-end pretraining-based multimedia QA models on our benchmark, and show that they achieve promising performance, while considerably lagging behind human performance hence leaving large room for future work on this challenging new task.
APA, Harvard, Vancouver, ISO, and other styles
19

Xu, Lingjing, Yang Gao, Wenfeng Song, and Aimin Hao. "Weakly Supervised Multimodal Affordance Grounding for Egocentric Images." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 6 (March 24, 2024): 6324–32. http://dx.doi.org/10.1609/aaai.v38i6.28451.

Full text
Abstract:
To enhance the interaction between intelligent systems and the environment, locating the affordance regions of objects is crucial. These regions correspond to specific areas that provide distinct functionalities. Humans often acquire the ability to identify these regions through action demonstrations and verbal instructions. In this paper, we present a novel multimodal framework that extracts affordance knowledge from exocentric images, which depict human-object interactions, as well as from accompanying textual descriptions that describe the performed actions. The extracted knowledge is then transferred to egocentric images. To achieve this goal, we propose the HOI-Transfer Module, which utilizes local perception to disentangle individual actions within exocentric images. This module effectively captures localized features and correlations between actions, leading to valuable affordance knowledge. Additionally, we introduce the Pixel-Text Fusion Module, which fuses affordance knowledge by identifying regions in egocentric images that bear resemblances to the textual features defining affordances. We employ a Weakly Supervised Multimodal Affordance (WSMA) learning approach, utilizing image-level labels for training. Through extensive experiments, we demonstrate the superiority of our proposed method in terms of evaluation metrics and visual results when compared to existing affordance grounding models. Furthermore, ablation experiments confirm the effectiveness of our approach. Code:https://github.com/xulingjing88/WSMA.
APA, Harvard, Vancouver, ISO, and other styles
20

Scarlini, Bianca, Tommaso Pasini, and Roberto Navigli. "Visual Definition Modeling: Challenging Vision & Language Models to Define Words and Objects." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 11267–75. http://dx.doi.org/10.1609/aaai.v36i10.21377.

Full text
Abstract:
Architectures that model language and vision together have received much attention in recent years. Nonetheless, most tasks in this field focus on end-to-end applications without providing insights on whether it is the underlying semantics of visual objects or words that is captured. In this paper we draw on the established Definition Modeling paradigm and enhance it by grounding, for the first time, textual definitions to visual representations. We name this new task Visual Definition Modeling and put forward DEMETER and DIONYSUS, two benchmarks where, given an image as context, models have to generate a textual definition for a target being either i) a word that describes the image, or ii) an object patch therein. To measure the difficulty of our tasks we finetuned six different baselines and analyzed their performances, which show that a text-only encoder-decoder model is more effective than models pretrained for handling inputs of both modalities concurrently. This demonstrates the complexity of our benchmarks and encourages more research on text generation conditioned on multimodal inputs. The datasets for both benchmarks are available at https://github.com/SapienzaNLP/visual-definition-modeling as well as the code to reproduce our models.
APA, Harvard, Vancouver, ISO, and other styles
21

Mi, Li, Syrielle Montariol, Javiera Castillo Navarro, Xianjie Dai, Antoine Bosselut, and Devis Tuia. "ConVQG: Contrastive Visual Question Generation with Multimodal Guidance." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 5 (March 24, 2024): 4207–15. http://dx.doi.org/10.1609/aaai.v38i5.28216.

Full text
Abstract:
Asking questions about visual environments is a crucial way for intelligent agents to understand rich multi-faceted scenes, raising the importance of Visual Question Generation (VQG) systems. Apart from being grounded to the image, existing VQG systems can use textual constraints, such as expected answers or knowledge triplets, to generate focused questions. These constraints allow VQG systems to specify the question content or leverage external commonsense knowledge that can not be obtained from the image content only. However, generating focused questions using textual constraints while enforcing a high relevance to the image content remains a challenge, as VQG systems often ignore one or both forms of grounding. In this work, we propose Contrastive Visual Question Generation (ConVQG), a method using a dual contrastive objective to discriminate questions generated using both modalities from those based on a single one. Experiments on both knowledge-aware and standard VQG benchmarks demonstrate that ConVQG outperforms the state-of-the-art methods and generates image-grounded, text-guided, and knowledge-rich questions. Our human evaluation results also show preference for ConVQG questions compared to non-contrastive baselines.
APA, Harvard, Vancouver, ISO, and other styles
22

Xu, Yifang, Yunzhuo Sun, Zien Xie, Benxiang Zhai, and Sidan Du. "VTG-GPT: Tuning-Free Zero-Shot Video Temporal Grounding with GPT." Applied Sciences 14, no. 5 (February 25, 2024): 1894. http://dx.doi.org/10.3390/app14051894.

Full text
Abstract:
Video temporal grounding (VTG) aims to locate specific temporal segments from an untrimmed video based on a linguistic query. Most existing VTG models are trained on extensive annotated video-text pairs, a process that not only introduces human biases from the queries but also incurs significant computational costs. To tackle these challenges, we propose VTG-GPT, a GPT-based method for zero-shot VTG without training or fine-tuning. To reduce prejudice in the original query, we employ Baichuan2 to generate debiased queries. To lessen redundant information in videos, we apply MiniGPT-v2 to transform visual content into more precise captions. Finally, we devise the proposal generator and post-processing to produce accurate segments from debiased queries and image captions. Extensive experiments demonstrate that VTG-GPT significantly outperforms SOTA methods in zero-shot settings and surpasses unsupervised approaches. More notably, it achieves competitive performance comparable to supervised methods. The code is available on GitHub.
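The training-free pipeline can be summarized in a short sketch. The helpers `debias_query`, `caption_frame`, and `text_similarity` are hypothetical stand-ins for the LLM, the captioner, and a sentence-similarity model mentioned in the abstract, and the run-length proposal rule is a simplification of the paper's proposal generator and post-processing.

```python
def ground_query_in_video(frames, query, debias_query, caption_frame, text_similarity,
                          threshold=0.5, fps=1.0):
    """Return (start_sec, end_sec) of the longest run of frames whose captions match
    the debiased query; all model calls are injected as plain callables."""
    clean_query = debias_query(query)                         # LLM rewrites the raw query
    scores = [text_similarity(caption_frame(f), clean_query)  # caption each frame, score it
              for f in frames]
    best_len, run_start, best_span = 0, None, None
    for i, s in enumerate(scores + [float("-inf")]):          # sentinel closes the last run
        if s >= threshold and run_start is None:
            run_start = i
        elif s < threshold and run_start is not None:
            if i - run_start > best_len:
                best_len, best_span = i - run_start, (run_start / fps, i / fps)
            run_start = None
    return best_span
```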
APA, Harvard, Vancouver, ISO, and other styles
23

Huang, Pin-Hao, Han-Hung Lee, Hwann-Tzong Chen, and Tyng-Luh Liu. "Text-Guided Graph Neural Networks for Referring 3D Instance Segmentation." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 2 (May 18, 2021): 1610–18. http://dx.doi.org/10.1609/aaai.v35i2.16253.

Full text
Abstract:
This paper addresses a new task called referring 3D instance segmentation, which aims to segment out the target instance in a 3D scene given a query sentence. Previous work on scene understanding has explored visual grounding with natural language guidance, yet the emphasis is mostly constrained on images and videos. We propose a Text-guided Graph Neural Network (TGNN) for referring 3D instance segmentation on point clouds. Given a query sentence and the point cloud of a 3D scene, our method learns to extract per-point features and predicts an offset to shift each point toward its object center. Based on the point features and the offsets, we cluster the points to produce fused features and coordinates for the candidate objects. The resulting clusters are modeled as nodes in a Graph Neural Network to learn the representations that encompass the relation structure for each candidate object. The GNN layers leverage each object's features and its relations with neighbors to generate an attention heatmap for the input sentence expression. Finally, the attention heatmap is used to "guide" the aggregation of information from neighborhood nodes. Our method achieves state-of-the-art performance on referring 3D instance segmentation and 3D localization on ScanRefer, Nr3D, and Sr3D benchmarks, respectively.
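The offset-and-cluster step can be sketched as follows: every point is shifted by its predicted offset toward its object center, and the shifted coordinates are grouped into candidate instances. DBSCAN and its parameters are stand-ins for illustration; the paper's actual clustering and the GNN reasoning built on top of the clusters are not reproduced.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_by_predicted_centers(xyz, offsets, eps=0.3, min_samples=20):
    """xyz, offsets: (N, 3) arrays of point coordinates and predicted center offsets.
    Returns candidate objects as index sets with fused (mean-shifted) centers."""
    shifted = xyz + offsets
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(shifted)
    candidates = {}
    for label in set(labels):
        if label == -1:                       # DBSCAN noise points
            continue
        idx = np.where(labels == label)[0]
        candidates[label] = {"points": idx, "center": shifted[idx].mean(axis=0)}
    return candidates
```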
APA, Harvard, Vancouver, ISO, and other styles
24

Hunter, yaTande Whitney V. "The “Ring Shout”: A Corporeal Conjuring of Black-Togetherness." Dance Research Journal 55, no. 2 (August 2023): 44–57. http://dx.doi.org/10.1017/s0149767723000268.

Full text
Abstract:
This article explores the Ring Shout as a corporeal conjuring of Black-togetherness. Theoretically, I embrace the notion of assembly in ways that offer new comprehension around both implicit and explicit modes of embodiment in constant play within Black cultural modes. I turn to the research of Katrina Hazzard-Donald, Dr. Yvonne Daniel, and M. Jacqui Alexander for theoretical grounding regarding diasporic Afro-spiritualities, while artists such as Talley Beatty, Reggie Wilson, and Chief Xian aTunde Adjuah (formerly Christian Scott) provide landmarks for the artistic and aesthetic discourse of the text. I introduce a concept, AfrOist, as a navigation through and toward a recontextualization of centralized Africanist tendencies. With this shift, cultural inheritances are remembered and claimed.
APA, Harvard, Vancouver, ISO, and other styles
25

Yang, Ze, Wei Wu, Huang Hu, Can Xu, Wei Wang, and Zhoujun Li. "Open Domain Dialogue Generation with Latent Images." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 16 (May 18, 2021): 14239–47. http://dx.doi.org/10.1609/aaai.v35i16.17675.

Full text
Abstract:
We consider grounding open domain dialogues with images. Existing work assumes that both an image and a textual context are available, but image-grounded dialogues by nature are more difficult to obtain than textual dialogues. Thus, we propose learning a response generation model with both image-grounded dialogues and textual dialogues by assuming that the visual scene information at the time of a conversation can be represented by an image, and trying to recover the latent images of the textual dialogues through text-to-image generation techniques. The likelihood of the two types of dialogues is then formulated by a response generator and an image reconstructor that are learned within a conditional variational auto-encoding framework. Empirical studies are conducted in both image-grounded conversation and text-based conversation. In the first scenario, image-grounded dialogues, especially under a low-resource setting, can be effectively augmented by textual dialogues with latent images; while in the second scenario, latent images can enrich the content of responses and at the same time keep them relevant to contexts.
APA, Harvard, Vancouver, ISO, and other styles
26

Zhang, Zhengkun, Xiaojun Meng, Yasheng Wang, Xin Jiang, Qun Liu, and Zhenglu Yang. "UniMS: A Unified Framework for Multimodal Summarization with Knowledge Distillation." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 11757–64. http://dx.doi.org/10.1609/aaai.v36i10.21431.

Full text
Abstract:
With the rapid increase of multimedia data, a large body of literature has emerged to work on multimodal summarization, the majority of which target at refining salient information from textual and image modalities to output a pictorial summary with the most relevant images. Existing methods mostly focus on either extractive or abstractive summarization and rely on the presence and quality of image captions to build image references. We are the first to propose a Unified framework for Multimodal Summarization grounding on BART, UniMS, that integrates extractive and abstractive objectives, as well as selecting the image output. Specially, we adopt knowledge distillation from a vision-language pretrained model to improve image selection, which avoids any requirement on the existence and quality of image captions. Besides, we introduce a visual guided decoder to better integrate textual and visual modalities in guiding abstractive text generation. Results show that our best model achieves a new state-of-the-art result on a large-scale benchmark dataset. The newly involved extractive objective as well as the knowledge distillation technique are proven to bring a noticeable improvement to the multimodal summarization task.
APA, Harvard, Vancouver, ISO, and other styles
27

Callens, Johan. "Staging the Televised (Nation)." Theatre Research International 28, no. 1 (February 17, 2003): 61–78. http://dx.doi.org/10.1017/s0307883303000154.

Full text
Abstract:
The performative uses which Mark Ravenhill's Faust (Faust Is Dead) (1997) and Anna Deavere Smith's Twilight: Los Angeles, 1992 (1993), no matter how different, have made of the televised 1992 Los Angeles riots, underwrite Hal Foster's thesis that the 1990s have been confronted with a ‘return of the real’ in art and theory, through the insistence upon a renewed grounding in actual bodies and social sites, after the 1970s paradigm of art-as-text (Foucault) and the 1980s' art-as-simulacrum (Baudrillard). As such, Ravenhill's play and Smith's docudrama permit a commentary on the terrorist attacks in New York on 11 September 2001, when two planes crashed into the World Trade Center thereby further exploding the nation's semblance of reality and the false immunity it fosters.
APA, Harvard, Vancouver, ISO, and other styles
28

Saddiqa, Ayesha, Fatima Sajid Chauhan, and Adeen Asif. "Coke Studio: Adaptation of Folk Songs for Bridging Cultural and Generation Gap." Global Social Sciences Review IX, no. I (March 30, 2024): 145–57. http://dx.doi.org/10.31703/gssr.2024(ix-i).13.

Full text
Abstract:
Coke Studio (CS, Pakistan) showcases a fusion of various musical genres, from traditional classical, folk, Sufi, ghazal, and qawwali to contemporary hip-hop, rock, and pop. This study explores the adaptation of folk songs within CS through Cardwell's 'meta-text' theory (2002), contending that the essence of the original text is retained in subsequent adaptations. Additionally, it examines the rhizome-like nature of these adaptations. Employing multimodality, the study analyzes folk songs from Coke Studio Season 10, utilizing auditory, visual, and spatial elements to create a cohesive artifact with broad semiotic appeal in today's globalized world. CS's immense popularity underscores its ability to balance aesthetic concerns with technological advancements. Furthermore, the study positions CS as a platform for rediscovering, reforming, and sustaining cultural heritage, catering to the new generation. By blending traditional folk with rock elements and appealing to audiences of all ages, CS bridges the gap between generations, fostering a 'third space' music as proposed by Bhabha (1994). This music is now intertwined with contemporary youth culture, contributing to the emergence of a new national identity. Thus, the adaptations of folk songs in Coke Studio serve as a contemporary reinterpretation of history and cultural heritage, connecting youth with their past while grounding them in the present.
APA, Harvard, Vancouver, ISO, and other styles
29

Urzha, Anastasia V. "The Foregrounding Function of Praesens Historicum in Russian Translated Adventure Narratives (20th Century)." Slovene 5, no. 1 (2016): 226–48. http://dx.doi.org/10.31168/2305-6754.2016.5.1.9.

Full text
Abstract:
This research focuses on the functioning of praesens historicum forms which Russian translators use to substitute for English narrative forms referring to past events. The study applies the Theory of Grounding and Russian Communicative Functional Grammar to the comparative discourse analysis of English-language adventure stories and novels created in the 19th and 20th centuries and their Russian translations. The Theory of Grounding is still not widely used in Russian translation studies, nor have its concepts and fruitful ideas been related to the achievements of Russian Narratology and Functional Grammar. This article presents an attempt to find a common basis in these academic traditions as they relate to discourse analysis and to describe the role of praesens historicum forms in Russian translated adventure narratives. The corpus includes 22 original texts and 72 Russian translations, and the case study involves six Russian translations of The Adventures of Tom Sawyer, focusing on the translation made by Korney Chukovsky, who employed historic present more often than in other translations of the novel. It is shown that the translation strategy of substituting the original English-language past forms with Russian present forms is realized in foregrounded and focalized segments of the text, giving them additional saliency. This strategy relates the use of historic present to the functions of deictic words and words denoting visual or audial perception, locating the deictic center of the narrative in the spacetime of the events and allowing the reader to join the focalizing WHO (a narrator or a hero). Translations that regularly mark the foreground through the use of the historic present and accompanying lexical-grammatical means are often addressed to young readers.
APA, Harvard, Vancouver, ISO, and other styles
30

Fedorenko, Svitlana V., and Kateryna B. Sheremeta. "U.S. UNIVERSITY WEBSITES AS SPECIFIC MULTIMODAL TEXTS." Alfred Nobel University Journal of Philology 2, no. 26/2 (December 26, 2023): 9–26. http://dx.doi.org/10.32342/2523-4463-2023-2-26/2-1.

Full text
Abstract:
The aim of the article was to study the specifics of the interfaces of the U.S. university websites as multimodal heterogeneous texts that synthesize elements of educational, scientific and advertising discourses. The overall objectives to achieve the established goal were as follows: to identify and distinguish the types of multimodal means on the U.S. university website, which contribute to its genre mixing and genre embedding; to establish the nature of the interaction of verbal, non-verbal and para-verbal components of the U.S. university websites, and to determine their pragmatic features. The methodological basis of the research was a complex of the following methods: analysis (to study multimodal components of the university website as a specific multimodal text), synthesis (to identify the features of the integration of multimodal means of the websites of American universities), observation (for the selection of fragments with verbal means that actualize the visual content and the selection of visual fragments to actualize the verbal content), the method of discourse analysis (to highlight specific fragments of websites that arouse the interest of the authors of this article, and have a meaningful content), structural method (to analyze the university website as a whole structure, which is provided by separate means of cohesion), functional method (to clarify the pragmatic potential of multimodal elements of the university website, which are means of communication between the university and the reader of its website). It also employed the system functional (drawing on the provisions of linguistic metafunctions, and focusing on the categories of the grammar of visual design) and the socio-semiotic (grounding on the interrelationship of modes, their compatibility and social needs for which they serve, making meanings) approaches. The chosen methodology made it possible to conduct a study of the multimodality of the websites of the U.S. universities, realized as a symbiosis of verbal, non-verbal and paraverbal resources. The multimedia corpus of the research consists of the websites of five American universities (Massachusetts Institute of Technology, Harvard University, University of Pennsylvania, Yale University and Princeton University). The main conclusion that can be drawn is that the complex discursive nature of the websites under study is determined by the features inherent in advertising (the benefits of services to influence the choice of the recipient), educational (the talk about the educational process and educational services) and scientific (information of a scientific nature is provided) discourses. All universities under study employ semiotic landscapes at their disposal to portray attractive brands on their websites. Being the most important way to ensure fast and effective communication of educational institutions with their target audience, the discourse of university websites has a pronounced pragmatic orientation. The purpose of the analyzed type of heterogeneous discourse is to create an image of an “ideal” educational institution, attract potential students, researchers, sponsors, and disseminate the latest achievements in the field of science and education. The concept of multimodality of the websites of the analyzed U.S.
universities as specific multimodal texts is manifested in visual content through a number of paragraphemic and infographic elements, the synthesis of which is due to the combination of language tools, visual content and web technologies of modern website construction. The most common visual content exploited on the U.S. university websites embraces: unique photographs and “color” mode (photos of the university and its students, classrooms, laboratories, events, etc.), which helps to clearly illustrate the educational services offered, and give the desired emotional mood; infographics and data visualization, which is an effective way to combine text, pictures and design to present complex information (infographics do not always completely replace the text, more often it is its addition or retelling); video interviews with students, graduates, videos about studying at a university are one of the means to convince potential students to make an admission decision. Using video is a fairly popular form of visual content. With the help of video, the universities can not only diversify the content of their websites, but also satisfy the needs of those users who prefer visual content. Placing various videos on website pages allows solving the problems of reinforcing textual content, strengthening the arguments “for” admission and attracting applicants to university educational programs. In such a way, on the basis of the interaction of different discourses (advertising, educational and scientific) and various semiotic systems, a single visual-structural and functionally complete image of an attractive and popular university is achieved among readers of its website.
APA, Harvard, Vancouver, ISO, and other styles
31

Shaikhlislamova, E. R., L. K. Karimova, S. A. Gallyamova, F. A. Urmantseva, D. R. Iskhakova, and R. A. Alakaeva. "Grounding of using treatment-and-rehabilitation complex for patients with occupational lumbosacral radiculopathy." Perm Medical Journal 35, no. 2 (April 15, 2018): 85–92. http://dx.doi.org/10.17816/pmj35285-92.

Full text
Abstract:
Aim. To ground and estimate the efficiency of using nonmedicamentous methods of treatment in patients with occupational radiculopathy of lumbosacral level. Materials and methods. Sixty seven patients, diagnosed occupational lumbosacral radiculopathy, were examined; patients’ neurological status, manifestation of pain syndrome by 10-score visual analog scale and neuromuscular state by stimulation electroneuromyography data prior to and after treatment were assessed. Results. Treatment-and-rehabilitation complex included magnetic-laser therapy, acupuncture, combined with curative gymnastics. By the completion of treatment, positive dynamics regarding nearly all neurological states, including reduced manifestation of painful sensations, increased volume of active movements in the lumbar spine, decreased hyperesthesia rate and hypoesthesia intensity in the zone of injured roots, decreased occurrence of Lasegue test was observed. Dynamics of electroneuromyographic parameters was expressed by improved conduction of nerve impulses through the peripheral low extremity nerves and roots L4, L5, S1 of cerebrospinal nerves. Conclusions. Application of the offered complex provided positive effect by the end of treatment in 82.3 % of cases.
APA, Harvard, Vancouver, ISO, and other styles
32

Koliushko, D. G., S. S. Rudenko, and A. N. Saliba. "Method of integro-differential equations for interpreting the results of vertical electrical sounding of the soil." Electrical Engineering & Electromechanics, no. 5 (October 18, 2021): 67–70. http://dx.doi.org/10.20998/2074-272x.2021.5.09.

Full text
Abstract:
The paper is devoted to the problem of determining the geoelectric structure of the soil within the procedure of testing the grounding arrangements of existing power plants and substations to the required depth in conditions of dense development. To solve the problem, it was proposed to use the Schlumberger method, which has a greater sounding depth compared to the Wenner electrode array. The purpose of the work is to develop a mathematical model for interpreting the results of soil sounding by the Schlumberger method in the form of a four-layer geoelectric structure. Methodology. To construct a mathematical model, it is proposed to use the solution of a particular problem about the field of a point current source, which, like the observation point, is located in the first layer of a four-layer soil. Based on this expression, a system of linear algebraic equations of the 7th order with respect to the unknown coefficients a_i and b_i was compiled. On the basis of its analytical solution, an expression for the potential of the electric field was obtained for conducting VES (the point current source and the observation point are located only on the soil surface). Results. Comparison of the results of soil sounding by the Schlumberger installation and the interpretation of its results for the same points shows a sufficient degree of approximation: the maximum relative error does not exceed 9.7 % (for the second point), and the average relative error is 3.6 %. Originality. Based on the obtained expression, a test version of the program was implemented in Visual Basic for Applications to interpret the results of VES by the Schlumberger method. To check the obtained expressions, the interpretation of the VES results was carried out on the territory of a 150 kV substation of one of the mining and processing plants in the city of Kriviy Rih. Practical significance. The developed mathematical model will make it possible to increase the sounding depth, and, consequently, the accuracy of determining the standardized parameters of the grounding arrangements of power stations and substations.
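For context, the measured quantity being interpreted in such surveys is the apparent resistivity of the Schlumberger array, which follows from the standard point-source potential over a half-space. The sketch below computes only that forward measurement quantity; the four-layer inversion and the 7th-order linear system derived in the paper are not reproduced here.

```python
import math

def schlumberger_apparent_resistivity(ab_half, mn, delta_u, current):
    """Apparent resistivity (ohm*m) for a Schlumberger array:
        rho_a = K * dU / I,  with  K = pi * ((AB/2)^2 - (MN/2)^2) / MN
    ab_half: current-electrode half-spacing AB/2 (m); mn: potential-electrode spacing MN (m);
    delta_u: measured voltage (V); current: injected current (A)."""
    k = math.pi * (ab_half ** 2 - (mn / 2.0) ** 2) / mn
    return k * delta_u / current

# example: AB/2 = 10 m, MN = 1 m, dU = 25 mV, I = 0.5 A  ->  ~15.7 ohm*m
print(schlumberger_apparent_resistivity(10.0, 1.0, 0.025, 0.5))
```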
APA, Harvard, Vancouver, ISO, and other styles
33

ШКАРБАН, Інна. "LINGUISTIC ASPECT OF MODALITY IN MODERN MATH DISCOURSE IN ENGLISH." Проблеми гуманітарних наук. Серія Філологія, no. 49 (June 8, 2022): 231–36. http://dx.doi.org/10.24919/2522-4565.2022.49.33.

Full text
Abstract:
The article examines the linguistic aspect of modality in modern mathematical discourse in English and critically outlines a number of topical problematic issues in the area, such as the distinction between epistemic modality and evidentiality, marked by the philosophical grounding of formal logic. A general review of previous scholarly work on modality in mathematics shows that it is largely based on propositional aspects of meaning. Analysis of a mathematical text corpus aims to extract the set of modalities that are indispensable for formulating modal deductive reasoning. From a linguistic perspective, however, academic mathematical discourse requires natural-language premise selection in the processes of mathematical reasoning and argumentation. It is presumed that two different self-attention cognition layers operate at the same time: one is focused on classical symbolic logic and mathematical elements (formal language), while the other attends to natural language. Defining the semantic meanings of modality markers in mathematical discourse involves an interpretation phase. Thus, objectivity is generally associated with evidential adverbs, which mark the verification of evidence concerning the speaker's assessment of the truth value of the proposition. Modal auxiliaries of high, medium and low modality, semi-modal verbs and conditionals involve ascribing a justification value within the set of possible logical inferences. The formal logical structure of mathematical reasoning explains the non-intuitive possibility of a deductive proof. It is argued that the linguistic category of modality in mathematical discourse presupposes the universal truth of knowledge, a high level of logical formalization in verifying propositional status, and the formulaic nature of argumentation, i.e. a synthesis of hypothetical preconditions, theoretical knowledge and subjectivity of reasoning leading to the verification of a new hypothesis and the visual exemplification of the empirical deductive processes, in particular by linguistic means of expressing modality.
APA, Harvard, Vancouver, ISO, and other styles
34

Huang, Jianqiang, Yu Qin, Jiaxin Qi, Qianru Sun, and Hanwang Zhang. "Deconfounded Visual Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 1 (June 28, 2022): 998–1006. http://dx.doi.org/10.1609/aaai.v36i1.19983.

Full text
Abstract:
We focus on the confounding bias between language and location in the visual grounding pipeline, where we find that this bias is the major visual reasoning bottleneck. For example, the grounding process is usually a trivial language-location association without visual reasoning, e.g., grounding any language query containing sheep to the nearly central regions, because most queries about sheep have ground-truth locations at the image center. First, we frame the visual grounding pipeline into a causal graph, which shows the causalities among image, query, target location and an underlying confounder. Through the causal graph, we know how to break the grounding bottleneck: deconfounded visual grounding. Second, to tackle the challenge that the confounder is unobserved in general, we propose a confounder-agnostic approach called Referring Expression Deconfounder (RED) to remove the confounding bias. Third, we implement RED as a simple language attention, which can be applied in any grounding method. On popular benchmarks, RED improves various state-of-the-art grounding methods by a significant margin. Code is available at: https://github.com/JianqiangH/Deconfounded_VG.
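Since the paper implements RED "as a simple language attention", a minimal sketch of what such a token-level attention block could look like is given below (assumed module name, dimensions and tensor shapes; not the released Deconfounded_VG code).

```python
import torch
import torch.nn as nn

class LanguageAttention(nn.Module):
    """Minimal sketch of a language-attention block: it scores each query
    token and returns a re-weighted sentence feature for later fusion."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, token_feats, token_mask):
        # token_feats: (batch, n_tokens, dim); token_mask: (batch, n_tokens), 1 = real token
        logits = self.score(token_feats).squeeze(-1)            # (batch, n_tokens)
        logits = logits.masked_fill(token_mask == 0, -1e9)      # ignore padding
        weights = torch.softmax(logits, dim=-1)                 # attention over tokens
        return (weights.unsqueeze(-1) * token_feats).sum(dim=1) # (batch, dim)

tokens = torch.randn(2, 6, 256)
mask = torch.ones(2, 6)
print(LanguageAttention()(tokens, mask).shape)  # torch.Size([2, 256])
```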
APA, Harvard, Vancouver, ISO, and other styles
35

Khalil, Esam N. "Grounding in Text Structure." Australian Journal of Linguistics 22, no. 2 (October 2002): 173–90. http://dx.doi.org/10.1080/07268600120122599.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Mishra, Bamadeb, Ananya Mishra, and Rabindra Kumar Swain. "Moderating effect of CSR on the Direction and Strength of the Relationship between Corporate Governance Mechanism and Integrated Reporting Quality: An Evidence from India." European Economic Letters (EEL) 14, no. 1 (March 13, 2024): 1395–411. http://dx.doi.org/10.52783/eel.v14i1.1195.

Full text
Abstract:
Grounded in the perspective of agency theory, the current study examines the association between companies' governance practices, in the form of board attributes, and the quality of integrated reporting (IR) among selected listed Indian companies. The study also evaluates whether the association between board attributes and IR quality is moderated by corporate social responsibility (CSR). To test the hypotheses, the study considers a sample of the 25 top Indian corporates in the energy industry listed in the BSE-500 group. The study period extends from 2017-18 to 2021-22. Pooled OLS regression analysis is used to test the impact of corporate governance practices on IR quality and to evaluate whether CSR moderates the strength and direction of their relationship. To ascertain the IR disclosure score of the sample companies, a checklist is developed based on the IR Framework devised by the International Integrated Reporting Council (IIRC), and the technique of visual content analysis is applied. Board size, firm size, leverage, ROE, ROA and the market-to-book value ratio are considered as control variables to strengthen the panel data model. The study finds that board characteristics have a positive relationship with IR quality (IRQ), and that CSR positively moderates the association between the corporate governance mechanism and IRQ. The significance of corporate governance in the process of managerial decision making is the major theoretical contribution of this research.
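For readers unfamiliar with moderation analysis, the sketch below shows how a CSR interaction term enters a pooled OLS regression on stacked firm-year data; all variable names and figures are hypothetical and do not reproduce the study's dataset.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stacked (pooled) panel: one row per firm-year observation.
df = pd.DataFrame({
    "irq":         [0.62, 0.58, 0.71, 0.55, 0.66, 0.73, 0.60, 0.69],
    "board_index": [0.70, 0.65, 0.80, 0.60, 0.75, 0.85, 0.68, 0.78],
    "csr":         [0.50, 0.40, 0.70, 0.30, 0.60, 0.80, 0.45, 0.65],
    "firm_size":   [8.1, 7.9, 8.4, 7.5, 8.0, 8.6, 7.8, 8.2],
    "leverage":    [0.45, 0.50, 0.40, 0.55, 0.42, 0.38, 0.48, 0.41],
})

# The board_index:csr interaction term captures whether CSR moderates the
# board-attributes -> reporting-quality relationship (pooled OLS on stacked data).
model = smf.ols("irq ~ board_index * csr + firm_size + leverage", data=df).fit()
print(model.params)
```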
APA, Harvard, Vancouver, ISO, and other styles
37

Dong, Wenjian, Mayu Otani, Noa Garcia, Yuta Nakashima, and Chenhui Chu. "Cross-Lingual Visual Grounding." IEEE Access 9 (2021): 349–58. http://dx.doi.org/10.1109/access.2020.3046719.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Jiang, Wenhui, Yibo Cheng, Linxin Liu, Yuming Fang, Yuxin Peng, and Yang Liu. "Comprehensive Visual Grounding for Video Description." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 3 (March 24, 2024): 2552–60. http://dx.doi.org/10.1609/aaai.v38i3.28032.

Full text
Abstract:
The grounding accuracy of existing video captioners still falls short of expectations. The majority of existing methods perform grounded video captioning on sparse entity annotations, and captioning accuracy often suffers from degraded object appearances in the annotated area, such as motion blur and video defocus. Moreover, these methods seldom consider the complex interactions among entities. In this paper, we propose a comprehensive visual grounding network to improve video captioning by explicitly linking the entities and actions to the visual clues across the video frames. Specifically, the network consists of spatial-temporal entity grounding and action grounding. The proposed entity grounding encourages the attention mechanism to focus on informative spatial areas across video frames, even though the entity is annotated in only one frame of a video. The action grounding dynamically associates the verbs with related subjects and the corresponding context, which keeps fine-grained spatial and temporal details for action prediction. Both entity grounding and action grounding are formulated as a unified task guided by a soft grounding supervision, which simplifies the architecture and improves training efficiency as well. We conduct extensive experiments on two challenging datasets and demonstrate significant performance improvements of +2.3 CIDEr on ActivityNet-Entities and +2.2 CIDEr on MSR-VTT compared to the state of the art.
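One plausible way to phrase a "soft grounding supervision" is a KL-divergence term between predicted region attention and a soft target distribution, as sketched below; the formulation is an assumption for illustration, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def soft_grounding_loss(attn_logits, soft_targets):
    """KL divergence between the model's region attention and a soft
    ground-truth distribution over regions (e.g. smeared out from a single
    annotated frame). attn_logits, soft_targets: (batch, n_regions)."""
    log_attn = F.log_softmax(attn_logits, dim=-1)
    return F.kl_div(log_attn, soft_targets, reduction="batchmean")

logits = torch.randn(4, 36)                      # attention logits over 36 regions
targets = torch.softmax(torch.randn(4, 36), -1)  # soft supervision, rows sum to 1
print(soft_grounding_loss(logits, targets))
```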
APA, Harvard, Vancouver, ISO, and other styles
39

Liu, Yongfei, Bo Wan, Xiaodan Zhu, and Xuming He. "Learning Cross-Modal Context Graph for Visual Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11645–52. http://dx.doi.org/10.1609/aaai.v34i07.6833.

Full text
Abstract:
Visual grounding is a ubiquitous building block in many vision-language tasks and yet remains challenging due to large variations in the visual and linguistic features of grounding entities, strong context effects and the resulting semantic ambiguities. Prior works typically focus on learning representations of individual phrases with limited context information. To address these limitations, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task. In particular, we introduce a modular graph neural network to compute context-aware representations of phrases and object proposals respectively via message propagation, followed by a graph-based matching module to generate globally consistent localization of grounding phrases. We train the entire graph neural network jointly in a two-stage strategy and evaluate it on the Flickr30K Entities benchmark. Extensive experiments show that our method outperforms the prior state of the art by a sizable margin, evidencing the efficacy of our grounding framework. Code is available at https://github.com/youngfly11/LCMCG-PyTorch.
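The message-propagation step at the core of such a graph network can be sketched as follows, with a language-conditioned adjacency standing in for the learned relation weights; the module name, dimensions and shapes are assumptions rather than the LCMCG implementation.

```python
import torch
import torch.nn as nn

class MessagePassingStep(nn.Module):
    """One round of message propagation: each node aggregates neighbour
    features through a normalised adjacency and updates with an MLP."""

    def __init__(self, dim: int = 128):
        super().__init__()
        self.update = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, node_feats, adj):
        # node_feats: (n_nodes, dim); adj: (n_nodes, n_nodes) non-negative edge weights
        norm = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        messages = norm @ node_feats  # aggregate neighbour features
        return self.update(torch.cat([node_feats, messages], dim=-1))

nodes = torch.randn(5, 128)  # e.g. object-proposal nodes
adj = torch.rand(5, 5)       # e.g. language-guided relation weights
print(MessagePassingStep()(nodes, adj).shape)  # torch.Size([5, 128])
```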
APA, Harvard, Vancouver, ISO, and other styles
40

Wang, Ning, Jiajun Deng, and Mingbo Jia. "Cycle-Consistency Learning for Captioning and Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 6 (March 24, 2024): 5535–43. http://dx.doi.org/10.1609/aaai.v38i6.28363.

Full text
Abstract:
We show that visual grounding and image captioning, which are two mutually inverse processes, can be bridged for collaborative training through careful design. Consolidating this idea, we introduce CyCo, a cycle-consistent learning framework that ameliorates the independent training pipelines of visual grounding and image captioning. The proposed framework (1) allows semi-weakly supervised training of visual grounding; (2) improves the performance of fully supervised visual grounding; and (3) yields a general captioning model that can describe arbitrary image regions. Extensive experiments show that our fully supervised grounding model achieves state-of-the-art performance, and the semi-weakly supervised one also exhibits competitive performance compared to the fully supervised counterparts. Our image captioning model can freely describe image regions and meanwhile shows impressive performance on prevalent captioning benchmarks.
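The cycle idea can be illustrated with a toy consistency penalty: the box a captioner is asked to describe should match the box recovered by grounding the generated caption. The sketch below assumes normalised box coordinates and stand-in tensors rather than CyCo's actual components.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(boxes, regrounded_boxes):
    """Penalise the gap between the boxes a captioner was asked to describe
    and the boxes recovered by grounding its generated captions.
    Both tensors: (batch, 4) in normalised (cx, cy, w, h) form."""
    return F.l1_loss(regrounded_boxes, boxes)

# Stand-in tensors for one training step of the cycle:
# region -> caption (captioner) -> box (grounder); the two boxes should agree.
original = torch.rand(8, 4)
recovered = original + 0.05 * torch.randn(8, 4)
print(cycle_consistency_loss(original, recovered))
```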
APA, Harvard, Vancouver, ISO, and other styles
41

Tomaškovičová, Soňa, Thomas Ingeman-Nielsen, Anders V. Christiansen, Inooraq Brandt, Torleif Dahlin, and Bo Elberling. "Effect of electrode shape on grounding resistances — Part 2: Experimental results and cryospheric monitoring." GEOPHYSICS 81, no. 1 (January 1, 2016): WA169—WA182. http://dx.doi.org/10.1190/geo2015-0148.1.

Full text
Abstract:
Although electric resistivity tomography (ERT) is now regarded as a standard tool in permafrost monitoring, high grounding resistances continue to limit the acquisition of time series over complete freeze-thaw cycles. In an attempt to alleviate the grounding resistance problem, we have tested three electrode designs featuring increasing sizes and surface area, in the laboratory and at three different field sites in Greenland. Grounding resistance measurements showed that changing the electrode shape (using plates instead of rods) reduced the grounding resistances at all sites by 28%–69% during unfrozen and frozen ground conditions. Using meshes instead of plates (the same rectangular shape and a larger effective surface area) further improved the grounding resistances by 29%–37% in winter. Replacement of rod electrodes of one entire permanent permafrost monitoring array by meshes resulted in an immediate reduction of the average grounding resistance by 73% from [Formula: see text] to [Formula: see text] (unfrozen conditions); in addition, the length of the acquisition period during the winter season was markedly prolonged. Grounding resistance time series from the three ERT monitoring stations in Greenland showed that the electrodes were rarely perfectly grounded and that grounding resistances exceeding [Formula: see text] may occur in severe cases. We concluded that the temperature, electrode shape, and lithology at the sites have a marked impact on electrode performance. Choosing an optimized electrode design may be the deciding factor for successful data acquisition, and should therefore be considered when planning a long-term monitoring project.
APA, Harvard, Vancouver, ISO, and other styles
42

Shridhar, Mohit, Dixant Mittal, and David Hsu. "INGRESS: Interactive visual grounding of referring expressions." International Journal of Robotics Research 39, no. 2-3 (January 2, 2020): 217–32. http://dx.doi.org/10.1177/0278364919897133.

Full text
Abstract:
This article presents INGRESS, a robot system that follows human natural language instructions to pick and place everyday objects. The key question here is to ground referring expressions: understand expressions about objects and their relationships from image and natural language inputs. INGRESS allows unconstrained object categories and rich language expressions. Further, it asks questions to clarify ambiguous referring expressions interactively. To achieve these, we take the approach of grounding by generation and propose a two-stage neural-network model for grounding. The first stage uses a neural network to generate visual descriptions of objects, compares them with the input language expressions, and identifies a set of candidate objects. The second stage uses another neural network to examine all pairwise relations between the candidates and infers the most likely referred objects. The same neural networks are used for both grounding and question generation for disambiguation. Experiments show that INGRESS outperformed a state-of-the-art method on the RefCOCO dataset and in robot experiments with humans. The INGRESS source code is available at https://github.com/MohitShridhar/ingress .
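The grounding-by-generation idea, reduced to its simplest form, scores each candidate object by the similarity between its generated description and the user's expression; the toy bag-of-words ranking below only illustrates that scoring step, not INGRESS's neural models.

```python
import numpy as np

def bag_of_words(text, vocab):
    vec = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec

def rank_candidates(query, generated_descriptions):
    """Rank candidate objects by cosine similarity between the query
    expression and each candidate's generated description."""
    vocab = {w: i for i, w in enumerate(
        sorted({w for t in [query, *generated_descriptions] for w in t.lower().split()}))}
    q = bag_of_words(query, vocab)
    scores = []
    for desc in generated_descriptions:
        d = bag_of_words(desc, vocab)
        scores.append(float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-9)))
    return sorted(enumerate(scores), key=lambda x: -x[1])

descs = ["the red mug on the left", "a blue bowl", "the red mug near the laptop"]
print(rank_candidates("red mug next to the laptop", descs))
```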
APA, Harvard, Vancouver, ISO, and other styles
43

Baroni, Marco. "Grounding Distributional Semantics in the Visual World." Language and Linguistics Compass 10, no. 1 (December 8, 2015): 3–13. http://dx.doi.org/10.1111/lnc3.12170.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Sui, Zezhou, Mian Zhou, Zhikun Feng, Angelos Stefanidis, and Nan Jiang. "Language-Led Visual Grounding and Future Possibilities." Electronics 12, no. 14 (July 20, 2023): 3142. http://dx.doi.org/10.3390/electronics12143142.

Full text
Abstract:
In recent years, with the rapid development of computer vision technology, the popularity of intelligent hardware, and the increasing demand for human–machine interaction in intelligent products, visual localization technology can help machines and humans recognize and locate objects, thereby promoting human–machine interaction and intelligent manufacturing. At the same time, human–machine interaction is constantly evolving and improving, becoming increasingly intelligent, humanized and efficient. In this article, a new visual localization model is proposed, and a language validation module is designed that treats language information as the primary cue in order to increase the model's interactivity. In addition, we outline future possibilities for visual localization and provide two examples that explore the application and optimization directions of visual localization and human–machine interaction technology in practical scenarios, providing reference and guidance for relevant researchers and promoting the development and application of these technologies.
APA, Harvard, Vancouver, ISO, and other styles
45

Zhang, Junqian, Long Tu, Yakun Zhang, Liang Xie, Minpeng Xu, Dong Ming, Ye Yan, and Erwei Yin. "An Accuracy Enhanced Vision Language Grounding Method Fused with Gaze Intention." Electronics 12, no. 24 (December 14, 2023): 5007. http://dx.doi.org/10.3390/electronics12245007.

Full text
Abstract:
Visual grounding aims to recognize and locate the target in an image according to human intention, which provides a new idea and method for intelligent interaction in augmented reality (AR) and virtual reality (VR) devices. However, existing vision-language grounding relies on the language modality alone and performs poorly for images containing multiple similar objects. Gaze is an important interaction mode in AR/VR devices, and it offers an advanced solution for such inaccurate vision-language grounding cases. Based on the above questions and analysis, a vision-language grounding framework fused with gaze intention is proposed. First, we collect manual gaze annotations using an AR device and construct a novel multi-modal dataset, RefCOCOg-Gaze, combining it with the proposed data augmentation methods. Second, an attention-based multi-modal feature fusion model is designed, providing a baseline framework for vision-language grounding with gaze intention (VLG-Gaze). Through a series of precisely designed experiments, we analyze the proposed dataset and framework qualitatively and quantitatively. Compared with the state-of-the-art vision-language grounding model, our proposed scheme improves accuracy by 5.3%, which indicates the significance of gaze fusion in multi-modal grounding tasks.
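A simple, hypothetical way to fuse a gaze prior into region selection is to bias the language-driven attention logits with log gaze saliency, as sketched below; the names, shapes and fusion rule are assumptions, not the VLG-Gaze architecture.

```python
import torch

def gaze_biased_attention(region_logits, gaze_saliency, alpha=1.0):
    """Combine language-driven region scores with a gaze prior.

    region_logits -- (batch, n_regions) scores from the vision-language model
    gaze_saliency -- (batch, n_regions) non-negative gaze weights per region
    """
    biased = region_logits + alpha * torch.log(gaze_saliency + 1e-6)
    return torch.softmax(biased, dim=-1)

logits = torch.randn(2, 10)
gaze = torch.rand(2, 10)
print(gaze_biased_attention(logits, gaze).sum(dim=-1))  # each row sums to 1
```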
APA, Harvard, Vancouver, ISO, and other styles
46

Zohourianshahzadi, Zanyar, and Jugal K. Kalita. "Neural Twins Talk and Alternative Calculations." International Journal of Semantic Computing 15, no. 01 (March 2021): 93–116. http://dx.doi.org/10.1142/s1793351x21500045.

Full text
Abstract:
Inspired by how the human brain employs more neural pathways when increasing its focus on a subject, we introduce a novel twin cascaded attention model that outperforms a state-of-the-art image captioning model originally implemented with a single attention channel for the visual grounding task. Visual grounding ensures that words in the caption sentence are grounded in particular regions of the input image. After a deep learning model is trained on the visual grounding task, it employs the learned patterns regarding visual grounding and the order of objects in the caption sentences when generating captions. We report the results of our experiments on three image captioning tasks on the COCO dataset, using standard image captioning metrics to show the improvements achieved by our model over the previous image captioning model. The results gathered from our experiments suggest that employing more parallel attention pathways in a deep neural network leads to higher performance. Our implementation of Neural Twins Talk (NTT) is publicly available at: https://github.com/zanyarz/NeuralTwinsTalk .
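The "parallel attention pathways" idea can be sketched as two attention channels over the same region features whose outputs are merged by a learned gate; the module below is an illustrative assumption, not the released NTT implementation.

```python
import torch
import torch.nn as nn

class TwinAttention(nn.Module):
    """Two parallel attention channels over region features, driven by the
    same decoder state; their outputs are fused by a learned gate."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.attn_b = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, decoder_state, region_feats):
        # decoder_state: (batch, 1, dim); region_feats: (batch, n_regions, dim)
        out_a, _ = self.attn_a(decoder_state, region_feats, region_feats)
        out_b, _ = self.attn_b(decoder_state, region_feats, region_feats)
        g = torch.sigmoid(self.gate(torch.cat([out_a, out_b], dim=-1)))
        return g * out_a + (1 - g) * out_b

state = torch.randn(2, 1, 512)
regions = torch.randn(2, 36, 512)
print(TwinAttention()(state, regions).shape)  # torch.Size([2, 1, 512])
```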
APA, Harvard, Vancouver, ISO, and other styles
47

Zeldovich, G. M. "The mutual similarity of meanings and structures in a literary text." Slovo.ru: Baltic accent 11, no. 1 (2020): 87–100. http://dx.doi.org/10.5922/2225-5346-2020-1-5.

Full text
Abstract:
This paper discusses a discourse grounding strategy that has not been described before. It is shown that the fragments of a literary text that are perceived as impressive, aphoristic, etc., tend to have a set of recurrent features. First, in such fragments there is often a mutual reflectedness of meanings (it emerges in metaphors, similes, parallelisms, or juxtapositions of contradictory notions). Second, this mutual reflectedness undergoes pronounced detrivialization, i.e. it is emphasised by special means, one of which is the ostentatious intricacy of the text, usually achieved through amphiboly, or intended ambiguity. Third, there is usually a strong anaphoric link between such fragments and the preceding text, i.e. a link between subjects and/or objects (which does not exclude adjunct-based links). Fourth, the type of discourse relation between such fragments and the previous text is highly predictable. The main conclusion drawn in the article is that the described set of properties, which is instrumental in discourse grounding, is widely used in literature, on the one hand, and is much more complex than the grounding devices studied earlier in narratology, on the other.
APA, Harvard, Vancouver, ISO, and other styles
48

Ou, Jiefu, Adithya Pratapa, Rishubh Gupta, and Teruko Mitamura. "Hierarchical Event Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 11 (June 26, 2023): 13437–45. http://dx.doi.org/10.1609/aaai.v37i11.26576.

Full text
Abstract:
Event grounding aims at linking mention references in text corpora to events from a knowledge base (KB). Previous work on this task focused primarily on linking to a single KB event, thereby overlooking the hierarchical aspects of events. Events in documents are typically described at various levels of spatio-temporal granularity. These hierarchical relations are utilized in downstream tasks of narrative understanding and schema construction. In this work, we present an extension to the event grounding task that requires tackling hierarchical event structures from the KB. Our proposed task involves linking a mention reference to a set of event labels from a subevent hierarchy in the KB. We propose a retrieval methodology that leverages event hierarchy through an auxiliary hierarchical loss. On an automatically created multilingual dataset from Wikipedia and Wikidata, our experiments demonstrate the effectiveness of the hierarchical loss against retrieve and re-rank baselines. Furthermore, we demonstrate the systems' ability to aid hierarchical discovery among unseen events. Code is available at https://github.com/JefferyO/Hierarchical-Event-Grounding
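An auxiliary hierarchical loss of this kind could, for example, add a down-weighted positive term for the gold event's ancestors on top of a standard retrieval objective; the sketch below is one hedged reading of that idea, not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def hierarchical_retrieval_loss(mention_emb, event_embs, gold_idx, ancestor_idx, beta=0.5):
    """Contrastive retrieval loss with an auxiliary term for the hierarchy.

    mention_emb  -- (dim,) embedding of the mention
    event_embs   -- (n_events, dim) embeddings of KB events
    gold_idx     -- index of the gold event
    ancestor_idx -- indices of its ancestors in the subevent hierarchy
    """
    scores = event_embs @ mention_emb              # (n_events,)
    log_probs = F.log_softmax(scores, dim=-1)
    loss = -log_probs[gold_idx]                    # main retrieval term
    if ancestor_idx:
        loss = loss - beta * log_probs[ancestor_idx].mean()  # auxiliary hierarchical term
    return loss

events = F.normalize(torch.randn(100, 64), dim=-1)
mention = F.normalize(torch.randn(64), dim=-1)
print(hierarchical_retrieval_loss(mention, events, gold_idx=3, ancestor_idx=[1, 2]))
```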
APA, Harvard, Vancouver, ISO, and other styles
49

Gall-Maynard, David. "The End Is Always Near: Evaluating the Influence of Premillennial Apocalyptic Rhetoric on Evangelical Christian Attitudes toward Climate Change Discourse." CEA Critic 85, no. 3 (November 2023): 233–40. http://dx.doi.org/10.1353/cea.2023.a912099.

Full text
Abstract:
One recurring feature of climate change discourse—and apocalyptic rhetoric more generally—is the appeal to textual authorities whose knowledge transcends that of rhetor and audience. "For religious apocalyptic," Brummett writes, "the grounding text will be one or more of the scriptures of the religion; for secular apocalyptic, the grounding text will be the assertion of a natural law governing the domain in question or it will be a widely revered secular text" (99). As a case in point, climate reform rhetors often invoke the authority of the scientific consensus surrounding climate change, research that is the basis for the IPCC's [Intergovernmental Panel on Climate Change] reports.
APA, Harvard, Vancouver, ISO, and other styles
50

Ingeman-Nielsen, Thomas, Soňa Tomaškovičová, and Torleif Dahlin. "Effect of electrode shape on grounding resistances — Part 1: The focus-one protocol." GEOPHYSICS 81, no. 1 (January 1, 2016): WA159—WA167. http://dx.doi.org/10.1190/geo2015-0484.1.

Full text
Abstract:
Electrode grounding resistance is a major factor affecting measurement quality in electric resistivity tomography (ERT) measurements for cryospheric applications. Still, little information is available on grounding resistances in the geophysical literature, mainly because it is difficult to measure. The focus-one protocol is a new method for estimating single electrode grounding resistances by measuring the resistance between a single electrode in an ERT array and all the remaining electrodes connected in parallel. For large arrays, the measured resistance is dominated by the grounding resistance of the electrode under test, the focus electrode. We have developed an equivalent circuit model formulation for the resistance measured when applying the focus-one protocol. Our model depends on the individual grounding resistances of the electrodes of the array, the mutual resistances between electrodes, and the instrument input impedance. Using analytical formulations for the potentials around prolate and oblate spheroidal electrode models (as approximations for rod and plate electrodes), we have investigated the performance and accuracy of the focus-one protocol in estimating single-electrode grounding resistances. We also found that the focus-one protocol provided accurate estimations of electrode grounding resistances to within [Formula: see text] for arrays of 30 electrodes or more when the ratio of instrument input impedance to the half-space resistivity was [Formula: see text] or more. The focus-one protocol was of high practical value in field operations because it helped to optimize array installation, electrode design, and placement. The measured grounding resistances may also be included in future inversion schemes to improve data interpretation under difficult environmental conditions such as those encountered in cryospheric applications.
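The focus-one idea can be illustrated with the simplest possible equivalent circuit: the focus electrode's grounding resistance in series with the parallel combination of all remaining electrodes (mutual resistances and instrument input impedance neglected); the numbers below are made up.

```python
import numpy as np

def focus_one_estimate(groundings, focus):
    """Resistance seen in a focus-one measurement under a simple equivalent
    circuit: the focus electrode in series with all other electrodes in
    parallel (mutual resistances and input impedance neglected)."""
    others = np.delete(groundings, focus)
    parallel = 1.0 / np.sum(1.0 / others)
    return groundings[focus] + parallel

# 32 hypothetical electrodes with grounding resistances around 2 kohm.
rng = np.random.default_rng(0)
r = rng.uniform(1e3, 3e3, size=32)
print(f"true grounding of electrode 0: {r[0]:.0f} ohm")
print(f"focus-one reading:             {focus_one_estimate(r, 0):.0f} ohm")
```

For a large array the parallel term is small, so the reading is dominated by the focus electrode's own grounding resistance, which is the behaviour the abstract describes.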
APA, Harvard, Vancouver, ISO, and other styles