Academic literature on the topic 'Vision language navigation'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the lists of relevant articles, books, theses, conference papers, and other scholarly sources on the topic 'Vision language navigation.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Journal articles on the topic "Vision language navigation"

1

Liang, Xiwen, Fengda Zhu, Yi Zhu, Bingqian Lin, Bing Wang, and Xiaodan Liang. "Contrastive Instruction-Trajectory Learning for Vision-Language Navigation." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 2 (June 28, 2022): 1592–600. http://dx.doi.org/10.1609/aaai.v36i2.20050.

Abstract:
The vision-language navigation (VLN) task requires an agent to reach a target with the guidance of natural language instruction. Previous works learn to navigate step-by-step following an instruction. However, these works may fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions. These problems hinder agents from learning distinctive vision-and-language representations, harming the robustness and generalizability of the navigation policy. In this paper, we propose a Contrastive Instruction-Trajectory Learning (CITL) framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation. Specifically, we propose: (1) a coarse-grained contrastive learning objective to enhance vision-and-language representations by contrasting semantics of full trajectory observations and instructions, respectively; (2) a fine-grained contrastive learning objective to perceive instructions by leveraging the temporal information of the sub-instructions; (3) a pairwise sample-reweighting mechanism for contrastive learning to mine hard samples and hence mitigate the influence of data sampling bias in contrastive learning. Our CITL can be easily integrated with VLN backbones to form a new learning paradigm and achieve better generalizability in unseen environments. Extensive experiments show that the model with CITL surpasses the previous state-of-the-art methods on R2R, R4R, and RxR.
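The coarse-grained objective described in this abstract is an instance of standard contrastive representation learning. Purely as a rough, non-authoritative illustration (not the authors' implementation; the function name, temperature, and toy data are invented), a symmetric InfoNCE-style loss over matched trajectory/instruction embeddings could be sketched as:

```python
import numpy as np

def info_nce(traj_emb, instr_emb, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of matched trajectory/instruction
    embeddings, both of shape (batch, dim). Row i of each array forms the
    positive pair; every other row in the batch serves as a negative."""
    t = traj_emb / np.linalg.norm(traj_emb, axis=1, keepdims=True)
    s = instr_emb / np.linalg.norm(instr_emb, axis=1, keepdims=True)
    logits = t @ s.T / temperature                      # cosine similarities

    def xent_diag(l):
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        idx = np.arange(l.shape[0])
        return -logp[idx, idx].mean()                   # positives sit on the diagonal

    # contrast in both directions: trajectory -> instruction and back
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

# toy usage: 4 trajectory/instruction pairs with 8-dimensional embeddings
rng = np.random.default_rng(0)
print(info_nce(rng.normal(size=(4, 8)), rng.normal(size=(4, 8))))
```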
2

Jin, Jie, Liu Kaiyan, and Zha Shunkao. "Vision-Language Navigation Algorithm Based on Cosine Similarity" [基于余弦相似的视觉语言导航算法]. Laser & Optoelectronics Progress 58, no. 16 (2021): 1615001. http://dx.doi.org/10.3788/lop202158.1615001.

3

Landi, Federico, Lorenzo Baraldi, Marcella Cornia, Massimiliano Corsini, and Rita Cucchiara. "Multimodal attention networks for low-level vision-and-language navigation." Computer Vision and Image Understanding 210 (September 2021): 103255. http://dx.doi.org/10.1016/j.cviu.2021.103255.

4

Li, Xin, Yu Zhang, Weilin Yuan, and Junren Luo. "Incorporating External Knowledge Reasoning for Vision-and-Language Navigation with Assistant’s Help." Applied Sciences 12, no. 14 (July 13, 2022): 7053. http://dx.doi.org/10.3390/app12147053.

Abstract:
Vision-and-Language Navigation (VLN) is a task designed to enable embodied agents to carry out natural language instructions in realistic environments. Most VLN tasks, however, are guided by an elaborate set of instructions that is depicted step-by-step. This approach deviates from real-world problems in which humans only describe the object and its surroundings and allow the robot to ask for help when required. Vision-based Navigation with Language-based Assistance (VNLA) is a recently proposed task that requires an agent to navigate and find a target object according to a high-level language instruction. Due to the lack of step-by-step navigation guidance, the key to VNLA is to conduct goal-oriented exploration. In this paper, we design an Attention-based Knowledge-enabled Cross-modality Reasoning with Assistant’s Help (AKCR-AH) model to address the unique challenges of this task. AKCR-AH learns a generalized navigation strategy from three new perspectives: (1) external commonsense knowledge is incorporated into visual relational reasoning, so as to take proper action at each viewpoint by learning the internal–external correlations among object- and room-entities; (2) a simulated human assistant is introduced in the environment, who provides direct intervention assistance when required; (3) a memory-based Transformer architecture is adopted as the policy framework to make full use of the history clues stored in memory tokens for exploration. Extensive experiments demonstrate the effectiveness of our method compared with other baselines.
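The external-knowledge component is described only at a high level in the abstract. Purely as a toy sketch (the co-occurrence table, detector outputs, and function names are invented and are not taken from AKCR-AH), re-ranking candidate viewpoints by how strongly their detected entities relate to the target object might be expressed as:

```python
# Toy external-knowledge re-ranking; the co-occurrence table and detections are invented.
KNOWLEDGE = {
    "mug":    {"kitchen", "sink", "coffee machine", "table"},
    "towel":  {"bathroom", "sink", "shower"},
    "pillow": {"bedroom", "bed", "sofa"},
}

def knowledge_score(target, detected_entities):
    """Fraction of entities detected in a candidate view that the knowledge
    base relates to the target object."""
    related = KNOWLEDGE.get(target, set())
    if not detected_entities:
        return 0.0
    return sum(e in related for e in detected_entities) / len(detected_entities)

def rerank(target, candidates):
    """candidates maps viewpoint id -> detected entity names; viewpoints whose
    entities co-occur with the target are ranked first."""
    return sorted(candidates,
                  key=lambda v: knowledge_score(target, candidates[v]),
                  reverse=True)

views = {"v1": ["sofa", "tv"], "v2": ["sink", "coffee machine"], "v3": []}
print(rerank("mug", views))   # 'v2' first: its entities co-occur with 'mug'
```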
5

Hwang, Jisu, and Incheol Kim. "Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation." Sensors 21, no. 3 (February 2, 2021): 1012. http://dx.doi.org/10.3390/s21031012.

Abstract:
Due to the development of computer vision and natural language processing technologies in recent years, there has been a growing interest in multimodal intelligent tasks that require the ability to concurrently understand various forms of input data such as images and text. Vision-and-language navigation (VLN) requires the alignment and grounding of multimodal input data to enable real-time perception of the task status on panoramic images and natural language instruction. This study proposes a novel deep neural network model (JMEBS), with joint multimodal embedding and backtracking search for VLN tasks. The proposed JMEBS model uses a transformer-based joint multimodal embedding module. JMEBS uses both multimodal context and temporal context. It also employs backtracking-enabled greedy local search (BGLS), a novel algorithm with a backtracking feature designed to improve the task success rate and optimize the navigation path, based on the local and global scores related to candidate actions. A novel global scoring method is also used for performance improvement by comparing the partial trajectories searched thus far with a plurality of natural language instructions. The performance of the proposed model on various operations was then experimentally demonstrated and compared with other models using the Matterport3D Simulator and room-to-room (R2R) benchmark datasets.
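The backtracking-enabled greedy local search is only described informally in the abstract. A minimal best-first sketch that keeps unexplored alternatives so the agent can back up to a more promising viewpoint, assuming a single generic scoring function in place of the paper's local/global scores, might look like:

```python
import heapq
import itertools

def backtracking_greedy_search(start, neighbors, score, is_goal, max_steps=100):
    """Greedy navigation that keeps unexplored alternatives on a heap, so the
    agent can back up to an earlier, higher-scoring viewpoint when the current
    branch turns out to be unpromising. `score` rates a partial trajectory and
    stands in for the combined local/global scores used in the paper."""
    tie = itertools.count()                       # break score ties without comparing paths
    frontier = [(-score([start]), next(tie), [start])]
    visited = set()
    for _ in range(max_steps):
        if not frontier:
            break
        _, _, path = heapq.heappop(frontier)      # most promising partial trajectory so far
        node = path[-1]
        if is_goal(node):
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt in neighbors(node):
            if nxt not in visited:
                new_path = path + [nxt]
                heapq.heappush(frontier, (-score(new_path), next(tie), new_path))
    return None

# toy navigation graph and scoring
graph = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": []}
print(backtracking_greedy_search(
    "A",
    neighbors=lambda n: graph[n],
    score=lambda p: -len(p),                      # toy score: prefer shorter partial paths
    is_goal=lambda n: n == "E",
))                                                # -> ['A', 'C', 'E']
```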
6

Chi, Ta-Chung, Minmin Shen, Mihail Eric, Seokhwan Kim, and Dilek Hakkani-tur. "Just Ask: An Interactive Learning Framework for Vision and Language Navigation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 03 (April 3, 2020): 2459–66. http://dx.doi.org/10.1609/aaai.v34i03.5627.

Abstract:
In the vision and language navigation task (Anderson et al. 2018), the agent may encounter ambiguous situations that are hard to interpret by just relying on visual information and natural language instructions. We propose an interactive learning framework to endow the agent with the ability to ask for users' help in such situations. As part of this framework, we investigate multiple learning approaches for the agent with different levels of complexity. The simplest model-confusion-based method lets the agent ask questions based on its confusion, relying on the predefined confidence threshold of a next action prediction model. To build on this confusion-based method, the agent is expected to demonstrate more sophisticated reasoning such that it discovers the timing and locations to interact with a human. We achieve this goal using reinforcement learning (RL) with a proposed reward shaping term, which enables the agent to ask questions only when necessary. The success rate can be boosted by at least 15% with only one question asked on average during the navigation. Furthermore, we show that the RL agent is capable of adjusting dynamically to noisy human responses. Finally, we design a continual learning strategy, which can be viewed as a data augmentation method, for the agent to improve further utilizing its interaction history with a human. We demonstrate the proposed strategy is substantially more realistic and data-efficient compared to previously proposed pre-exploration techniques.
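The simplest confusion-based baseline mentioned above reduces to a confidence threshold on the next-action distribution. A minimal sketch (the threshold and logits here are illustrative, not the paper's values) is:

```python
import numpy as np

def maybe_ask(action_logits, threshold=0.5):
    """Ask the user for help whenever the next-action prediction is not
    confident enough; otherwise act greedily."""
    p = np.exp(action_logits - action_logits.max())
    p /= p.sum()                                  # softmax over candidate actions
    if p.max() < threshold:
        return "ask"                              # defer to the human assistant
    return int(p.argmax())                        # take the most likely action

print(maybe_ask(np.array([2.0, 1.9, 1.8])))       # near-uniform distribution -> 'ask'
print(maybe_ask(np.array([4.0, 0.5, 0.1])))       # confident -> action index 0
```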
7

Francis, Jonathan, Nariaki Kitamura, Felix Labelle, Xiaopeng Lu, Ingrid Navarro, and Jean Oh. "Core Challenges in Embodied Vision-Language Planning." Journal of Artificial Intelligence Research 74 (May 28, 2022): 459–515. http://dx.doi.org/10.1613/jair.1.13646.

Abstract:
Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approaches and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly use computer vision and natural language. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the new and current algorithmic approaches, metrics, simulated environments, as well as the datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalizability and furthers real-world deployment.
8

Magassouba, Aly, Komei Sugiura, and Hisashi Kawai. "CrossMap Transformer: A Crossmodal Masked Path Transformer Using Double Back-Translation for Vision-and-Language Navigation." IEEE Robotics and Automation Letters 6, no. 4 (October 2021): 6258–65. http://dx.doi.org/10.1109/lra.2021.3092686.

9

Rutendo, M., and M. A. Al Akkad. "Exploiting Machine Learning for Vision and Motion Planning of Autonomous Vehicles Navigation." Intellekt. Sist. Proizv. 19, no. 3 (2021): 95–104. http://dx.doi.org/10.22213/2410-9304-2021-3-95-104.

Abstract:
The object of this paper is to create a system that can control any vehicle in any gaming environment to simulate, study, experiment and improve how self-driving vehicles operate. It is to be taken as the basis for future work on autonomous vehicles with real hardware devices. The long-term goal is to eliminate human error. Perception, localisation, planning and control subsystems were developed. LiDAR and RADAR sensors were used in addition to a normal web camera. After getting information from the perception module, the system will be able to localise where the vehicle is, then the planning module is used to plan to which location the vehicle will move, using localisation module data to draw up the best path to use. After knowing the best path, the system will control the vehicle to move autonomously without human help. As a controller, a Proportional Integral Derivative (PID) controller was used. Python programming language, computer vision, and machine learning were used in developing the system, where the only hardware required is a computer with a GPU and powerful graphics card that can run a game which has a vehicle, roads with lane lines and a map of the road. The developed system is intended to be a good tool in conducting experiments for achieving reliable autonomous vehicle navigation.
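The control stage described here relies on a standard PID loop. As a generic sketch only (the gains and error signal are illustrative, not the paper's tuned values):

```python
class PID:
    """Minimal discrete PID controller; gains below are illustrative only."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt):
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# e.g. a steering correction where error = lateral offset from the planned path
controller = PID(kp=0.8, ki=0.05, kd=0.2)
for offset in [1.0, 0.6, 0.3, 0.1]:
    print(round(controller.step(offset, dt=0.1), 3))
```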
10

Skinnider, Michael A., R. Greg Stacey, David S. Wishart, and Leonard J. Foster. "Chemical language models enable navigation in sparsely populated chemical space." Nature Machine Intelligence 3, no. 9 (July 19, 2021): 759–70. http://dx.doi.org/10.1038/s42256-021-00368-1.


Dissertations / Theses on the topic "Vision language navigation"

1

Dakopoulos, Dimitrios. "Tyflos: A Wearable Navigation Prototype for Blind & Visually Impaired; Design, Modelling and Experimental Results." Wright State University / OhioLINK, 2009. http://rave.ohiolink.edu/etdc/view?acc_num=wright1246542875.

2

Anderson, Peter James. "Vision and Language Learning: From Image Captioning and Visual Question Answering towards Embodied Agents." PhD thesis, 2018. http://hdl.handle.net/1885/164018.

Abstract:
Each time we ask for an object, describe a scene, follow directions or read a document containing images or figures, we are converting information between visual and linguistic representations. Indeed, for many tasks it is essential to reason jointly over visual and linguistic information. People do this with ease, typically without even noticing. Intelligent systems that perform useful tasks in unstructured situations, and interact with people, will also require this ability. In this thesis, we focus on the joint modelling of visual and linguistic information using deep neural networks. We begin by considering the challenging problem of automatically describing the content of an image in natural language, i.e., image captioning. Although there is considerable interest in this task, progress is hindered by the difficulty of evaluating the generated captions. Our first contribution is a new automatic image caption evaluation metric that measures the quality of generated captions by analysing their semantic content. Extensive evaluations across a range of models and datasets indicate that our metric, dubbed SPICE, shows high correlation with human judgements. Armed with a more effective evaluation metric, we address the challenge of image captioning. Visual attention mechanisms have been widely adopted in image captioning and visual question answering (VQA) architectures to facilitate fine-grained visual processing. We extend existing approaches by proposing a bottom-up and top-down attention mechanism that enables attention to be focused at the level of objects and other salient image regions, which is the natural basis for attention to be considered. Applying this approach to image captioning we achieve state of the art results on the COCO test server. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge. Despite these advances, recurrent neural network (RNN) image captioning models typically do not generalise well to out-of-domain images containing novel scenes or objects. This limitation severely hinders the use of these models in real applications. To address this problem, we propose constrained beam search, an approximate search algorithm that enforces constraints over RNN output sequences. Using this approach, we show that existing RNN captioning architectures can take advantage of side information such as object detector outputs and ground-truth image annotations at test time, without retraining. Our results significantly outperform previous approaches that incorporate the same information into the learning algorithm, achieving state of the art results for out-of-domain captioning on COCO. Last, to enable and encourage the application of vision and language methods to problems involving embodied agents, we present the Matterport3D Simulator, a large-scale interactive reinforcement learning environment constructed from densely-sampled panoramic RGB-D images of 90 real buildings. Using this simulator, which can in future support a range of embodied vision and language tasks, we collect the first benchmark dataset for visually-grounded natural language navigation in real buildings. We investigate the difficulty of this task, and particularly the difficulty of operating in unseen environments, using several baselines and a sequence-to-sequence model based on methods successfully applied to other vision and language tasks.
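The bottom-up and top-down attention idea summarised in this abstract amounts to soft attention over pre-extracted region features. As a minimal sketch with a bilinear scoring function (all shapes, names, and the random data are assumptions, not the thesis code):

```python
import numpy as np

def soft_attention(region_feats, query, W):
    """Attend over bottom-up region features (k, d_v) with a query vector (d_q,):
    relevance scores come from a bilinear map W (d_q, d_v), and the output is the
    attention-weighted average of the region features."""
    scores = query @ W @ region_feats.T            # (k,) relevance per region
    scores = scores - scores.max()                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax attention weights
    return alpha @ region_feats, alpha

rng = np.random.default_rng(1)
feats = rng.normal(size=(36, 16))                  # e.g. 36 detected image regions
attended, alpha = soft_attention(feats, rng.normal(size=8), rng.normal(size=(8, 16)))
print(attended.shape, round(alpha.sum(), 6))       # (16,) 1.0
```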

Book chapters on the topic "Vision language navigation"

1

Wang, Hanqing, Wenguan Wang, Tianmin Shu, Wei Liang, and Jianbing Shen. "Active Visual Information Gathering for Vision-Language Navigation." In Computer Vision – ECCV 2020, 307–22. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-58542-6_19.

2

Qi, Yuankai, Zizheng Pan, Shengping Zhang, Anton van den Hengel, and Qi Wu. "Object-and-Action Aware Model for Visual Language Navigation." In Computer Vision – ECCV 2020, 303–17. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-58607-2_18.

3

Fu, Tsu-Jui, Xin Eric Wang, Matthew F. Peterson, Scott T. Grafton, Miguel P. Eckstein, and William Yang Wang. "Counterfactual Vision-and-Language Navigation via Adversarial Path Sampler." In Computer Vision – ECCV 2020, 71–86. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-58539-6_5.

4

Wang, Hu, Qi Wu, and Chunhua Shen. "Soft Expert Reward Learning for Vision-and-Language Navigation." In Computer Vision – ECCV 2020, 126–41. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-58545-7_8.

5

Wang, Xin Eric, Vihan Jain, Eugene Ie, William Yang Wang, Zornitsa Kozareva, and Sujith Ravi. "Environment-Agnostic Multitask Learning for Natural Language Grounded Navigation." In Computer Vision – ECCV 2020, 413–30. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-58586-0_25.

6

Zhou, Kaiwen, and Xin Eric Wang. "FedVLN: Privacy-Preserving Federated Vision-and-Language Navigation." In Lecture Notes in Computer Science, 682–99. Cham: Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-20059-5_39.

7

Krantz, Jacob, Erik Wijmans, Arjun Majumdar, Dhruv Batra, and Stefan Lee. "Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments." In Computer Vision – ECCV 2020, 104–20. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-58604-1_7.

8

Cheng, Wenhao, Xingping Dong, Salman Khan, and Jianbing Shen. "Learning Disentanglement with Decoupled Labels for Vision-Language Navigation." In Lecture Notes in Computer Science, 309–29. Cham: Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-20059-5_18.

9

Majumdar, Arjun, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, and Dhruv Batra. "Improving Vision-and-Language Navigation with Image-Text Pairs from the Web." In Computer Vision – ECCV 2020, 259–74. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-58539-6_16.

10

Chen, Shizhe, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. "Learning from Unlabeled 3D Environments for Vision-and-Language Navigation." In Lecture Notes in Computer Science, 638–55. Cham: Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-19842-7_37.


Conference papers on the topic "Vision language navigation"

1

Zhang, Yubo, Hao Tan, and Mohit Bansal. "Diagnosing the Environment Bias in Vision-and-Language Navigation." In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence {IJCAI-PRICAI-20}. California: International Joint Conferences on Artificial Intelligence Organization, 2020. http://dx.doi.org/10.24963/ijcai.2020/124.

Abstract:
Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations. These step-by-step navigational instructions are crucial when the agent is navigating new environments about which it has no prior knowledge. Most recent works that study VLN observe a significant performance drop when tested on unseen environments (i.e., environments not used in training), indicating that the neural agent models are highly biased towards training environments. Although this issue is considered as one of the major challenges in VLN research, it is still under-studied and needs a clearer explanation. In this work, we design novel diagnosis experiments via environment re-splitting and feature replacement, looking into possible reasons for this environment bias. We observe that neither the language nor the underlying navigational graph, but the low-level visual appearance conveyed by ResNet features directly affects the agent model and contributes to this environment bias in results. According to this observation, we explore several kinds of semantic representations that contain less low-level visual information, hence the agent learned with these features could be better generalized to unseen testing environments. Without modifying the baseline agent model and its training method, our explored semantic features significantly decrease the performance gaps between seen and unseen on multiple datasets (i.e. R2R, R4R, and CVDN) and achieve competitive unseen results to previous state-of-the-art models.
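The feature-replacement diagnosis above swaps low-level visual features for higher-level semantic ones. As a toy illustration only (the class list and detections are invented and do not reproduce the paper's semantic representations), one such replacement feature is a normalised histogram of detected object classes per view:

```python
import numpy as np

# Invented class list and detections, purely to illustrate the idea of replacing
# low-level appearance features with a higher-level semantic representation.
CLASSES = ["door", "table", "chair", "stairs", "window", "sofa"]

def semantic_feature(detected_labels):
    """Normalised histogram of detected object classes for one view, so that
    environment-specific textures and appearance drop out of the representation."""
    vec = np.zeros(len(CLASSES))
    for label in detected_labels:
        if label in CLASSES:
            vec[CLASSES.index(label)] += 1.0
    total = vec.sum()
    return vec / total if total > 0 else vec

print(semantic_feature(["door", "chair", "chair", "lamp"]))  # 'lamp' is outside the vocabulary
```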
2

Wu, Zongkai, Zihan Liu, and Donglin Wang. "Multi-Grounding Navigator for Self-Supervised Vision-and-Language Navigation." In 2021 International Joint Conference on Neural Networks (IJCNN). IEEE, 2021. http://dx.doi.org/10.1109/ijcnn52387.2021.9533628.

3

Hong, Yicong, Cristian Rodriguez, Qi Wu, and Stephen Gould. "Sub-Instruction Aware Vision-and-Language Navigation." In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Stroudsburg, PA, USA: Association for Computational Linguistics, 2020. http://dx.doi.org/10.18653/v1/2020.emnlp-main.271.

4

Wang, Hanqing, Wenguan Wang, Wei Liang, Caiming Xiong, and Jianbing Shen. "Structured Scene Memory for Vision-Language Navigation." In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2021. http://dx.doi.org/10.1109/cvpr46437.2021.00835.

5

Pashevich, Alexander, Cordelia Schmid, and Chen Sun. "Episodic Transformer for Vision-and-Language Navigation." In 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2021. http://dx.doi.org/10.1109/iccv48922.2021.01564.

6

Liu, Chong, Fengda Zhu, Xiaojun Chang, Xiaodan Liang, Zongyuan Ge, and Yi-Dong Shen. "Vision-Language Navigation with Random Environmental Mixup." In 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2021. http://dx.doi.org/10.1109/iccv48922.2021.00167.

7

Huang, Haoshuo, Vihan Jain, Harsh Mehta, Alexander Ku, Gabriel Magalhaes, Jason Baldridge, and Eugene Ie. "Transferable Representation Learning in Vision-and-Language Navigation." In 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2019. http://dx.doi.org/10.1109/iccv.2019.00750.

8

Zhuang, Yifeng, Qiang Sun, Yanwei Fu, Lifeng Chen, and Xiangyang Xue. "Local Slot Attention for Vision and Language Navigation." In ICMR '22: International Conference on Multimedia Retrieval. New York, NY, USA: ACM, 2022. http://dx.doi.org/10.1145/3512527.3531366.

9

Zhu, Wanrong, Yuankai Qi, Pradyumna Narayana, Kazoo Sone, Sugato Basu, Xin Wang, Qi Wu, Miguel Eckstein, and William Yang Wang. "Diagnosing Vision-and-Language Navigation: What Really Matters." In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA, USA: Association for Computational Linguistics, 2022. http://dx.doi.org/10.18653/v1/2022.naacl-main.438.

10

Li, Jialu, Hao Tan, and Mohit Bansal. "Envedit: Environment Editing for Vision-and-Language Navigation." In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022. http://dx.doi.org/10.1109/cvpr52688.2022.01497.
