To see the other types of publications on this topic, follow the link: Visual question generation.

Journal articles on the topic 'Visual question generation'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the top 50 journal articles for your research on the topic 'Visual question generation.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Patil, Charulata, and Manasi Patwardhan. "Visual Question Generation." ACM Computing Surveys 53, no. 3 (July 5, 2020): 1–22. http://dx.doi.org/10.1145/3383465.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Liu, Hongfei, Jiali Chen, Wenhao Fang, Jiayuan Xie, and Yi Cai. "Category-Guided Visual Question Generation (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 13 (June 26, 2023): 16262–63. http://dx.doi.org/10.1609/aaai.v37i13.26991.

Full text
Abstract:
Visual question generation aims to generate high-quality questions related to images. Generating questions based only on images reduces labor costs and can therefore be applied easily. However, existing methods tend to generate similar, generic questions that fail to ask about the specific content of each image scene. In this paper, we propose a category-guided visual question generation model that can generate questions of multiple categories focusing on different objects in an image. Specifically, our model first selects an appropriate question category based on the objects in the image and the relationships among them. Then, we generate corresponding questions based on the selected question categories. Experiments conducted on the TDIUC dataset show that our proposed model outperforms existing models in terms of diversity and quality.
APA, Harvard, Vancouver, ISO, and other styles
3

Mi, Li, Syrielle Montariol, Javiera Castillo Navarro, Xianjie Dai, Antoine Bosselut, and Devis Tuia. "ConVQG: Contrastive Visual Question Generation with Multimodal Guidance." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 5 (March 24, 2024): 4207–15. http://dx.doi.org/10.1609/aaai.v38i5.28216.

Full text
Abstract:
Asking questions about visual environments is a crucial way for intelligent agents to understand rich multi-faceted scenes, raising the importance of Visual Question Generation (VQG) systems. Apart from being grounded to the image, existing VQG systems can use textual constraints, such as expected answers or knowledge triplets, to generate focused questions. These constraints allow VQG systems to specify the question content or leverage external commonsense knowledge that can not be obtained from the image content only. However, generating focused questions using textual constraints while enforcing a high relevance to the image content remains a challenge, as VQG systems often ignore one or both forms of grounding. In this work, we propose Contrastive Visual Question Generation (ConVQG), a method using a dual contrastive objective to discriminate questions generated using both modalities from those based on a single one. Experiments on both knowledge-aware and standard VQG benchmarks demonstrate that ConVQG outperforms the state-of-the-art methods and generates image-grounded, text-guided, and knowledge-rich questions. Our human evaluation results also show preference for ConVQG questions compared to non-contrastive baselines.
APA, Harvard, Vancouver, ISO, and other styles
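The dual contrastive objective described in this abstract can be pictured with a minimal sketch: the embedding of a question generated from both image and text is pulled toward the reference question and pushed away from questions generated from a single modality. All names, tensor shapes, and the InfoNCE-style formulation below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def dual_contrastive_loss(q_joint, q_ref, q_img_only, q_txt_only, tau=0.1):
    """Illustrative dual contrastive objective (assumed formulation).
    All inputs are (batch, dim) question embeddings."""
    q_joint, q_ref, q_img_only, q_txt_only = [
        F.normalize(x, dim=-1) for x in (q_joint, q_ref, q_img_only, q_txt_only)
    ]
    pos = (q_joint * q_ref).sum(-1, keepdim=True) / tau           # joint vs. reference question
    neg_img = (q_joint * q_img_only).sum(-1, keepdim=True) / tau  # joint vs. image-only question
    neg_txt = (q_joint * q_txt_only).sum(-1, keepdim=True) / tau  # joint vs. text-only question
    target = torch.zeros(q_joint.size(0), dtype=torch.long)       # positive sits at index 0
    loss_img = F.cross_entropy(torch.cat([pos, neg_img], dim=1), target)
    loss_txt = F.cross_entropy(torch.cat([pos, neg_txt], dim=1), target)
    return loss_img + loss_txt

# Toy usage with random embeddings:
b, d = 4, 256
loss = dual_contrastive_loss(torch.randn(b, d), torch.randn(b, d),
                             torch.randn(b, d), torch.randn(b, d))
```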
4

Sarrouti, Mourad, Asma Ben Abacha, and Dina Demner-Fushman. "Goal-Driven Visual Question Generation from Radiology Images." Information 12, no. 8 (August 20, 2021): 334. http://dx.doi.org/10.3390/info12080334.

Full text
Abstract:
Visual Question Generation (VQG) from images is a rising research topic in both fields of natural language processing and computer vision. Although there are some recent efforts towards generating questions from images in the open domain, the VQG task in the medical domain has not been well-studied so far due to the lack of labeled data. In this paper, we introduce a goal-driven VQG approach for radiology images called VQGRaD that generates questions targeting specific image aspects such as modality and abnormality. In particular, we study generating natural language questions based on the visual content of the image and on additional information such as the image caption and the question category. VQGRaD encodes the dense vectors of different inputs into two latent spaces, which allows generating, for a specific question category, relevant questions about the images, with or without their captions. We also explore the impact of domain knowledge incorporation (e.g., medical entities and semantic types) and data augmentation techniques on visual question generation in the medical domain. Experiments performed on the VQA-RAD dataset of clinical visual questions showed that VQGRaD achieves 61.86% BLEU score and outperforms strong baselines. We also performed a blinded human evaluation of the grammaticality, fluency, and relevance of the generated questions. The human evaluation demonstrated the better quality of VQGRaD outputs and showed that incorporating medical entities improves the quality of the generated questions. Using the test data and evaluation process of the ImageCLEF 2020 VQA-Med challenge, we found that relying on the proposed data augmentation technique to generate new training samples by applying different kinds of transformations, can mitigate the lack of data, avoid overfitting, and bring a substantial improvement in medical VQG.
APA, Harvard, Vancouver, ISO, and other styles
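The abstract above reports question quality with a BLEU score. As a reference point, here is a minimal sketch of how corpus-level BLEU is typically computed with NLTK; the tokenized questions are made-up examples, not items from VQA-RAD.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Hypothetical generated questions and their tokenized references.
references = [
    [["what", "imaging", "modality", "is", "shown", "?"]],
    [["is", "there", "an", "abnormality", "in", "this", "image", "?"]],
]
hypotheses = [
    ["what", "modality", "is", "shown", "?"],
    ["is", "there", "an", "abnormality", "?"],
]

# Smoothing is commonly applied because generated questions are short.
smooth = SmoothingFunction().method1
print(f"BLEU: {corpus_bleu(references, hypotheses, smoothing_function=smooth):.4f}")
```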
5

Pang, Wei, and Xiaojie Wang. "Visual Dialogue State Tracking for Question Generation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11831–38. http://dx.doi.org/10.1609/aaai.v34i07.6856.

Full text
Abstract:
GuessWhat?! is a visual dialogue task between a guesser and an oracle. The guesser aims to locate an object that the oracle has in mind in an image by asking a sequence of Yes/No questions. Asking proper questions as the dialogue progresses is vital for achieving a successful final guess. As a result, the progress of the dialogue should be properly represented and tracked. Previous models for question generation pay little attention to the representation and tracking of dialogue states, and are therefore prone to asking low-quality questions such as repeated questions. This paper proposes a visual dialogue state tracking (VDST)-based method for question generation. A visual dialogue state is defined as the distribution over objects in the image together with representations of those objects. Representations of objects are updated as the distribution over objects changes. An object-difference-based attention is used to decode a new question. The distribution over objects is updated by comparing the question-answer pair with the objects. Experimental results on the GuessWhat?! dataset show that our model significantly outperforms existing methods and achieves new state-of-the-art performance. It is also notable that our model reduces the rate of repeated questions from more than 50% to 21.9% compared with previous state-of-the-art methods.
APA, Harvard, Vancouver, ISO, and other styles
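A minimal sketch of the dialogue-state idea described above: a probability distribution over candidate objects is re-weighted after each question-answer turn by scoring the objects against a QA embedding. The scoring rule and tensor shapes are assumptions for illustration, not the authors' model.

```python
import torch

def update_object_belief(belief, obj_feats, qa_feat):
    """belief: (num_objects,) current distribution over candidate objects.
    obj_feats: (num_objects, dim) object representations.
    qa_feat:   (dim,) embedding of the latest question-answer pair.
    Returns a renormalized distribution (illustrative update rule)."""
    scores = obj_feats @ qa_feat                  # compatibility of each object with the QA turn
    new_belief = belief * torch.softmax(scores, dim=0)
    return new_belief / new_belief.sum()

belief = torch.full((5,), 0.2)                    # uniform prior over 5 candidate objects
belief = update_object_belief(belief, torch.randn(5, 64), torch.randn(64))
print(belief)
```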
6

Kamala, M. "Visual Question Generation from Remote Sensing Images Using Gemini API." International Journal for Research in Applied Science and Engineering Technology 12, no. 3 (March 31, 2024): 2924–29. http://dx.doi.org/10.22214/ijraset.2024.59537.

Full text
Abstract:
Visual question generation from remote sensing images plays a vital role in understanding and extracting information from aerial and satellite imagery. The proposed approach combines Bidirectional Encoder Representations from Transformers (BERT), the Gemini Application Programming Interface (API), and Convolutional Neural Networks (CNNs). First, a CNN extracts high-level features from remote sensing images, capturing the spatial information used to generate questions. Next, the Gemini API integrates contextual understanding into the question-generation process by providing relevant environmental data. Finally, BERT functions as a language model employed to enhance and refine the generated questions by taking both syntax and semantics into account. By combining these techniques, the system generates relevant questions from remote sensing images in an efficient way.
APA, Harvard, Vancouver, ISO, and other styles
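A minimal sketch of generating questions from a remote sensing image with the Gemini API via the google-generativeai client; the model name, prompt, and file name are assumptions, and the CNN feature extraction and BERT refinement steps mentioned in the abstract are omitted.

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")                  # assumed placeholder key
model = genai.GenerativeModel("gemini-1.5-flash")        # assumed multimodal model name

image = Image.open("satellite_tile.png")                 # hypothetical remote sensing image
prompt = ("Generate five diverse questions about the land cover, "
          "objects, and spatial relations visible in this satellite image.")

response = model.generate_content([prompt, image])
print(response.text)
```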
7

Kachare, Atul, Mukesh Kalla, and Ashutosh Gupta. "Visual Question Generation Answering (VQG-VQA) using Machine Learning Models." WSEAS TRANSACTIONS ON SYSTEMS 22 (June 28, 2023): 663–70. http://dx.doi.org/10.37394/23202.2023.22.67.

Full text
Abstract:
The presented automated visual question-answer system generates graphics-based question-answer pairs. The system consists of the Visual Question Generation (VQG) and Visual Question Answering (VQA) modules. VQG generates questions based on visual cues, and VQA provides matching answers to the VQG module. The VQG system generates questions using an LSTM and the VGG19 model, training the parameters and predicting the word with the highest probability at each output step. VQA uses the VGG-19 convolutional neural network for image encoding and embedding, and a multilayer perceptron for high-quality responses. The proposed system reduces the need for human annotation and thus supports the traditional education sector by significantly reducing the human intervention required to generate text queries. The system can be used in interactive interfaces to help young children learn.
APA, Harvard, Vancouver, ISO, and other styles
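A minimal sketch of the VGG19-encoder / LSTM-decoder pairing named in this abstract, using PyTorch and torchvision; the vocabulary size, hidden sizes, and the way the image feature is fed to the decoder are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class SimpleVQG(nn.Module):
    """Encode an image with VGG19 features, then decode a question with an LSTM."""
    def __init__(self, vocab_size=5000, embed_dim=256, hidden_dim=512):
        super().__init__()
        vgg = models.vgg19(weights=None)                 # pass weights=... to load pretrained
        self.encoder = vgg.features
        self.img_proj = nn.Linear(512 * 7 * 7, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, question_tokens):
        feats = self.encoder(images).flatten(1)          # (B, 512*7*7) for 224x224 inputs
        img_emb = self.img_proj(feats).unsqueeze(1)      # (B, 1, embed_dim)
        tok_emb = self.embed(question_tokens)            # (B, T, embed_dim)
        seq = torch.cat([img_emb, tok_emb], dim=1)       # prepend image as the first "token"
        hidden, _ = self.lstm(seq)
        return self.out(hidden[:, 1:])                   # next-word logits for each question token

model = SimpleVQG()
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 5000, (2, 10)))
```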
8

Zhu, He, Ren Togo, Takahiro Ogawa, and Miki Haseyama. "Diversity Learning Based on Multi-Latent Space for Medical Image Visual Question Generation." Sensors 23, no. 3 (January 17, 2023): 1057. http://dx.doi.org/10.3390/s23031057.

Full text
Abstract:
Auxiliary clinical diagnosis has been researched to solve unevenly and insufficiently distributed clinical resources. However, auxiliary diagnosis is still dominated by human physicians, and how to make intelligent systems more involved in the diagnosis process is gradually becoming a concern. An interactive automated clinical diagnosis with a question-answering system and a question generation system can capture a patient’s conditions from multiple perspectives with less physician involvement by asking different questions to drive and guide the diagnosis. This clinical diagnosis process requires diverse information to evaluate a patient from different perspectives to obtain an accurate diagnosis. Recently proposed medical question generation systems have not considered diversity. Thus, we propose a diversity learning-based visual question generation model using a multi-latent space to generate informative question sets from medical images. The proposed method generates various questions by embedding visual and language information in different latent spaces, whose diversity is trained by our newly proposed loss. We have also added control over the categories of generated questions, making the generated questions directional. Furthermore, we use a new metric named similarity to accurately evaluate the proposed model’s performance. The experimental results on the Slake and VQA-RAD datasets demonstrate that the proposed method can generate questions with diverse information. Our model works with an answering model for interactive automated clinical diagnosis and generates datasets to replace the process of annotation that incurs huge labor costs.
APA, Harvard, Vancouver, ISO, and other styles
9

Boukhers, Zeyd, Timo Hartmann, and Jan Jürjens. "COIN: Counterfactual Image Generation for Visual Question Answering Interpretation." Sensors 22, no. 6 (March 14, 2022): 2245. http://dx.doi.org/10.3390/s22062245.

Full text
Abstract:
Due to the significant advancement of Natural Language Processing and Computer Vision-based models, Visual Question Answering (VQA) systems are becoming more intelligent and advanced. However, they are still error-prone when dealing with relatively complex questions. Therefore, it is important to understand the behaviour of the VQA models before adopting their results. In this paper, we introduce an interpretability approach for VQA models by generating counterfactual images. Specifically, the generated image is supposed to have the minimal possible change to the original image and leads the VQA model to give a different answer. In addition, our approach ensures that the generated image is realistic. Since quantitative metrics cannot be employed to evaluate the interpretability of the model, we carried out a user study to assess different aspects of our approach. In addition to interpreting the result of VQA models on single images, the obtained results and the discussion provides an extensive explanation of VQA models’ behaviour.
APA, Harvard, Vancouver, ISO, and other styles
10

Guo, Zihan, Dezhi Han, and Kuan-Ching Li. "Double-layer affective visual question answering network." Computer Science and Information Systems, no. 00 (2020): 38. http://dx.doi.org/10.2298/csis200515038g.

Full text
Abstract:
Visual Question Answering (VQA) has attracted much attention recently in both the natural language processing and computer vision communities, as it offers insight into the relationships between two relevant sources of information. Tremendous advances have been seen in the field of VQA due to the success of deep learning. Building upon these advances, the Affective Visual Question Answering Network (AVQAN) enriches the understanding and analysis of VQA models by making use of the emotional information contained in images to produce sensitive answers, while maintaining the same level of accuracy as ordinary VQA baseline models. Integrating the emotional information contained in images into VQA is a reasonably new task. However, it is challenging to separate question-guided attention from mood-guided attention due to the concatenation of the question words and the mood labels in AVQAN, and this type of concatenation is believed to harm the performance of the model. To mitigate this effect, we propose the Double-Layer Affective Visual Question Answering Network (DAVQAN), which divides the task of generating emotional answers in VQA into two simpler subtasks, the generation of non-emotional responses and the production of mood labels, and utilizes two independent layers to tackle these subtasks. Comparative experiments conducted on a preprocessed dataset show that the overall performance of DAVQAN is 7.6% higher than that of AVQAN, demonstrating the effectiveness of the proposed model. We also introduce a more advanced word embedding method and a more fine-grained image feature extractor into AVQAN and DAVQAN to further improve their performance and obtain better results than the original models, which shows that VQA integrated with affective computing can improve overall performance by improving these two modules, just as in general VQA.
APA, Harvard, Vancouver, ISO, and other styles
11

Shridhar, Mohit, Dixant Mittal, and David Hsu. "INGRESS: Interactive visual grounding of referring expressions." International Journal of Robotics Research 39, no. 2-3 (January 2, 2020): 217–32. http://dx.doi.org/10.1177/0278364919897133.

Full text
Abstract:
This article presents INGRESS, a robot system that follows human natural language instructions to pick and place everyday objects. The key question here is to ground referring expressions: understand expressions about objects and their relationships from image and natural language inputs. INGRESS allows unconstrained object categories and rich language expressions. Further, it asks questions to clarify ambiguous referring expressions interactively. To achieve these, we take the approach of grounding by generation and propose a two-stage neural-network model for grounding. The first stage uses a neural network to generate visual descriptions of objects, compares them with the input language expressions, and identifies a set of candidate objects. The second stage uses another neural network to examine all pairwise relations between the candidates and infers the most likely referred objects. The same neural networks are used for both grounding and question generation for disambiguation. Experiments show that INGRESS outperformed a state-of-the-art method on the RefCOCO dataset and in robot experiments with humans. The INGRESS source code is available at https://github.com/MohitShridhar/ingress .
APA, Harvard, Vancouver, ISO, and other styles
12

Kim, Incheol. "Visual Experience-Based Question Answering with Complex Multimodal Environments." Mathematical Problems in Engineering 2020 (November 19, 2020): 1–18. http://dx.doi.org/10.1155/2020/8567271.

Full text
Abstract:
This paper proposes a novel visual experience-based question answering problem (VEQA) and the corresponding dataset for embodied intelligence research that requires an agent to do actions, understand 3D scenes from successive partial input images, and answer natural language questions about its visual experiences in real time. Unlike the conventional visual question answering (VQA), the VEQA problem assumes both partial observability and dynamics of a complex multimodal environment. To address this VEQA problem, we propose a hybrid visual question answering system, VQAS, integrating a deep neural network-based scene graph generation model and a rule-based knowledge reasoning system. The proposed system can generate more accurate scene graphs for dynamic environments with some uncertainty. Moreover, it can answer complex questions through knowledge reasoning with rich background knowledge. Results of experiments using a photo-realistic 3D simulated environment, AI2-THOR, and the VEQA benchmark dataset prove the high performance of the proposed system.
APA, Harvard, Vancouver, ISO, and other styles
13

Singh, Anjali, Ruhi Sharma Mittal, Shubham Atreja, Mourvi Sharma, Seema Nagar, Prasenjit Dey, and Mohit Jain. "Automatic Generation of Leveled Visual Assessments for Young Learners." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 9713–20. http://dx.doi.org/10.1609/aaai.v33i01.33019713.

Full text
Abstract:
Images are an essential tool for communicating with children, particularly at younger ages when they are still developing their emergent literacy skills. Hence, assessments that use images to assess their conceptual knowledge and visual literacy, are an important component of their learning process. Creating assessments at scale is a challenging task, which has led to several techniques being proposed for automatic generation of textual assessments. However, none of them focuses on generating image-based assessments. To understand the manual process of creating visual assessments, we interviewed primary school teachers. Based on the findings from the preliminary study, we present a novel approach which uses image semantics to generate visual multiple choice questions (VMCQs) for young learners, wherein options are presented in the form of images. We propose a metric to measure the semantic similarity between two images, which we use to identify the four options – one answer and three distractor images – for a given question. We also use this metric for generating VMCQs at two difficulty levels – easy and hard. Through a quantitative evaluation, we show that the system-generated VMCQs are comparable to VMCQs created by experts, hence establishing the effectiveness of our approach.
APA, Harvard, Vancouver, ISO, and other styles
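The approach above rests on a metric for semantic similarity between two images, used to pick the answer and distractor options. A minimal sketch that approximates this with cosine similarity between pretrained CNN embeddings is shown below; the ResNet-18 backbone and the distractor-selection note are assumptions, as the paper defines its own image-semantics metric.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# A pretrained backbone with its classification head removed acts as an image embedder.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def image_similarity(img_a, img_b):
    """img_a, img_b: preprocessed (1, 3, 224, 224) tensors; returns cosine similarity."""
    return F.cosine_similarity(backbone(img_a), backbone(img_b)).item()

# Toy call with random tensors (real use would preprocess actual images):
print(image_similarity(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)))

# Distractor selection could then pick candidate images whose similarity to the
# answer image is high but not highest (harder distractors for the 'hard' level).
```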
14

Kim, Jung-Jun, Dong-Gyu Lee, Jialin Wu, Hong-Gyu Jung, and Seong-Whan Lee. "Visual question answering based on local-scene-aware referring expression generation." Neural Networks 139 (July 2021): 158–67. http://dx.doi.org/10.1016/j.neunet.2021.02.001.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Liu, Yuhang, Daowan Peng, Wei Wei, Yuanyuan Fu, Wenfeng Xie, and Dangyang Chen. "Detection-Based Intermediate Supervision for Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 12 (March 24, 2024): 14061–68. http://dx.doi.org/10.1609/aaai.v38i12.29315.

Full text
Abstract:
Recently, neural module networks (NMNs) have yielded ongoing success in answering compositional visual questions, especially those involving multi-hop visual and logical reasoning. NMNs decompose the complex question into several sub-tasks using instance-modules from the reasoning paths of that question and then exploit intermediate supervisions to guide answer prediction, thereby improving inference interpretability. However, their performance may be hindered due to sketchy modeling of intermediate supervisions. For instance, (1) a prior assumption that each instance-module refers to only one grounded object yet overlooks other potentially associated grounded objects, impeding full cross-modal alignment learning; (2) IoU-based intermediate supervisions may introduce noise signals as the bounding box overlap issue might guide the model's focus towards irrelevant objects. To address these issues, a novel method, Detection-based Intermediate Supervision (DIS), is proposed, which adopts a generative detection framework to facilitate multiple grounding supervisions via sequence generation. As such, DIS offers more comprehensive and accurate intermediate supervisions, thereby boosting answer prediction performance. Furthermore, by considering intermediate results, DIS enhances the consistency in answering compositional questions and their sub-questions. Extensive experiments demonstrate the superiority of our proposed DIS, showcasing both improved accuracy and state-of-the-art reasoning consistency compared to prior approaches.
APA, Harvard, Vancouver, ISO, and other styles
16

Ghosh, Akash, Arkadeep Acharya, Raghav Jain, Sriparna Saha, Aman Chadha, and Setu Sinha. "CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization in Healthcare." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 20 (March 24, 2024): 22031–39. http://dx.doi.org/10.1609/aaai.v38i20.30206.

Full text
Abstract:
In the era of modern healthcare, swiftly generating medical question summaries is crucial for informed and timely patient care. Despite the increasing complexity and volume of medical data, existing studies have focused solely on text-based summarization, neglecting the integration of visual information. Recognizing the untapped potential of combining textual queries with visual representations of medical conditions, we introduce the Multimodal Medical Question Summarization (MMQS) Dataset. This dataset, a major contribution of our work, pairs medical queries with visual aids, facilitating a richer and more nuanced understanding of patient needs. We also propose a framework, utilizing the power of Contrastive Language Image Pretraining(CLIP) and Large Language Models(LLMs), consisting of four modules that identify medical disorders, generate relevant context, filter medical concepts, and craft visually aware summaries. Our comprehensive framework harnesses the power of CLIP, a multimodal foundation model, and various general-purpose LLMs, comprising four main modules: the medical disorder identification module, the relevant context generation module, the context filtration module for distilling relevant medical concepts and knowledge, and finally, a general-purpose LLM to generate visually aware medical question summaries. Leveraging our MMQS dataset, we showcase how visual cues from images enhance the generation of medically nuanced summaries. This multimodal approach not only enhances the decision-making process in healthcare but also fosters a more nuanced understanding of patient queries, laying the groundwork for future research in personalized and responsive medical care. Disclaimer: The article features graphic medical imagery, a result of the subject's inherent requirements.
APA, Harvard, Vancouver, ISO, and other styles
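A minimal sketch of the first module the abstract describes, matching a medical query image against candidate disorder labels with CLIP, using the Hugging Face transformers API; the checkpoint, label list, and file name are illustrative assumptions, not the authors' pipeline.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("skin_lesion.jpg")                    # hypothetical query image
candidate_disorders = ["eczema", "psoriasis", "acne", "melanoma"]

inputs = processor(text=candidate_disorders, images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)         # similarity of image to each label
print(dict(zip(candidate_disorders, probs[0].tolist())))
```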
17

Zhang, Weifeng, Jing Yu, Wenhong Zhao, and Chuan Ran. "DMRFNet: Deep Multimodal Reasoning and Fusion for Visual Question Answering and explanation generation." Information Fusion 72 (August 2021): 70–79. http://dx.doi.org/10.1016/j.inffus.2021.02.006.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Zhang, Lizong, Haojun Yin, Bei Hui, Sijuan Liu, and Wei Zhang. "Knowledge-Based Scene Graph Generation with Visual Contextual Dependency." Mathematics 10, no. 14 (July 20, 2022): 2525. http://dx.doi.org/10.3390/math10142525.

Full text
Abstract:
Scene graph generation is the basis of various computer vision applications, including image retrieval, visual question answering, and image captioning. Previous studies have relied on visual features or incorporated auxiliary information to predict object relationships. However, the rich semantics of external knowledge have not yet been fully utilized, and the combination of visual and auxiliary information can lead to visual dependencies, which impacts relationship prediction among objects. Therefore, we propose a novel knowledge-based model with adjustable visual contextual dependency. Our model has three key components. The first module extracts the visual features and bounding boxes in the input image. The second module uses two encoders to fully integrate visual information and external knowledge. Finally, visual context loss and visual relationship loss are introduced to adjust the visual dependency of the model. The difference between the initial prediction results and the visual dependency results is calculated to generate the dependency-corrected results. The proposed model can obtain better global and contextual information for predicting object relationships, and the visual dependencies can be adjusted through the two loss functions. The results of extensive experiments show that our model outperforms most existing methods.
APA, Harvard, Vancouver, ISO, and other styles
19

Zhu, He, Ren Togo, Takahiro Ogawa, and Miki Haseyama. "Multimodal Natural Language Explanation Generation for Visual Question Answering Based on Multiple Reference Data." Electronics 12, no. 10 (May 10, 2023): 2183. http://dx.doi.org/10.3390/electronics12102183.

Full text
Abstract:
As deep learning research continues to advance, interpretability is becoming as important as model performance. Conducting interpretability studies to understand the decision-making processes of deep learning models can improve performance and provide valuable insights for humans. The interpretability of visual question answering (VQA), a crucial task for human–computer interaction, has garnered the attention of researchers due to its wide range of applications. The generation of natural language explanations for VQA that humans can better understand has gradually supplanted heatmap representations as the mainstream focus in the field. Humans typically answer questions by first identifying the primary objects in an image and then referring to various information sources, both within and beyond the image, including prior knowledge. However, previous studies have only considered input images, resulting in insufficient information that can lead to incorrect answers and implausible explanations. To address this issue, we introduce multiple references in addition to the input image. Specifically, we propose a multimodal model that generates natural language explanations for VQA. We introduce outside knowledge using the input image and question and incorporate object information into the model through an object detection module. By increasing the information available during the model generation process, we significantly improve VQA accuracy and the reliability of the generated explanations. Moreover, we employ a simple and effective feature fusion joint vector to combine information from multiple modalities while maximizing information preservation. Qualitative and quantitative evaluation experiments demonstrate that the proposed method can generate more reliable explanations than state-of-the-art methods while maintaining answering accuracy.
APA, Harvard, Vancouver, ISO, and other styles
20

Kruchinin, Vladimir, and Vladimir Kuzovkin. "Overview of Existing Methods for Automatic Generation of Tasks with Conditions in Natural Language." Computer tools in education, no. 1 (March 28, 2022): 85–96. http://dx.doi.org/10.32603/2071-2340-2022-1-85-96.

Full text
Abstract:
The paper considers the main algorithms for generating school problems of closed and open types across various subjects. Some of these algorithms (e.g., question answering, visual question answering) use artificial intelligence and some do not (e.g., AND/OR trees, templates). It is shown that methods for generating tests using artificial intelligence have high potential, but they require further development, in particular the creation of a large question-answer database in the Russian language.
APA, Harvard, Vancouver, ISO, and other styles
21

Li, Xiaochuan, Baoyu Fan, Runze Zhang, Liang Jin, Di Wang, Zhenhua Guo, Yaqian Zhao, and Rengang Li. "Image Content Generation with Causal Reasoning." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 12 (March 24, 2024): 13646–54. http://dx.doi.org/10.1609/aaai.v38i12.29269.

Full text
Abstract:
The emergence of ChatGPT has once again sparked research in generative artificial intelligence (GAI). While people have been amazed by the generated results, they have also noticed the reasoning potential reflected in the generated textual content. However, this current ability for causal reasoning is primarily limited to the domain of language generation, such as in models like GPT-3. In visual modality, there is currently no equivalent research. Considering causal reasoning in visual content generation is significant. This is because visual information contains infinite granularity. Particularly, images can provide more intuitive and specific demonstrations for certain reasoning tasks, especially when compared to coarse-grained text. Hence, we propose a new image generation task called visual question answering with image (VQAI) and establish a dataset of the same name based on the classic Tom and Jerry animated series. Additionally, we develop a new paradigm for image generation to tackle the challenges of this task. Finally, we perform extensive experiments and analyses, including visualizations of the generated content and discussions on the potentials and limitations. The code and data are publicly available under the license of CC BY-NC-SA 4.0 for academic and non-commercial usage at: https://github.com/IEIT-AGI/MIX-Shannon/blob/main/projects/VQAI/lgd_vqai.md.
APA, Harvard, Vancouver, ISO, and other styles
22

Tanaka, Ryota, Kyosuke Nishida, and Sen Yoshida. "VisualMRC: Machine Reading Comprehension on Document Images." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 15 (May 18, 2021): 13878–88. http://dx.doi.org/10.1609/aaai.v35i15.17635.

Full text
Abstract:
Recent studies on machine reading comprehension have focused on text-level understanding but have not yet reached the level of human understanding of the visual layout and content of real-world documents. In this study, we introduce a new visual machine reading comprehension dataset, named VisualMRC, wherein given a question and a document image, a machine reads and comprehends texts in the image to answer the question in natural language. Compared with existing visual question answering datasets that contain texts in images, VisualMRC focuses more on developing natural language understanding and generation abilities. It contains 30,000+ pairs of a question and an abstractive answer for 10,000+ document images sourced from multiple domains of webpages. We also introduce a new model that extends existing sequence-to-sequence models, pre-trained with large-scale text corpora, to take into account the visual layout and content of documents. Experiments with VisualMRC show that this model outperformed the base sequence-to-sequence models and a state-of-the-art VQA model. However, its performance is still below that of humans on most automatic evaluation metrics. The dataset will facilitate research aimed at connecting vision and language understanding.
APA, Harvard, Vancouver, ISO, and other styles
23

Wörgötter, Florentin, Ernst Niebur, and Christof Koch. "Generation of Direction Selectivity by Isotropic Intracortical Connections." Neural Computation 4, no. 3 (May 1992): 332–40. http://dx.doi.org/10.1162/neco.1992.4.3.332.

Full text
Abstract:
To what extent do the mechanisms generating different receptive field properties of neurons depend on each other? We investigated this question theoretically within the context of orientation and direction tuning of simple cells in the mammalian visual cortex. In our model a cortical cell of the "simple" type receives its orientation tuning by afferent convergence of aligned receptive fields of the lateral geniculate nucleus (Hubel and Wiesel 1962). We sharpen this orientation bias by postulating a special type of radially symmetric long-range lateral inhibition called circular inhibition. Surprisingly, this isotropic mechanism leads to the emergence of a strong bias for the direction of motion of a bar. We show that this directional anisotropy is neither caused by the probabilistic nature of the connections nor is it a consequence of the specific columnar structure chosen but that it is an inherent feature of the architecture of visual cortex.
APA, Harvard, Vancouver, ISO, and other styles
24

Wang, Junjue, Zhuo Zheng, Zihang Chen, Ailong Ma, and Yanfei Zhong. "EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 6 (March 24, 2024): 5481–89. http://dx.doi.org/10.1609/aaai.v38i6.28357.

Full text
Abstract:
Earth vision research typically focuses on extracting geospatial object locations and categories but neglects the exploration of relations between objects and comprehensive reasoning. Based on city planning needs, we develop a multi-modal multi-task VQA dataset (EarthVQA) to advance relational reasoning-based judging, counting, and comprehensive analysis. The EarthVQA dataset contains 6000 images, corresponding semantic masks, and 208,593 QA pairs with urban and rural governance requirements embedded. As objects are the basis for complex relational reasoning, we propose a Semantic OBject Awareness framework (SOBA) to advance VQA in an object-centric way. To preserve refined spatial locations and semantics, SOBA leverages a segmentation network for object semantics generation. The object-guided attention aggregates object interior features via pseudo masks, and bidirectional cross-attention further models object external relations hierarchically. To optimize object counting, we propose a numerical difference loss that dynamically adds difference penalties, unifying the classification and regression tasks. Experimental results show that SOBA outperforms both advanced general and remote sensing methods. We believe this dataset and framework provide a strong benchmark for Earth vision's complex analysis. The project page is at https://Junjue-Wang.github.io/homepage/EarthVQA.
APA, Harvard, Vancouver, ISO, and other styles
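A rough sketch of the counting-loss idea mentioned above: treat counting as classification over discrete count values but add a penalty that grows with the numeric distance between the predicted and true counts. The expected-absolute-difference form and the weighting factor are assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def numerical_difference_loss(logits, target_counts, alpha=0.1):
    """logits: (batch, max_count+1) classification scores over count values.
    target_counts: (batch,) integer ground-truth counts.
    Combines cross-entropy with an expected-absolute-difference penalty."""
    ce = F.cross_entropy(logits, target_counts)
    probs = logits.softmax(dim=-1)
    counts = torch.arange(logits.size(1), device=logits.device).float()
    expected_diff = (probs * (counts - target_counts.float().unsqueeze(1)).abs()).sum(dim=1)
    return ce + alpha * expected_diff.mean()

# Toy usage: 4 samples, counts from 0 to 10.
loss = numerical_difference_loss(torch.randn(4, 11), torch.randint(0, 11, (4,)))
print(loss)
```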
25

Abrecht, Stephanie, Lydia Gauerhof, Christoph Gladisch, Konrad Groh, Christian Heinzemann, and Matthias Woehrle. "Testing Deep Learning-based Visual Perception for Automated Driving." ACM Transactions on Cyber-Physical Systems 5, no. 4 (October 31, 2021): 1–28. http://dx.doi.org/10.1145/3450356.

Full text
Abstract:
Due to the impressive performance of deep neural networks (DNNs) for visual perception, there is an increased demand for their use in automated systems. However, to use deep neural networks in practice, novel approaches are needed, e.g., for testing. In this work, we focus on the question of how to test deep learning-based visual perception functions for automated driving. Classical approaches for testing are not sufficient: a purely statistical approach based on a dataset split is not enough, as testing needs to address various purposes and not only average-case performance. Additionally, a complete specification is elusive due to the complexity of the perception task in the open context of automated driving. In this article, we review and discuss existing work on testing DNNs for visual perception, with a special focus on automated driving, covering test input and test oracle generation as well as test adequacy. We conclude that testing of DNNs in this domain requires several diverse test sets. We show how such test sets can be constructed for different purposes based on the presented methods and identify open research questions.
APA, Harvard, Vancouver, ISO, and other styles
26

Cheng, Zesen, Kehan Li, Peng Jin, Siheng Li, Xiangyang Ji, Li Yuan, Chang Liu, and Jie Chen. "Parallel Vertex Diffusion for Unified Visual Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 2 (March 24, 2024): 1326–34. http://dx.doi.org/10.1609/aaai.v38i2.27896.

Full text
Abstract:
Unified visual grounding (UVG) capitalizes on a wealth of task-related knowledge across various grounding tasks via one-shot training, which curtails retraining costs and task-specific architecture design efforts. Vertex generation-based UVG methods achieve this versatility by unified modeling object box and contour prediction and provide a text-powered interface to vast related multi-modal tasks, e.g., visual question answering and captioning. However, these methods typically generate vertexes sequentially through autoregression, which is prone to be trapped in error accumulation and heavy computation, especially for high-dimension sequence generation in complex scenarios. In this paper, we develop Parallel Vertex Diffusion (PVD) based on the parallelizability of diffusion models to accurately and efficiently generate vertexes in a parallel and scalable manner. Since the coordinates fluctuate greatly, it typically encounters slow convergence when training diffusion models without geometry constraints. Therefore, we consummate our PVD by two critical components, i.e., center anchor mechanism and angle summation loss, which serve to normalize coordinates and adopt a differentiable geometry descriptor from the point-in-polygon problem of computational geometry to constrain the overall difference of prediction and label vertexes. These innovative designs empower our PVD to demonstrate its superiority with state-of-the-art performance across various grounding tasks.
APA, Harvard, Vancouver, ISO, and other styles
27

Khademi, Mahmoud, and Oliver Schulte. "Deep Generative Probabilistic Graph Neural Networks for Scene Graph Generation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11237–45. http://dx.doi.org/10.1609/aaai.v34i07.6783.

Full text
Abstract:
We propose a new algorithm, called Deep Generative Probabilistic Graph Neural Networks (DG-PGNN), to generate a scene graph for an image. The input to DG-PGNN is an image, together with a set of region-grounded captions and object bounding-box proposals for the image. To generate the scene graph, DG-PGNN constructs and updates a new model, called a Probabilistic Graph Network (PGN). A PGN can be thought of as a scene graph with uncertainty: it represents each node and each edge by a CNN feature vector and defines a probability mass function (PMF) for node-type (object category) of each node and edge-type (predicate class) of each edge. The DG-PGNN sequentially adds a new node to the current PGN by learning the optimal ordering in a Deep Q-learning framework, where states are partial PGNs, actions choose a new node, and rewards are defined based on the ground-truth. After adding a node, DG-PGNN uses message passing to update the feature vectors of the current PGN by leveraging contextual relationship information, object co-occurrences, and language priors from captions. The updated features are then used to fine-tune the PMFs. Our experiments show that the proposed algorithm significantly outperforms the state-of-the-art results on the Visual Genome dataset for scene graph generation. We also show that the scene graphs constructed by DG-PGNN improve performance on the visual question answering task, for questions that need reasoning about objects and their interactions in the scene context.
APA, Harvard, Vancouver, ISO, and other styles
28

BELZ, A., T. L. BERG, and L. YU. "From image to language and back again." Natural Language Engineering 24, no. 3 (April 23, 2018): 325–62. http://dx.doi.org/10.1017/s1351324918000086.

Full text
Abstract:
Work in computer vision and natural language processing involving images and text has been experiencing explosive growth over the past decade, with a particular boost coming from the neural network revolution. The present volume brings together five research articles from several different corners of the area: multilingual multimodal image description (Frank et al.), multimodal machine translation (Madhyastha et al., Frank et al.), image caption generation (Madhyastha et al., Tanti et al.), visual scene understanding (Silberer et al.), and multimodal learning of high-level attributes (Sorodoc et al.). In this article, we touch upon all of these topics as we review work involving images and text under the three main headings of image description (Section 2), visually grounded referring expression generation (REG) and comprehension (Section 3), and visual question answering (VQA) (Section 4).
APA, Harvard, Vancouver, ISO, and other styles
29

Liu, Xiulong, Sudipta Paul, Moitreya Chatterjee, and Anoop Cherian. "CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 4 (March 24, 2024): 3765–73. http://dx.doi.org/10.1609/aaai.v38i4.28167.

Full text
Abstract:
Audio-visual navigation of an agent towards locating an audio goal is a challenging task especially when the audio is sporadic or the environment is noisy. In this paper, we present CAVEN, a Conversation-based Audio-Visual Embodied Navigation framework in which the agent may interact with a human/oracle for solving the task of navigating to an audio goal. Specifically, CAVEN is modeled as a budget-aware partially observable semi-Markov decision process that implicitly learns the uncertainty in the audio-based navigation policy to decide when and how the agent may interact with the oracle. Our CAVEN agent can engage in fully-bidirectional natural language conversations by producing relevant questions and interpret free-form, potentially noisy responses from the oracle based on the audio-visual context. To enable such a capability, CAVEN is equipped with: i) a trajectory forecasting network that is grounded in audio-visual cues to produce a potential trajectory to the estimated goal, and (ii) a natural language based question generation and reasoning network to pose an interactive question to the oracle or interpret the oracle's response to produce navigation instructions. To train the interactive modules, we present a large scale dataset: AVN-Instruct, based on the Landmark-RxR dataset. To substantiate the usefulness of conversations, we present experiments on the benchmark audio-goal task using the SoundSpaces simulator under various noisy settings. Our results reveal that our fully-conversational approach leads to nearly an order-of-magnitude improvement in success rate, especially in localizing new sound sources and against methods that use only uni-directional interaction.
APA, Harvard, Vancouver, ISO, and other styles
30

Zhou, Luowei, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. "Unified Vision-Language Pre-Training for Image Captioning and VQA." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 13041–49. http://dx.doi.org/10.1609/aaai.v34i07.7005.

Full text
Abstract:
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
APA, Harvard, Vancouver, ISO, and other styles
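The abstract notes that the bidirectional and seq2seq pre-training objectives differ only in the self-attention mask used by the shared transformer. A minimal sketch of a seq2seq-style mask, in which source (image) positions attend bidirectionally while target (text) positions attend to the source and to earlier targets, is given below; the layout is an illustrative assumption.

```python
import torch

def seq2seq_attention_mask(num_src, num_tgt):
    """Return a (num_src+num_tgt, num_src+num_tgt) boolean mask where True = may attend.
    Source positions (e.g., image regions) see all source positions;
    target positions (caption/answer tokens) see the source plus earlier targets."""
    n = num_src + num_tgt
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, :num_src] = True                                   # everyone attends to the source
    mask[num_src:, num_src:] = torch.tril(torch.ones(num_tgt, num_tgt, dtype=torch.bool))
    return mask

# 3 image-region positions followed by 4 text positions:
print(seq2seq_attention_mask(3, 4).int())
```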
31

Katz, Chaim N., Kramay Patel, Omid Talakoub, David Groppe, Kari Hoffman, and Taufik A. Valiante. "Differential Generation of Saccade, Fixation, and Image-Onset Event-Related Potentials in the Human Mesial Temporal Lobe." Cerebral Cortex 30, no. 10 (June 4, 2020): 5502–16. http://dx.doi.org/10.1093/cercor/bhaa132.

Full text
Abstract:
Event-related potentials (ERPs) are a commonly used electrophysiological signature for studying mesial temporal lobe (MTL) function during visual memory tasks. The ERPs associated with the onset of visual stimuli (image-onset) and eye movements (saccades and fixations) provide insights into the mechanisms of their generation. We hypothesized that since eye movements and image-onset provide MTL structures with salient visual information, perhaps they both engage similar neural mechanisms. To explore this question, we used intracranial electroencephalographic data from the MTLs of 11 patients with medically refractory epilepsy who participated in a visual search task. We characterized the electrophysiological responses of MTL structures to saccades, fixations, and image-onset. We demonstrated that the image-onset response is an evoked/additive response with a low-frequency power increase. In contrast, ERPs following eye movements appeared to arise from phase resetting of higher frequencies than the image-onset ERP. Intriguingly, this reset was associated with saccade onset and not termination (fixation), suggesting it is likely the MTL response to a corollary discharge, rather than a response to visual stimulation. We discuss the distinct mechanistic underpinnings of these responses which shed light on the underlying neural circuitry involved in visual memory processing.
APA, Harvard, Vancouver, ISO, and other styles
32

Reddy, Revant Gangi, Xilin Rui, Manling Li, Xudong Lin, Haoyang Wen, Jaemin Cho, Lifu Huang, et al. "MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 11200–11208. http://dx.doi.org/10.1609/aaai.v36i10.21370.

Full text
Abstract:
Recently, there has been an increasing interest in building question answering (QA) models that reason across multiple modalities, such as text and images. However, QA using images is often limited to just picking the answer from a pre-defined set of options. In addition, images in the real world, especially in news, have objects that are co-referential to the text, with complementary information from both modalities. In this paper, we present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text. Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question. In addition, we introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task. We evaluate both pipeline-based and end-to-end pretraining-based multimedia QA models on our benchmark, and show that they achieve promising performance, while considerably lagging behind human performance hence leaving large room for future work on this challenging new task.
APA, Harvard, Vancouver, ISO, and other styles
33

Sejati, Sadewa Purba, and Ifnu Rifki Nurhidayanto. "Peningkatan Literasi Sumber Daya Air Tanah Menggunakan Media Interaktif Berbasis Android." Dinamisia : Jurnal Pengabdian Kepada Masyarakat 6, no. 6 (December 30, 2022): 1454–60. http://dx.doi.org/10.31849/dinamisia.v6i6.11118.

Full text
Abstract:
Groundwater is one of the elements of the geosphere that plays an important role in achieving sustainable development. Decreases in the quantity and quality of groundwater are very likely to occur due to intensive anthropogenic dynamics that often ignore environmental rules. This neglect of environmental rules, which has the potential to reduce the quantity and quality of groundwater resources, is caused by a lack of literacy and knowledge of groundwater science. Groundwater resource literacy needs to reach all levels of society, especially the younger generation as the successor of sustainable development, so that groundwater sustainability is maintained. Literacy resources for increasing insight need to contain visual elements, animations, and descriptions, and should be accessible on Android-based smartphones. Solutions to the partners' problems were realized through training, discussion, and question-and-answer sessions. The training activities provided an understanding of how to download, install, and use the Groundwater App with a smartphone. The discussion and question-and-answer activities addressed the visual and interactive substance presented by the application. The activities carried out have increased the younger generation's insight into groundwater resources.
APA, Harvard, Vancouver, ISO, and other styles
34

Oetken, L. "β CrB – a Rosetta Stone?" International Astronomical Union Colloquium 90 (1986): 355–58. http://dx.doi.org/10.1017/s025292110009179x.

Full text
Abstract:
Combining the information from the speckle interferometric and spectroscopic binary β CrB, a mass of 1.82 solar masses and an absolute visual magnitude Mv = 1m.42 are found, indicating that the star may be in the state of evolution in which the stellar core has shrunk after hydrogen exhaustion and the energy generation comes mainly from the envelope. The question of whether all magnetic Ap stars are in that special evolutionary state is revived.
APA, Harvard, Vancouver, ISO, and other styles
35

Rannula, Kateriina, Elle Sõrmus, and Siret Piirsalu. "GENERATION Z IN HIGHER EDUCATION – INVESTIGATING THE PREFERRED MEDIUM OF TEXT IN ACADEMIC READING." EPH - International Journal of Educational Research 4, no. 3 (November 18, 2020): 1–6. http://dx.doi.org/10.53555/ephijer.v4i3.67.

Full text
Abstract:
According to Nugin et al. (2016), it is not always easy to determine the exact birth dates of a generation, as the transitions are difficult to pinpoint. According to Seemiller and Grace (2015), the representatives of Generation Z were born between 1995 and 2010 and have already entered higher education. Reading online has become one of the most widely used sources of knowledge for learners, especially in academic contexts (Zarrabi, 2015); thus it is not surprising that visual media may have the most substantial impact on Generation Z, for whom the internet and smart devices are the main means of communication, making it possible for every question to be answered immediately. Teachers nowadays face the challenging task of meeting the information needs and preferences of the new generation while belonging to the previous generation themselves. In order to shed light on the process of choosing the most suitable methods, materials, and devices for learners of Generation Z, a research project was started at Tallinn Health Care College with the aim of investigating different generations' text-searching and reading strategies. The current article focuses on describing the preferred medium of text among representatives of Generation Z. An online semi-structured questionnaire was administered and a descriptive analysis conducted. The results direct the authors toward further research investigating strategies for working with texts, keeping in mind the characteristics of Generation Z, in order to use these mediums and strategies to create effective learning possibilities in higher education.
APA, Harvard, Vancouver, ISO, and other styles
36

Gladston, Angelin, and Deeban Balaji. "Semantic Attention Network for Image Captioning and Visual Question Answering Based on Image High-Level Semantic Attributes." International Journal of Big Data Intelligence and Applications 3, no. 1 (January 1, 2022): 1–18. http://dx.doi.org/10.4018/ijbdia.313201.

Full text
Abstract:
The main challenges in vision-to-language (V2L) systems are generating captions, giving proper, meaningful answers to questions, and extracting even the minute details from an image. The main contribution of this paper is an approach based on high-level semantic image attributes and local image features that addresses the challenges of V2L tasks. In particular, the high-level semantic attribute information is used to reduce the semantic gap between images and text. A novel semantic attention network is designed to explore the mapping relationships between semantic attributes and image regions. The semantic attention network highlights the concept-related regions and selects the region-related concepts. Two specific V2L tasks, image captioning and VQA, are addressed by the proposed approach. An improved BLEU score shows that the proposed image captioning performs well. The experimental results show that the proposed model is effective for V2L tasks.
APA, Harvard, Vancouver, ISO, and other styles
37

Geng, Shijie, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le Roux, Yongfeng Zhang, Hongsheng Li, and Anoop Cherian. "Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 2 (May 18, 2021): 1415–23. http://dx.doi.org/10.1609/aaai.v35i2.16231.

Full text
Abstract:
Given an input video, its associated audio, and a brief caption, the audio-visual scene aware dialog (AVSD) task requires an agent to indulge in a question-answer dialog with a human about the audio-visual content. This task thus poses a challenging multi-modal representation learning and reasoning scenario, advancements into which could influence several human-machine interaction applications. To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework, consisting of a sequence of Transformer modules, each taking a modality as input and producing representations conditioned on the input question. Our proposed Transformer variant uses a shuffling scheme on their multi-head outputs, demonstrating better regularization. To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations for every frame, and an inter-frame aggregation module capturing temporal cues. Our entire pipeline is trained end-to-end. We present experiments on the benchmark AVSD dataset, both on answer generation and selection tasks. Our results demonstrate state-of-the-art performances on all evaluation metrics.
APA, Harvard, Vancouver, ISO, and other styles
38

Zhu, Yongxin, Zhen Liu, Yukang Liang, Xin Li, Hao Liu, Changcun Bao, and Linli Xu. "Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 9 (June 26, 2023): 11479–87. http://dx.doi.org/10.1609/aaai.v37i9.26357.

Full text
Abstract:
In this paper, we propose a novel multi-modal framework for Scene Text Visual Question Answering (STVQA), which requires models to read scene text in images for question answering. Apart from text or visual objects, which could exist independently, scene text naturally links the text and visual modalities together by conveying linguistic semantics while simultaneously being a visual object in an image. Different from conventional STVQA models, which take the linguistic semantics and visual semantics in scene text as two separate features, in this paper we propose a paradigm of "Locate Then Generate" (LTG), which explicitly unifies these two semantics with the spatial bounding box as a bridge connecting them. Specifically, LTG first locates the region in an image that may contain the answer words with an answer location module (ALM) consisting of a region proposal network and a language refinement network, both of which can be transformed into each other with a one-to-one mapping via the scene text bounding box. Next, given the answer words selected by ALM, LTG generates a readable answer sequence with an answer generation module (AGM) based on a pre-trained language model. As a benefit of the explicit alignment of the visual and linguistic semantics, even without any scene-text-based pre-training tasks, LTG can boost the absolute accuracy by +6.06% and +6.92% on the TextVQA dataset and the ST-VQA dataset respectively, compared with a non-pre-training baseline. We further demonstrate that LTG effectively unifies visual and text modalities through the spatial bounding box connection, which is underappreciated in previous methods.
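To make the two-stage flow concrete, a toy Python schematic follows; the confidence-based filter and reading-order stitching below are placeholder stand-ins for the ALM and AGM networks, not the paper's implementation.

from dataclasses import dataclass
from typing import List

@dataclass
class SceneTextToken:
    word: str
    box: tuple        # (x1, y1, x2, y2) bounding box in image coordinates
    ocr_conf: float   # OCR detector confidence

def answer_location_module(tokens: List[SceneTextToken], top_k: int = 2) -> List[SceneTextToken]:
    """Stand-in for the ALM: a real system scores each box with a question-conditioned
    region proposal network; here we simply keep the top-k OCR tokens."""
    return sorted(tokens, key=lambda t: t.ocr_conf, reverse=True)[:top_k]

def answer_generation_module(located: List[SceneTextToken]) -> str:
    """Stand-in for the AGM: a real system feeds the located words to a pre-trained
    language model; here we just stitch them together in reading order."""
    ordered = sorted(located, key=lambda t: (t.box[1], t.box[0]))
    return " ".join(t.word for t in ordered)

tokens = [SceneTextToken("CENTRAL", (10, 10, 90, 40), 0.95),
          SceneTextToken("PERK", (100, 10, 160, 40), 0.93),
          SceneTextToken("EST.1994", (10, 50, 120, 70), 0.40)]
print(answer_generation_module(answer_location_module(tokens)))   # CENTRAL PERK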
APA, Harvard, Vancouver, ISO, and other styles
39

Gil, Bruno. "Digital redux: the confluence of technologies and politics in architecture." Architectural Research Quarterly 19, no. 3 (September 2015): 259–68. http://dx.doi.org/10.1017/s135913551500055x.

Full text
Abstract:
Much has been written on the impact of digital technology and its translation into architectural practice and education. This paper reconsiders this process of integration to question current trends in research and pedagogy. Recent efforts to expand architectural research in schools tend to focus on broadening its spectrum while reinforcing design as research. We argue that the relevance of digital technology to these discussions, and its role in expanding fields of research, depends on different cultures of investigation and their differing institutional contexts. Our interest is in cultures that, as lines of thought, broaden the field of architectural research and make it heterodox, thickening a line of research with multiple interpretations. Tracing early experiences in the ‘paperless’ studios of Columbia's Graduate School of Architecture Planning and Preservation, this paper questions how architectural pedagogies are developing in response to the current normalisation of digital design, with a focus on the Architectural Association's Design Research Lab (DRL) and the Strelka Institute in Moscow. Ultimately, the main question is addressed: in what ways can the digital be political? By referring to the DRL and Strelka research programmes, two distinct approaches have been critically explored. On the one hand, DRL has been pushing to the limit the idea of research by design, considering autonomous form as the materialisation of change in the design process, while Strelka has been practicing research as the information for design. If digital technology contributes to form generation at DRL, at Strelka it potentiates opinion generation, and the research product is information, rather than form itself. The triangulation of both approaches could eventually suggest a more thorough political expression by means of a digital redux.
APA, Harvard, Vancouver, ISO, and other styles
40

Sevastjanova, Rita, Wolfgang Jentner, Fabian Sperrle, Rebecca Kehlbeck, Jürgen Bernard, and Mennatallah El-assady. "QuestionComb: A Gamification Approach for the Visual Explanation of Linguistic Phenomena through Interactive Labeling." ACM Transactions on Interactive Intelligent Systems 11, no. 3-4 (December 31, 2021): 1–38. http://dx.doi.org/10.1145/3429448.

Full text
Abstract:
Linguistic insight in the form of high-level relationships and rules in text builds the basis of our understanding of language. However, the data-driven generation of such structures often lacks labeled resources that can be used as training data for supervised machine learning. The creation of such ground-truth data is a time-consuming process that often requires domain expertise to resolve text ambiguities and characterize linguistic phenomena. Furthermore, the creation and refinement of machine learning models is often challenging for linguists, as the models are often complex, opaque, and difficult to understand. To tackle these challenges, we present a visual analytics technique for interactive data labeling that applies concepts from gamification and explainable Artificial Intelligence (XAI) to support complex classification tasks. The visual-interactive labeling interface promotes the creation of effective training data. Visual explanations of learned rules unveil the decisions of the machine learning model and support iterative and interactive optimization. The gamification-inspired design guides the user through the labeling process and provides feedback on the model performance. As an instance of the proposed technique, we present QuestionComb, a workspace tailored to the task of question classification (i.e., into information-seeking vs. non-information-seeking questions). Our evaluation studies confirm that gamification concepts are beneficial to engage users through continuous feedback, offering an effective visual analytics technique when combined with active learning and XAI.
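The active-learning loop that interactive labeling tools of this kind typically wrap can be sketched as follows; the classifier, toy data, and query size are placeholders and not the QuestionComb system itself.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 20))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)          # toy "information-seeking or not" labels

labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])   # small seed set
pool = [i for i in range(len(X)) if i not in labeled]

for round_ in range(5):
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])[:, 1]
    uncertainty = np.abs(proba - 0.5)                  # least confident = closest to 0.5
    query = [pool[i] for i in np.argsort(uncertainty)[:5]]
    # In the interactive tool this is where the user labels `query` and inspects
    # visual explanations of the learned rules before the next round.
    labeled += query
    pool = [i for i in pool if i not in query]
    print(f"round {round_}: accuracy on remaining pool = {clf.score(X[pool], y[pool]):.2f}")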
APA, Harvard, Vancouver, ISO, and other styles
41

Halwani, Noha. "Visual Aids and Multimedia in Second Language Acquisition." English Language Teaching 10, no. 6 (May 25, 2017): 53. http://dx.doi.org/10.5539/elt.v10n6p53.

Full text
Abstract:
Education involves more than simply passing the final test. Rather, it is the process of educating an entire generation. This research project focused on language learners of English as a Second Language. This action research was conducted in an ESL classroom in H. Frank Carey High School, one of five high schools in the Sewanhaka Central District of Nassau County. The research project explored the question: “Can visual aids improve English language acquisition in reading and writing for a beginner ESL?” The data analyzed were log observation sheets, pull-out focus groups, checklists, and surveys of students. The basic findings were that reading and writing improved when teachers used visual aids, especially when teachers pulled students out of the classroom for individualized instruction. Therefore, the study concluded that the use of visual aids and multimedia can help students absorb the content and become interactive in the classroom, with no fear of giving wrong answers or of having trouble participating in class because of shyness.
APA, Harvard, Vancouver, ISO, and other styles
42

Moore, Bartlett D., Henry J. Alitto, and W. Martin Usrey. "Orientation Tuning, But Not Direction Selectivity, Is Invariant to Temporal Frequency in Primary Visual Cortex." Journal of Neurophysiology 94, no. 2 (August 2005): 1336–45. http://dx.doi.org/10.1152/jn.01224.2004.

Full text
Abstract:
The activity of neurons in primary visual cortex is influenced by the orientation, contrast, and temporal frequency of a visual stimulus. This raises the question of how these stimulus properties interact to shape neuronal responses. While past studies have shown that the bandwidth of orientation tuning is invariant to stimulus contrast, the influence of temporal frequency on orientation-tuning bandwidth is unknown. Here, we investigate the influence of temporal frequency on orientation tuning and direction selectivity in area 17 of ferret visual cortex. For both simple cells and complex cells, measures of orientation-tuning bandwidth (half-width at half-maximum response) are ∼20–25° across a wide range of temporal frequencies. Thus cortical neurons display temporal-frequency invariant orientation tuning. In contrast, direction selectivity is typically reduced, and occasionally reverses, at nonpreferred temporal frequencies. These results show that the mechanisms contributing to the generation of orientation tuning and direction selectivity are differentially affected by the temporal frequency of a visual stimulus and support the notion that stability of orientation tuning is an important aspect of visual processing.
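As a side note on the bandwidth measure used here, the half-width at half-maximum can be read off a Gaussian fit to an orientation tuning curve; the short sketch below uses simulated firing rates and is not taken from the study.

import numpy as np
from scipy.optimize import curve_fit

def gaussian(theta, baseline, amplitude, pref, sigma):
    return baseline + amplitude * np.exp(-0.5 * ((theta - pref) / sigma) ** 2)

orientations = np.arange(-90, 91, 15).astype(float)             # stimulus orientations (deg)
rates = gaussian(orientations, 2.0, 30.0, 10.0, 20.0)           # idealized responses
rates += np.random.default_rng(0).normal(0.0, 1.0, rates.size)  # add simulated noise

params, _ = curve_fit(gaussian, orientations, rates, p0=[0, 20, 0, 30])
hwhm = abs(params[3]) * np.sqrt(2 * np.log(2))                  # half-width at half-maximum
print(f"orientation-tuning HWHM ~ {hwhm:.1f} deg")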
APA, Harvard, Vancouver, ISO, and other styles
43

Wardani, Winny Gunarti Widya, and Ahmad Faiz Muntazori. "Islamic Memes as Media of Da'wah for Millennials Generations: Analysis of Visual Language On Islamic Memes With Illustration Style." Cultural Syndrome 1, no. 1 (July 23, 2019): 61–78. http://dx.doi.org/10.30998/cs.v1i1.16.

Full text
Abstract:
Islam as a religion of da'wah has obliged every Muslim to play a role in spreading the truth of the Qur'an. In the era of information technology like today, the spread of Islamic teachings can be done in various ways, including through memes. For millennials who are proficient with technology, Islamic memes are an alternative media for da'wah. This is due to the power of memes in conveying messages through image visualization and humour-style text. Islamic memes are generally distributed via the internet and messaging applications on smartphones. Most Islamic memes are designed using illustration styles. To understand the visual language of memes, this study formulates the question of how to read visual signs in Islamic memes as da'wah media, since da'wah in memes takes the form not only of written text but also of images. This study uses a combination method, which combines quantitative and qualitative approaches. Quantitatively, this study collects data about the views of the millennial generation on the attractiveness of illustration-style Islamic memes. Qualitatively, an analysis of samples of illustration-style Islamic memes uses semiotic theory to examine the structure of design elements as the visual language of da'wah messages. The results of this study are expected to be a reference for the scientific field of visual communication design, as well as to encourage the creation of more productive and communicative Islamic memes as da'wah media for millennial generations.
APA, Harvard, Vancouver, ISO, and other styles
44

Busch, Steffen, Alexander Schlichting, and Claus Brenner. "Generation and communication of dynamic maps using light projection." Proceedings of the ICA 1 (May 16, 2018): 1–8. http://dx.doi.org/10.5194/ica-proc-1-16-2018.

Full text
Abstract:
Many accidents are caused by miscommunication between traffic participants. Much research is being conducted in the area of car-to-car and car-to-infrastructure communication in order to eliminate this cause of accidents. However, less attention is paid to the question of how the behavior of a car can be communicated to pedestrians. Especially considering automated traffic, there is a lack of communication between cars and pedestrians. In this paper, we address the question of how an autonomously driving car can inform pedestrians about its intentions. Especially in the case of highly automated driving, making eye contact with a driver will give no clue about his or her intentions. We developed a prototype which continuously informs pedestrians about the intentions of the vehicle by projecting visual patterns onto the ground. Furthermore, the system communicates its interpretation of the observed situation to the pedestrians to warn them or to encourage them to perform a certain action. In order to communicate adaptively, the vehicle needs to develop an understanding of the dynamics of a city to know what to expect in certain situations and what speed is appropriate. To support this, we created a dynamic map, which estimates the number of pedestrians and cyclists in a certain area, which is then used to determine how ‘hazardous’ the area is. This dynamic map is obtained from measurement data from many time instances, in contrast to the static car navigation maps which are prevalent today. Apart from being used for communication purposes, the dynamic map can also influence the speed of a car, be it manually or autonomously driven. Adapting the speed in hazardous areas will avoid accidents where a car drives too fast, so that neither a human nor a computer-operated system would be able to stop in time.
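The dynamic-map idea can be illustrated with a simple grid aggregation of pooled detections; the cell size, extent, and speed rule below are invented for the example rather than taken from the prototype.

import numpy as np

def dynamic_map(observations, cell_size=10.0, extent=100.0):
    """observations: (x, y) pedestrian/cyclist detections in metres, pooled over
    many drives; returns per-cell counts normalised to a 0-1 hazard score."""
    n = int(extent // cell_size)
    counts = np.zeros((n, n))
    for x, y in observations:
        i, j = int(x // cell_size), int(y // cell_size)
        if 0 <= i < n and 0 <= j < n:
            counts[i, j] += 1
    return counts / counts.max() if counts.max() > 0 else counts

obs = np.random.default_rng(1).uniform(0, 100, size=(500, 2))  # simulated detections
hazard = dynamic_map(obs)
advised_speed = 30 - 20 * hazard                               # slow down in busy cells (km/h)
print(f"slowest advised speed: {advised_speed.min():.1f} km/h")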
APA, Harvard, Vancouver, ISO, and other styles
45

Ma, Han, Baoyu Fan, Benjamin K. Ng, and Chan-Tong Lam. "VL-Few: Vision Language Alignment for Multimodal Few-Shot Meta Learning." Applied Sciences 14, no. 3 (January 30, 2024): 1169. http://dx.doi.org/10.3390/app14031169.

Full text
Abstract:
Complex real-world tasks, such as visual question answering (VQA), involve models of different modalities. However, traditional multimodal learning requires a large amount of aligned data, such as image-text pairs, and constructing a large amount of training data is a challenge for multimodal learning. Therefore, we propose VL-Few, a simple and effective method to solve the multimodal few-shot problem. VL-Few (1) proposes modal alignment, which aligns visual features into the language space through a lightweight model network and improves the multimodal understanding ability of the model; (2) adopts few-shot meta learning for the multimodal problem, constructing a few-shot meta task pool to improve the generalization ability of the model; (3) proposes semantic alignment to enhance the semantic understanding ability of the model for the task, context, and demonstration; (4) proposes task alignment, which constructs training data into the target task form and improves the task understanding ability of the model; (5) proposes generation alignment, which adopts token-level training and a multitask fusion loss to improve the generation ability of the model. Our experimental results show the effectiveness of VL-Few for multimodal few-shot problems.
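Point (1), modal alignment, can be pictured as a lightweight projection of frozen visual features into the language model's embedding space; the sketch below is a generic, assumption-based illustration (dimensions and layers chosen arbitrarily), not the VL-Few code.

import torch
import torch.nn as nn

class VisualToLanguageAligner(nn.Module):
    """Maps a pooled visual feature to a few pseudo-word embeddings that a
    language model could consume alongside text tokens."""
    def __init__(self, visual_dim=768, lm_dim=512, n_prefix_tokens=4):
        super().__init__()
        self.n_prefix_tokens, self.lm_dim = n_prefix_tokens, lm_dim
        self.proj = nn.Sequential(
            nn.Linear(visual_dim, lm_dim * n_prefix_tokens),
            nn.GELU(),
            nn.Linear(lm_dim * n_prefix_tokens, lm_dim * n_prefix_tokens),
        )

    def forward(self, visual_feats):                       # (B, visual_dim)
        out = self.proj(visual_feats)
        return out.view(-1, self.n_prefix_tokens, self.lm_dim)

prefix = VisualToLanguageAligner()(torch.randn(2, 768))    # (2, 4, 512) pseudo-word embeddings
print(prefix.shape)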
APA, Harvard, Vancouver, ISO, and other styles
46

Riaubiene, Edita, Eglė Navickeinė, and Dalia Dijokienė. "The profile of Lithuanian architects in relation to the professional generations active today." Landscape architecture and art 22, no. 22 (December 20, 2023): 69–80. http://dx.doi.org/10.22616/j.landarchart.2023.22.07.

Full text
Abstract:
The research focuses on the professional profile of architects by analyzing their identity and creative principles. The aim is to explore the professional community of Lithuanian architects who are currently shaping the built environment, to identify their heterogeneity in terms of professional generations. The problem of the research is shaped by the current controversies in the field of architecture concerning the changing status, activities, and responsibilities of the architect. The relevance of the study lies in several aspects: the lack of in-depth sociological research on the professional community of Lithuanian architects; the attempt to verify and clarify the results of the semi-structured interview study Lithuanian Architects on Architecture; and the reflection on the global architectural situation and the new agenda for architectural design towards a high-quality built environment. The study adopted a mixed-methods research design. This involved the collection, analysis, and interpretation of both quantitative and qualitative data. This methodology was chosen because the research requires a complex and multifaceted approach to the phenomenon of architecture and the problems of architectural practice. It also allowed a larger group of research participants to be reached (450 respondents). The questionnaire contains 13 questions, each structured in a multiple-choice format, with one option being an open-ended question. The questions are grouped under several themes: 1) the nature and fields of architectural practice and the concept of architecture; 2) the scope of practice and the allocation of professional time; 3) self-determination and professional loyalty; and 4) creative principles. Descriptive statistical methods were used to process the survey data. Content analysis and, to some extent, thematic analysis were used to analyze the data from open-ended questions. The study highlights that the professional generations of architects analyzed follow the general trend of architecture, refuting the hypothesis that the approach of each generation is significantly different. However, it has been observed that the representatives of each generation show a particular attitude in a specific area, which indicates the dynamics of an attitude or predicts a change in the architectural community as a whole. The youngest generation of architects is an indicator of change. It is characterized by seeing a great diversity of aspects in architecture and architectural practice.
APA, Harvard, Vancouver, ISO, and other styles
47

Umanskaya, Zhanna V. "VISUALIZATION OF SOVIET CHILDHOOD IN DRAWINGS BY EUGENIYA DVOSKINA." RSUH/RGGU Bulletin. "Literary Theory. Linguistics. Cultural Studies" Series, no. 8 (2020): 96–115. http://dx.doi.org/10.28995/2686-7249-2020-8-96-115.

Full text
Abstract:
The author explores ways of visualizing the everyday life of Soviet childhood in the Brezhnev period in Eugeniya Dvoskina's drawing cycle «#forthosewhoremember». Comparing the artist's work with other modern visual nostalgic projects, the significance of the selected source is justified: this cycle allows us to give an idea of the visual environment of the child, typical kinds of children's territory, and public and private areas in the collective memory of the generation. Based on the methodology of visual sociology (P. Shtompka, O.V. Gavrishina), the author analyzes the reasons why the older generation perceives the cycle as uniquely “Soviet” and raises the question of the markers of “Soviet childhood”. The universality and heritability of many children's practices make them timeless, so the design of the material world and the symbols of Soviet ideology are the main signs of the historical era. Compositional and graphic solutions of the images play an important role in the viewer's perception. Knowledge of nature and artistic skill allow the artist to create heroes with accurate behavioral characteristics and to evoke, in addition to visual memory, almost all types of sensory memory (tactile, motor, audio). The use of accompanying text, often in the form of speech formulas, is crucial for this effect. If we consider this cycle in the logic of S. Boym's reasoning about nostalgia, the drawings about Soviet childhood can be attributed to the procedural type of nostalgia, which is characterized by irony and a contradictory attitude to the past. Eugeniya Dvoskina's work provides a complex, multi-faceted visualization of the everyday life of Soviet childhood in the 1960s–1980s.
APA, Harvard, Vancouver, ISO, and other styles
48

Li, Yehao, Jiahao Fan, Yingwei Pan, Ting Yao, Weiyao Lin, and Tao Mei. "Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training." ACM Transactions on Multimedia Computing, Communications, and Applications 18, no. 2 (May 31, 2022): 1–16. http://dx.doi.org/10.1145/3473140.

Full text
Abstract:
Vision-language pre-training has been an emerging and fast-developing research topic, which transfers multi-modal knowledge from rich-resource pre-training tasks to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer-based structure consisting of three modules: object and sentence encoders that separately learn the representations of each modality, and a sentence decoder that enables both multi-modal reasoning and sentence generation via inter-modal interaction. Considering that the linguistic representations of each image can span different granularities in this hierarchy, including, from simple to comprehensive, an individual label, a phrase, and a natural sentence, we pre-train Uni-EDEN through multi-granular vision-language proxy tasks: Masked Object Classification, Masked Region Phrase Generation, Image-Sentence Matching, and Masked Sentence Generation. In this way, Uni-EDEN is endowed with the power of both multi-modal representation extraction and language modeling. Extensive experiments demonstrate the compelling generalizability of Uni-EDEN by fine-tuning it on four vision-language perception and generation downstream tasks.
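One of the listed proxy tasks, masked prediction over the sentence stream, can be sketched generically as below; the tiny encoder, vocabulary size, and masking rate are placeholders rather than Uni-EDEN itself.

import torch
import torch.nn as nn

VOCAB, MASK_ID, PAD_ID = 1000, 1, 0

def mask_tokens(tokens, rate=0.15):
    """Replace a random subset of tokens with [MASK]; labels are -100 elsewhere
    so the loss is computed only on masked positions."""
    mask = (torch.rand_like(tokens, dtype=torch.float) < rate) & (tokens != PAD_ID)
    labels = torch.where(mask, tokens, torch.full_like(tokens, -100))
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)
    return corrupted, labels

encoder = nn.Sequential(
    nn.Embedding(VOCAB, 256),
    nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
                          num_layers=2),
)
head = nn.Linear(256, VOCAB)

tokens = torch.randint(2, VOCAB, (8, 20))                 # a toy batch of token ids
corrupted, labels = mask_tokens(tokens)
logits = head(encoder(corrupted))                         # (8, 20, VOCAB)
loss = nn.CrossEntropyLoss(ignore_index=-100)(logits.flatten(0, 1), labels.flatten())
print(loss.item())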
APA, Harvard, Vancouver, ISO, and other styles
49

Ladai, A. D., and J. Miller. "Point Cloud Generation from sUAS-Mounted iPhone Imagery: Performance Analysis." ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XL-1 (November 7, 2014): 201–5. http://dx.doi.org/10.5194/isprsarchives-xl-1-201-2014.

Full text
Abstract:
The rapidly growing use of sUAS technology and fast sensor developments continuously inspire mapping professionals to experiment with low-cost airborne systems. Smartphones have all the sensors used in modern airborne surveying systems, including GPS, IMU, camera, etc. Of course, the performance level of the sensors differs by orders of magnitude, yet it is intriguing to assess the potential of using inexpensive sensors installed on sUAS systems for topographic applications. This paper focuses on the quality analysis of point clouds generated from overlapping images acquired by an iPhone 5s mounted on a sUAS platform. To support the investigation, test data were acquired over an area with complex topography and varying vegetation. In addition, extensive ground control, including GCPs and transects, was collected with GPS and traditional geodetic surveying methods. The statistical and visual analysis is based on a comparison of the UAS data and the reference dataset. The results of the evaluation provide a realistic measure of data acquisition system performance. The paper also gives a recommendation for a data processing workflow to achieve the best quality of the final products: the digital terrain model and orthophoto mosaic. After a successful data collection, the main question is always the reliability and accuracy of the georeferenced data.
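The reliability question raised in the closing sentence is usually answered by comparing the generated cloud against surveyed control; a minimal nearest-neighbour check might look like the following, with all coordinates invented for the example.

import numpy as np
from scipy.spatial import cKDTree

cloud = np.random.default_rng(2).uniform(0, 50, size=(10000, 3))   # generated point cloud (m)
gcps = np.array([[10.0, 10.0, 5.0],
                 [25.0, 40.0, 6.2],
                 [42.0, 12.0, 4.8]])                               # surveyed control points (m)

dists, _ = cKDTree(cloud).query(gcps)   # distance from each GCP to its nearest cloud point
rmse = np.sqrt(np.mean(dists ** 2))
print(f"per-GCP distances (m): {np.round(dists, 2)}, RMSE = {rmse:.2f} m")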
APA, Harvard, Vancouver, ISO, and other styles
50

de Vries, Jan. "Renaissance Cities." Renaissance Quarterly 42, no. 4 (1989): 781–93. http://dx.doi.org/10.2307/2862282.

Full text
Abstract:
“What does economic history have to do with Renaissance scholarship?” This is the question I asked myself when I was asked to participate in a panel with the title “Recent Trends in Renaissance Scholarship: Economic History.” Over a generation ago economic history escaped from the confines of conventional historical periodization, in which the Renaissance functions as the keystone, with its claim to being the origin of modernity. This conventional periodization, with its inconsistent mingling of political and cultural criteria for the organization of the narrative of modern history, makes whole categories of historical questions almost impossible to ask, let alone to answer. For many economic historians—and I count myself among them—it was a liberation to abandon all this in favor of a periodizing structure determined by long trends in population, price levels, relative prices, and other phenomena associated with these.
APA, Harvard, Vancouver, ISO, and other styles
