Academic literature on the topic 'Multimodal document understanding'

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Multimodal document understanding.'

Journal articles on the topic "Multimodal document understanding"

1

Cho, Seongkuk, Jihoon Moon, Junhyeok Bae, Jiwon Kang, and Sangwook Lee. "A Framework for Understanding Unstructured Financial Documents Using RPA and Multimodal Approach." Electronics 12, no. 4 (February 13, 2023): 939. http://dx.doi.org/10.3390/electronics12040939.

Abstract:
Financial business processes worldwide depend heavily on manual labor and written documents, which makes them tedious and time-consuming. To address this problem, traditional robotic process automation (RPA) has recently been developed into a hyper-automation solution by combining computer vision (CV) and natural language processing (NLP) methods. These solutions are capable of image analysis tasks such as key information extraction and document classification. However, they leave room for improvement on text-rich document images and require large amounts of training data to process multilingual documents. This study proposes an intelligent document processing framework based on a multimodal approach that combines a pre-trained deep learning model with the traditional RPA used in banks to automate business processes from real-world financial document images. The proposed framework can perform classification and key information extraction with a small amount of training data and can analyze multilingual documents. To evaluate the effectiveness of the proposed framework, extensive experiments were conducted using Korean financial document images. The experimental results show the superiority of the multimodal approach for understanding financial documents and demonstrate that adequate labeling can improve performance by up to about 15%.
2

Meskill, Carla, Jennifer Nilsen, and Alan Oliveira. "Intersections of Language, Content, and Multimodalities: Instructional Conversations in Mrs. B’s Sheltered English Biology Classroom." AERA Open 5, no. 2 (April 2019): 233285841985048. http://dx.doi.org/10.1177/2332858419850488.

Abstract:
The challenges inherent in mastering academic content in a new language are many. When it comes to learning science in U.S. high schools, English learners (ELs) confront these on a daily basis. In an effort to document expert language/content instructional strategies, we analyze Mrs. B’s sheltered high school biology class, made up of ELs from around the world and representing varying stages of emerging bilingualism. The aim of this 2-year case study was to detail effective teaching patterns in a high-functioning multicultural science class—a class where the myriad linguistic, cultural, and affective needs of students are expertly met—and to subsequently suggest a model for understanding and undertaking powerful language and content learning supported by multimodal referents. From a rich data set comprising class recordings, interviews, reflections from Mrs. B, course documents, student work, and survey responses emerged a model of the language/content multimodal interface for teaching ELs.
3

Nugrahawati, Ana Wiyasa. "Teaching Religious Tolerance Through Critical and Evaluative Reading Course for English Language Education Students." ELE Reviews: English Language Education Reviews 3, no. 1 (May 31, 2023): 33–45. http://dx.doi.org/10.22515/elereviews.v3i1.6611.

Abstract:
Religious tolerance is crucial for building good intercultural interaction among people from different religious backgrounds. In the context of teaching, the critical and evaluative reading course is one of the courses that can help students foster religious tolerance. This research aims to investigate the implementation of a critical and evaluative reading course in building students’ religious tolerance. Taking the case of UIN Raden Mas Said, this descriptive research collected data through interviews, observation, and document analysis. The findings showed that the fundamental materials for practicing reading comprehension in the critical and evaluative reading course are multimodal texts addressing religious, cultural, and value practices and beliefs, taken from various media, printed or online. The teaching strategy was reading to learn, which helps students build critical thinking. The students were able to demonstrate an understanding of religious tolerance during the study period. This implies that religious tolerance can be cultivated through reading courses that use multimodal texts, which can help students in their daily intercultural interactions.
4

Halverson, Erica Rosenfeld. "Film as Identity Exploration: A Multimodal Analysis of Youth-Produced Films." Teachers College Record: The Voice of Scholarship in Education 112, no. 9 (September 2010): 2352–78. http://dx.doi.org/10.1177/016146811011200903.

Abstract:
Background/Context: Researchers have begun to document and understand the work youth do as they compose in multiple media including video games, online virtual worlds, participatory fan cultural practices, and in the digital media arts. However, we lack mechanisms for analyzing the products, especially when it comes to understanding the relationship between storytelling and identity. Objective: In this article, I bring together prior research on youth-produced media, social semiotic analysis frameworks for analyzing these products and the formal analysis of films to construct an analytic framework for understanding youth-produced films as spaces for identity construction and representation. Research design: The research reported on in this article is the design and illustration of an analytic framework for understanding how youth construct and represent their identities through the films they make. The framework design begins with Kress and van Leeuwen's (2006) work on the analysis of visual design as a set of semiotic resources for describing how we make meaning with multimodal texts. However, this work does little to depict how the specific tools of film, both cinematic (e.g., editing, cinematography) and filmic (music, action) (Burn & Parker, 2003), are used to construct and communicate identities. Therefore, I turn to film theory to develop a coding scheme that can assist in the meaningful interpretation of the phases and transitions of youth-produced films. I then illustrate this framework in action by analyzing one youth-produced film, Rules of Engagement, as a multimodal product of identity. Conclusions/Recommendations: This analysis demonstrates how films like Rules of Engagement display the construction of a viable social identity primarily through the interactions among filmic elements.
Specifically it is in the transition spaces between phases of the film where youth actively insert their understanding of how to represent complex portraits of how they see themselves, how others see them, and how they fit into their communities. Analyzing the products of a rich, complex literacy practice is a critical way to make sense of how youth engage with issues of identity through the media they create. This is especially important for youth who feel marginalized in mainstream institutions and do not have opportunities to explore a positive sense of self in traditional institutional contexts. Understanding how the construction of multimodal representation supports identity development processes can help us to bring these new media literacy practices to youth who are most in need of alternative mechanisms for engaging in positive identity work.
5

Troshchenkova, E. V., and E. A. Rudneva. "THE CONCEPT OF LEGAL DOCUMENT IN THE PROFESSIONAL SPHERE." Voprosy Kognitivnoy Lingvistiki, no. 1 (2023): 32–42. http://dx.doi.org/10.20916/1812-3228-2023-1-32-42.

Abstract:
The article aims at analyzing LEGAL DOCUMENT within the framework of conceptualization to create specific forms of mental representations such as scientific concepts. This specific case is used to model the formation of special knowledge and diagnose the problems that the expert community may encounter when using the classical attribute approach with binary oppositions in the content of the defined concept. We tried to show how both in written and spoken discourse lawyers fail to find common and essential features, which would unite all the elements included in the concept of LEGAL DOCUMENT and simultaneously differentiate it from documents of non-legal nature. Despite the fact that the phrase “legal document” is repeatedly mentioned in textbooks on the theory of state and law, often a self-evident expression, legal researchers admit that the concept of LEGAL DOCUMENT is difficult to define and there is a lot of controversy about it within the professional community. The study considered a) fragments of theoretical works (articles and monographs) and textbooks with explicit definitions of “legal document” and discussions of definitions by other authors, as well as other contexts of using “legal document” in scientific legal discourse and legal documents themselves; and b) oral statements of practicing lawyers on their understanding of what a “legal document” is - fragments of 5 semi-structured interviews. Cognitive-discursive and socio- and anthropolinguistic approaches were used for material analysis. Structural, lexical-semantic and conceptual analysis of the proposed definitions and quasi-definitions, as well as conversational analysis of the interviews were carried out. 
Individual statements were further considered in the broader context of reasoning about the problem, taking into account the general logic of argumentation development, the coherence/inconsistency of judgments both by different speakers and in the reasoning of one speaker, contradictions of examples of the formulated position, focusing/defocusing. Conversational analysis also took into account hesitation markers, prosody and extralinguistic multimodal data to reason about mental processes of the interviewees. The study shows that we seem to be dealing with an attempt to delineate with the traditional logical definition the boundaries of a scientific concept, which is based on the pre-existing and well-formed fragment of everyday knowledge, having slightly different structure and resisting such definition methods. As a concept of everyday consciousness, it would seem productive to describe the LEGAL DOCUMENT from the position of family resemblance as a fuzzy set of partially overlapping elements (without uniform feature(s), or some of them being a continuum of graded parameters). Such mental representation could be conveniently described through the idea of prototypes with good and bad examples of the category. However, the lawyers in legal discourse intermittently try to use the concept as one of everyday consciousness and a scientific formation and are not fully aware of the degree of difference. As a result, we see how logical contradictions in the professional discourse are intensified.
6

Wang, Jiapeng, Chongyu Liu, Lianwen Jin, Guozhi Tang, Jiaxin Zhang, Shuaitao Zhang, Qianying Wang, Yaqiang Wu, and Mingxiang Cai. "Towards Robust Visual Information Extraction in Real World: New Dataset and Novel Solution." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 4 (May 18, 2021): 2738–45. http://dx.doi.org/10.1609/aaai.v35i4.16378.

Abstract:
Visual Information Extraction (VIE) has attracted considerable attention recently owing to its various advanced applications such as document understanding, automatic marking and intelligent education. Most existing works decouple this problem into several independent sub-tasks of text spotting (text detection and recognition) and information extraction, which completely ignores the high correlation among them during optimization. In this paper, we propose a robust Visual Information Extraction System (VIES) for real-world scenarios: a unified end-to-end trainable framework for simultaneous text detection, recognition and information extraction that takes a single document image as input and outputs the structured information. Specifically, the information extraction branch collects abundant visual and semantic representations from text spotting for multimodal feature fusion and, conversely, provides higher-level semantic clues that contribute to the optimization of text spotting. Moreover, given the shortage of public benchmarks, we construct a fully annotated dataset called EPHOIE (https://github.com/HCIILAB/EPHOIE), which is the first Chinese benchmark for both text spotting and visual information extraction. EPHOIE consists of 1,494 images of examination paper heads with complex layouts and backgrounds, including a total of 15,771 Chinese handwritten or printed text instances. Compared with state-of-the-art methods, our VIES shows significantly superior performance on the EPHOIE dataset and achieves a 9.01% F-score gain on the widely used SROIE dataset under the end-to-end scenario.
7

Maja, Inke Choirun Nisa’ Il, and Salim Nabhan. "Literacy in EFL Classroom: In-Service English Teachers’ Perceptions and Practices from Multiliteracies Perspective." JET ADI BUANA 7, no. 02 (October 31, 2022): 207–17. http://dx.doi.org/10.36456/jet.v7.n02.2022.7124.

Abstract:
The concept of literacy has evolved significantly over the years, with the advent of new technologies and the changing needs of society. In the context of ELT, the study of literacy conceptions and practices is still under-explored. Therefore, this study aims at exploring in-service English teachers’ perceptions and practices of literacy in ELT settings from a multiliteracies perspective. This research used a qualitative case study. The data were collected from in-service English teachers at one of the state high schools in Surabaya, Indonesia, through interviews, observation, and document review. The data were then analyzed using thematic analysis. The results indicated that the in-service English teachers perceived a general conception of literacy, a skill-based conception of literacy, and literacy as social practice. In general, the English teachers lacked a comprehensive understanding of the concept of literacy and associated literacy with skills. In addition, their literacy practices from a multiliteracies perspective covered the integration of multimodality in the use of media, the use of technology in teaching and learning activities, and a variety of literacy instruction. Despite some difficulties, the teachers utilized multiple modes of media and technology. This study might have implications for the understanding of the conception of literacy and teaching practices in an EFL setting.
8

Liu, Susan I., Morgan Shikar, Emily Gante, Patricia Prufeta, Kaylee Ho, Philip S. Barie, Robert J. Winchell, and Jennifer I. Lee. "Improving Communication and Response to Clinical Deterioration to Increase Patient Safety in the Intensive Care Unit." Critical Care Nurse 42, no. 5 (October 1, 2022): 33–43. http://dx.doi.org/10.4037/ccn2022295.

Abstract:
Background: In the critical care setting, early recognition of clinical decompensation is imperative to trigger prompt intervention and optimize patient outcomes. Local Problem: In a 20-bed surgical intensive care unit of an urban academic medical center, cases of clinical deterioration that highlighted opportunities to improve the communication process prompted a reassessment of health care provider roles and responsibilities. Methods: A quality improvement initiative was implemented to enhance communication among intensive care unit clinical staff members, improve the timeliness of reporting clinical deterioration, and ensure implementation of timely, appropriate interventions to eliminate adverse outcomes. Interventions: Nurses were surveyed to determine their perceptions of communication and collaboration among providers. Education was provided that focused on familiarizing nurses with clinical conditions necessitating direct notification of the attending surgical intensivist and included review of a case in which escalation of care did not occur. Multidisciplinary rounds were expanded to engage night-shift nurses in clinical discussions and decision-making. A template was created to document episodes of escalation in the electronic health record. Results: Since implementation of the quality improvement interventions, no incidents of patient harm or death related to failure to escalate have occurred to date. A total of 16 episodes of escalation for clinical deterioration were documented in the electronic health record. Most nurses reported an increased level of confidence in understanding when to escalate concerns about clinical deterioration. Conclusion: Implementing a multimodal program to empower nurses to escalate clinical concerns directly to the attending physician eliminated adverse events related to failure to escalate.
9

Sarti, Aimee J., Stephanie Sutherland, Andrew Healey, Sonny Dhanani, Angele Landriault, Frances Fothergill-Bourbonnais, Michael Hartwick, Janice Beitel, Simon Oczkowski, and Pierre Cardinal. "A Multicenter Qualitative Investigation of the Experiences and Perspectives of Substitute Decision Makers Who Underwent Organ Donation Decisions." Progress in Transplantation 28, no. 4 (September 16, 2018): 343–48. http://dx.doi.org/10.1177/1526924818800046.

Abstract:
Background: Organ donation research has centered on improving donation rates rather than focusing on the experience and impact on substitute decision makers. The purpose of this study was to document donor and nondonor family experiences, as well as lasting impacts of donation. Methods: We used a qualitative exploratory design. Semistructured interviews of 27 next-of-kin decision makers were conducted, transcribed verbatim, and entered into qualitative software. We analyzed the process-based reflections using inductive coding and thematic analysis techniques. Results: Four broad and interrelated themes emerged from the data: empathetic care, information needs, donation decision, and impact and follow-up. The donation experience left lasting impacts on family members due to lingering, unanswered questions. Suggested solutions to improve the donor experience for families included providers employing multimodal communication, ensuring a proper setting for family meetings, and the presence of a support person. Discussion: We now have improved our understanding of the donation process from the perspective of and final impression from the next of kin. To our knowledge, this is the largest cohort interviewed in Canada. We have explored families’ experiences, which included but did not end with donation. We learned that despite being appreciative of nurses, physicians, and organ and tissue donation coordinators, family members were often troubled by unanswered questions. Conclusion: This study described donor and nondonor family experiences with donation as well as lasting impacts. Addressing unanswered questions should be done in a place sufficiently remote from the donation event to enhance the family members’ understanding and well-being.
10

Rind, Esther, Klaus Kimpel, Christine Preiser, Falko Papenfuss, Anke Wagner, Karina Alsyte, Achim Siegel, et al. "Adjusting working conditions and evaluating the risk of infection during the COVID-19 pandemic in different workplace settings in Germany: a study protocol for an explorative modular mixed methods approach." BMJ Open 10, no. 11 (November 2020): e043908. http://dx.doi.org/10.1136/bmjopen-2020-043908.

Abstract:
Introduction: Currently, many countries, affected by the COVID-19 pandemic, discuss how the ‘lockdown-restrictions’ could be lifted to restart the economy and public life after the first wave of the COVID-19 disease has subsided. This study protocol describes an approach designed to provide an in-depth understanding of how companies and their employees in Germany deal with their working conditions during the COVID-19 pandemic. We are also interested in how and why the risk of infection with SARS-CoV-2 could vary across different professional activities, company sites and regions with different epidemiological activity or infection control measures in Germany. We expect the results of this study to contribute to the development of working conditions protecting the health of employees during and beyond the COVID-19 pandemic. Methods and analysis: An explorative multimodal mixed methods approach will be applied. Module 1 comprises a document analysis of prevailing federal and regional laws and regulations at the respective location of the participating company. Module 2 includes qualitative interviews with key actors at different companies. Module 3 is a repeated standardised employee survey designed to capture potential changes in the participants’ experiences and attitudes towards working conditions, occupational safety regulations/measures, and infection control measures during the COVID-19 pandemic. Module 4 comprises SARS-CoV-2 seroprevalence testing. This is carried out by the medical service of the participating company sites as a voluntary offer for employees. Qualitative data will be analysed through document and content analysis. The complexity of the quantitative analysis depends on the response rates of modules 3 and 4. Ethics and dissemination: The approval of the study design was received in June 2020 from the responsible local ethical committee of the Medical Faculty, University of Tübingen and University Hospital Tübingen (No. 423/2020BO). The results will be presented at national and international conferences and published in peer-reviewed journals.

Dissertations / Theses on the topic "Multimodal document understanding"

1

Bakkali, Souhail. "Multimodal Document Understanding with Unified Vision and Language Cross-Modal Learning." Electronic Thesis or Diss., La Rochelle, 2022. http://www.theses.fr/2022LAROS046.

Abstract:
The frameworks developed in this thesis are the outcome of an iterative process of analysis and synthesis between existing theories and our own studies. More specifically, we study cross-modal learning for contextualized comprehension of document components across language and vision. The main idea is to leverage multimodal information from document images in a common semantic space. This thesis advances research on cross-modal learning and makes contributions on four fronts: (i) proposing a cross-modal approach with deep networks that jointly leverages visual and textual information in a common semantic representation space to automatically make predictions about multimodal documents (i.e., the subject matter they are about); (ii) investigating competitive strategies for the tasks of cross-modal document classification, content-based retrieval, and few-shot document classification; (iii) addressing data-related issues such as learning when data is not annotated, by proposing a network that learns generic representations from a collection of unlabeled documents; and (iv) exploiting few-shot learning settings when the data contains only a few examples.
2

Delecraz, Sébastien. "Approches jointes texte/image pour la compréhension multimodale de documents." Thesis, Aix-Marseille, 2018. http://www.theses.fr/2018AIXM0634/document.

Abstract:
The human faculties of understanding are essentially multimodal. To understand the world around them, human beings fuse the information coming from all of their sensory receptors. Most of the documents used in automatic information processing are multimodal, for example text and images in textual documents or images and sound in video documents; however, the processing applied to them is most often monomodal. The aim of this thesis is to propose joint processing methods, applied mainly to text and images, for multimodal documents through two studies: one on multimodal fusion for speaker role recognition in television broadcasts, the other on the complementarity of modalities for a linguistic analysis task on corpora of images with captions. In the first study, we are interested in the analysis of audiovisual documents from television news channels. We propose an approach that uses deep neural networks to build a joint multimodal representation for the representation and fusion of modalities. In the second part of the thesis, we are interested in approaches that use several sources of multimodal information for a monomodal natural language processing task, in order to study their complementarity. We propose a complete system for correcting prepositional attachments using visual information, trained on a multimodal corpus of images with captions.
3

Vukotic, Verdran. "Deep Neural Architectures for Automatic Representation Learning from Multimedia Multimodal Data." Thesis, Rennes, INSA, 2017. http://www.theses.fr/2017ISAR0015/document.

Abstract:
This dissertation discusses the thesis that deep neural networks are suited to the analysis of visual content, textual content, and fused visual and textual content. The work evaluates the ability of deep neural networks to learn multimodal representations automatically, in either an unsupervised or a supervised manner, and makes the following main contributions: 1) Recurrent neural networks for spoken language understanding (slot filling): different architectures are compared for this task with the aim of modeling both the input context and the output label dependencies. 2) Image and motion prediction: we propose an architecture that learns a representation of a single image depicting a human action in order to predict how the motion evolves in a video; the originality of the model lies in its ability to predict frames at an arbitrary distance in the video. The architecture is evaluated on videos using only one frame as input. 3) Bidirectional multimodal encoders: the main contribution of this thesis is a neural architecture that translates from one modality to the other and back, and offers an improved multimodal representation space in which the initially disjoint representations can be translated and fused. This enables improved fusion of multiple modalities. The architecture was extensively studied and evaluated in international benchmarks on the task of video hyperlinking, where it defined the state of the art. 4) Generative adversarial networks for multimodal fusion: continuing on the topic of multimodal fusion, we evaluate the possibility of using conditional generative adversarial networks to learn multimodal representations. In addition to providing multimodal representations, generative adversarial networks make it possible to visualize the learned model directly in the image domain.
4

Mangin, Olivier. "Emergence de concepts multimodaux : de la perception de mouvements primitifs à l'ancrage de mots acoustiques." Thesis, Bordeaux, 2014. http://www.theses.fr/2014BORD0002/document.

Abstract:
This thesis considers the learning of recurring patterns in multimodal perception. It sets out to develop robotic models of these abilities as observed in children, and in doing so falls within the field of developmental robotics. More precisely, it is organized around two main themes: on the one hand, the ability of children or robots to imitate and understand human behavior, and on the other, language acquisition. At their intersection, we examine how a developing agent can discover a repertoire of primitive patterns in its perceptual stream. We specify this problem and establish its link with the indeterminacy of translation described by Quine and with blind source separation as studied in acoustics. We study four of its sub-problems in turn and formulate an experimental definition of each. Models of agents solving these problems are also described and tested. They rely in particular on so-called bag-of-words techniques, matrix factorization, and inverse reinforcement learning. We examine in depth, separately, the three problems of learning elementary sounds such as phonemes or words, basic dance movements, and primary objectives that make up complex motor tasks. Finally, we study the problem of learning multimodal primitive elements, which amounts to solving several of the preceding problems simultaneously. In particular, we explain how this provides a model of the grounding of acoustic words.
This thesis focuses on learning recurring patterns in multimodal perception. For that purpose it develops cognitive systems that model the mechanisms providing such capabilities to infants, a methodology that fits into the field of developmental robotics. More precisely, this thesis revolves around two main topics: on the one hand, the ability of infants or robots to imitate and understand human behaviors, and on the other, the acquisition of language. At the crossing of these topics, we study the question of how a developmental cognitive agent can discover a dictionary of primitive patterns from its multimodal perceptual flow. We specify this problem and formulate its links with Quine's indeterminacy of translation and with blind source separation, as studied in acoustics. We sequentially study four sub-problems and provide an experimental formulation of each of them. We then describe and test computational models of agents solving these problems. They are based in particular on bag-of-words techniques, matrix factorization algorithms, and inverse reinforcement learning approaches. We first go in depth into the three separate problems of learning primitive sounds, such as phonemes or words, learning primitive dance motions, and learning primitive objectives that compose complex tasks. Finally, we study the problem of learning multimodal primitive patterns, which amounts to solving several of the aforementioned problems simultaneously. We also detail how this last problem models the grounding of acoustic words.

Book chapters on the topic "Multimodal document understanding"

1

Cooney, Ciaran, Rachel Heyburn, Liam Madigan, Mairead O’Cuinn, Chloe Thompson, and Joana Cavadas. "Unimodal and Multimodal Representation Training for Relation Extraction." In Communications in Computer and Information Science, 450–61. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-26438-2_35.

Abstract:
Multimodal integration of text, layout and visual information has achieved SOTA results in visually rich document understanding (VrDU) tasks, including relation extraction (RE). However, despite its importance, evaluation of the relative predictive capacity of these modalities is less prevalent. Here, we demonstrate the value of shared representations for RE tasks by conducting experiments in which each data type is iteratively excluded during training. In addition, text and layout data are evaluated in isolation. While a bimodal text and layout approach performs best (F1 = 0.684), we show that text is the most important single predictor of entity relations. Additionally, layout geometry is highly predictive and may even be a feasible unimodal approach. Despite being less effective, we highlight circumstances where visual information can bolster performance. In total, our results demonstrate the efficacy of training joint representations for RE.
2

Harris, Teresa, and Miemsie Steyn. "Understanding Students’ Perspectives as Learners through Photovoice." In Academic Knowledge Construction and Multimodal Curriculum Development, 357–75. IGI Global, 2014. http://dx.doi.org/10.4018/978-1-4666-4797-8.ch022.

Abstract:
In this chapter, the authors explore photography as a participatory research tool that facilitates the interactions of participants and researchers as co-researchers to effect change. They illustrate this discussion with a study examining the perspectives of teacher education students regarding teaching practices and institutional structures. Photography offered participants a way to document experiences, and it became a community-based methodology that elicited narratives from the “participant as photographer” and the community of investigators.
3

Edge, Christi. "A Teacher Educator's Meaning-Making From a Hybrid “Online Teaching Fellows” Professional Learning Experience." In Handbook of Research on Virtual Training and Mentoring of Online Instructors, 76–109. IGI Global, 2019. http://dx.doi.org/10.4018/978-1-5225-6322-8.ch005.

Abstract:
This chapter describes a two-part, hybrid “Online Teaching Fellows” faculty development initiative and the tensions and transformations one faculty participant experienced. Case study and self-study research methodologies were utilized to systematically document and explore, from an insider's perspective, the lived experience of professional learning related to the design and delivery of online courses. This chapter identifies and describes tensions and transformations that contributed to professional learning and concludes with a discussion of how literacy practices in the design of frameworks for teaching and for learning may contribute to understanding how instructors read and make meaning from experiences in the context of professional learning. Implications extend Rosenblatt's transactional theory of reading and writing to multimodal online teaching and learning contexts.
4

Edge, Christi. "A Teacher Educator's Meaning-Making From a Hybrid “Online Teaching Fellows” Professional Learning Experience." In Research Anthology on Facilitating New Educational Practices Through Communities of Learning, 422–55. IGI Global, 2021. http://dx.doi.org/10.4018/978-1-7998-7294-8.ch023.

Abstract:
This chapter describes a two-part, hybrid “Online Teaching Fellows” faculty development initiative and the tensions and transformations one faculty participant experienced. Case study and self-study research methodologies were utilized to systematically document and explore, from an insider's perspective, the lived experience of professional learning related to the design and delivery of online courses. This chapter identifies and describes tensions and transformations that contributed to professional learning and concludes with a discussion of how literacy practices in the design of frameworks for teaching and for learning may contribute to understanding how instructors read and make meaning from experiences in the context of professional learning. Implications extend Rosenblatt's transactional theory of reading and writing to multimodal online teaching and learning contexts.
5

"Understanding page-based media." In The Structure of Multimodal Documents, 10–34. Routledge, 2015. http://dx.doi.org/10.4324/9781315740454-2.

6

G. Almeida-Návar, Saúl, Nexaí Reyes-Sampieri, Jose T. Morelos-Garcia, Jorge M. Antolinez-Motta, and Gabriel I. Herrejón-Galaviz. "Chronic Postoperative Pain." In Topics in Postoperative Pain [Working Title]. IntechOpen, 2023. http://dx.doi.org/10.5772/intechopen.111878.

Abstract:
Understanding the definition of pain has posed numerous challenges for pain practitioners. The phenomenon of pain experience is complicated to understand, and this construct goes beyond biomedical approaches. Persistent pain as a disease involves changes that include modified sensory feedback within the somatosensory system. It has been documented that anatomical restructuring in nociceptive integration and adaptations in nociceptive primary afferents and perception conduits are present in persistent pain situations. Chronic postoperative pain (CPOP) is known as a particular disorder, associated not only with specific nerve damage or with a unique inflammatory response but also with a mixture of both. The reported occurrence of CPOP varies substantially across the literature and depends on the kind of procedure. Reports indicate that 10 to 50% of patients undergoing common procedures had CPOP, and 2 to 10% of patients complained of severe pain. Systematic reviews have been performed in an attempt to identify the Holy Grail, but none showed sufficient evidence to guide CPOP treatment; multimodal approaches must be tested in large randomized controlled trials (RCTs) to provide robust evidence, as evidence-based management for CPOP is still lacking.
7

Hai-Jew, Shalin. "Exploiting Enriched Knowledge of Web Network Structures." In Enhancing Qualitative and Mixed Methods Research with Technology, 255–86. IGI Global, 2015. http://dx.doi.org/10.4018/978-1-4666-6493-7.ch011.

Abstract:
Understanding Web network structures may offer insights on various organizations and individuals. These structures are often latent and invisible without special software tools; the interrelationships between various websites may not be apparent with a surface perusal of the publicly accessible Web pages. Three publicly available tools may be “chained” (combined in sequence) in a data extraction sequence to enable visualization of various aspects of http network structures in an enriched way (with more detailed insights about the composition of such networks, given their heterogeneous and multimodal contents). Maltego Tungsten™, a penetration-testing tool, enables the mapping of Web networks, which are enriched with a variety of information: the technological understructure and tools used to build the network, some linked individuals (digital profiles), some linked documents, linked images, related emails, some related geographical data, and even the in-degree of the various nodes. NCapture with NVivo enables the extraction of public social media platform data and some basic analysis of these captures. The Network Overview, Discovery, and Exploration for Excel (NodeXL) tool enables the extraction of social media platform data and various evocative data visualizations and analyses. With the size of the Web growing exponentially and new domains (like .ventures, .guru, .education, .company, and others), the ability to map widely will offer a broad competitive advantage to those who would exploit this approach to enhance knowledge.

Conference papers on the topic "Multimodal document understanding"

1

Wang, Wenjin, Zhengjie Huang, Bin Luo, Qianglong Chen, Qiming Peng, Yinxu Pan, Weichong Yin, et al. "mmLayout: Multi-grained MultiModal Transformer for Document Understanding." In MM '22: The 30th ACM International Conference on Multimedia. New York, NY, USA: ACM, 2022. http://dx.doi.org/10.1145/3503161.3548406.

2

Gu, Zhangxuan, Changhua Meng, Ke Wang, Jun Lan, Weiqiang Wang, Ming Gu, and Liqing Zhang. "XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding." In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022. http://dx.doi.org/10.1109/cvpr52688.2022.00454.

3

Wang, Zilong, Mingjie Zhan, Xuebo Liu, and Ding Liang. "DocStruct: A Multimodal Method to Extract Hierarchy Structure in Document for General Form Understanding." In Findings of the Association for Computational Linguistics: EMNLP 2020. Stroudsburg, PA, USA: Association for Computational Linguistics, 2020. http://dx.doi.org/10.18653/v1/2020.findings-emnlp.80.

4

Dang, Xuan-Hong, Syed Yousaf Shah, and Petros Zerfos. "“The Squawk Bot”: Joint Learning of Time Series and Text Data Modalities for Automated Financial Information Filtering." In Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence (IJCAI-PRICAI-20). California: International Joint Conferences on Artificial Intelligence Organization, 2020. http://dx.doi.org/10.24963/ijcai.2020/634.

Abstract:
Multimodal analysis that incorporates time series and textual corpora as input data sources is becoming a promising approach, especially in the financial industry. However, the main focus of such analysis has been on achieving high prediction accuracy rather than on understanding the association between the two data modalities. In this work, we address the important problem of automatically discovering a small set of top news articles associated with a given time series. Towards this goal, we propose a novel multi-modal neural model called MSIN that jointly learns both the numerical time series and the categorical text articles in order to unearth the correlation between them. Through multiple steps of data interrelation between the two data modalities, MSIN learns to focus on a small subset of text articles that best align with the current performance in the time series. This succinct set is discovered in a timely manner and presented as recommended documents for the given time series, offering MSIN as an automated information filtering system. We empirically evaluate its performance on discovering daily top relevant news articles collected from Thomson Reuters for two given stock time series, AAPL and GOOG, over a period of seven consecutive years. The experimental results demonstrate that MSIN achieves up to 84.9% and 87.2%, respectively, in recalling the ground truth articles, superior to SOTA algorithms that rely on conventional attention mechanisms in deep learning.