A selection of scholarly literature on the topic "Multi-Modal representations"

Format your source according to APA, MLA, Chicago, Harvard, and other citation styles


Consult the lists of relevant articles, books, dissertations, conference papers, and other scholarly sources on the topic "Multi-Modal representations".

Next to every work in the list of references you will find an "Add to bibliography" button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the publication in .pdf format and read its abstract online, if the corresponding data are available in the metadata.

Journal articles on the topic "Multi-Modal representations":

1

Wu, Lianlong, Seewon Choi, Daniel Raggi, Aaron Stockdill, Grecia Garcia Garcia, Fiorenzo Colarusso, Peter C. H. Cheng, and Mateja Jamnik. "Generation of Visual Representations for Multi-Modal Mathematical Knowledge." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 21 (March 24, 2024): 23850–52. http://dx.doi.org/10.1609/aaai.v38i21.30586.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
In this paper, we introduce MaRE, a tool designed to generate representations in multiple modalities for a given mathematical problem while ensuring the correctness and interpretability of the transformations between different representations. The theoretical foundation for this tool is Representational Systems Theory (RST), a mathematical framework for studying the structure and transformations of representations. In MaRE’s web front-end user interface, a set of probability equations in Bayesian Notation can be rigorously transformed into Area Diagrams, Contingency Tables, and Probability Trees with just one click, utilising a back-end engine based on RST. At the same time, a table of the cognitive costs that a representation places on a particular user profile, based on the cognitive Representational Interpretive Structure Theory (RIST), is produced. MaRE is general and domain independent, applicable to other representations encoded in RST. It may enhance mathematical education and research, facilitating multi-modal knowledge representation and discovery.
2

Zhang, Yi, Mingyuan Chen, Jundong Shen, and Chongjun Wang. "Tailor Versatile Multi-Modal Learning for Multi-Label Emotion Recognition." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (June 28, 2022): 9100–9108. http://dx.doi.org/10.1609/aaai.v36i8.20895.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Multi-modal Multi-label Emotion Recognition (MMER) aims to identify various human emotions from heterogeneous visual, audio and text modalities. Previous methods mainly focus on projecting multiple modalities into a common latent space and learning an identical representation for all labels, which neglects the diversity of each modality and fails to capture richer semantic information for each label from different perspectives. Besides, the associated relationships of modalities and labels have not been fully exploited. In this paper, we propose versaTile multi-modAl learning for multI-labeL emOtion Recognition (TAILOR), aiming to refine multi-modal representations and enhance the discriminative capacity of each label. Specifically, we design an adversarial multi-modal refinement module to sufficiently explore the commonality among different modalities and strengthen the diversity of each modality. To further exploit label-modal dependence, we devise a BERT-like cross-modal encoder to gradually fuse private and common modality representations in a granularity descent way, as well as a label-guided decoder to adaptively generate a tailored representation for each label with the guidance of label semantics. In addition, we conduct experiments on the benchmark MMER dataset CMU-MOSEI in both aligned and unaligned settings, which demonstrate the superiority of TAILOR over state-of-the-art methods.
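For readers who want a concrete picture of the label-guided decoding idea described above, the following minimal PyTorch sketch (not the authors' TAILOR code; layer sizes, the toy inputs and the single-head classifier are illustrative assumptions) shows how label embeddings can attend over fused multi-modal token features so that each emotion label obtains its own tailored representation.

```python
import torch
import torch.nn as nn

class LabelGuidedDecoder(nn.Module):
    def __init__(self, d_model=256, n_labels=6, n_heads=4):
        super().__init__()
        self.label_emb = nn.Embedding(n_labels, d_model)      # label semantics used as queries
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, 1)               # one logit per label

    def forward(self, fused_tokens):                          # (B, T, d_model) fused modalities
        queries = self.label_emb.weight.unsqueeze(0).expand(fused_tokens.size(0), -1, -1)
        tailored, _ = self.attn(queries, fused_tokens, fused_tokens)
        return self.classifier(tailored).squeeze(-1)          # (B, n_labels) multi-label logits

fused = torch.randn(2, 20, 256)            # toy fused visual/audio/text token sequence
print(LabelGuidedDecoder()(fused).shape)   # torch.Size([2, 6])
```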
3

Zhang, Dong, Suzhong Wei, Shoushan Li, Hanqian Wu, Qiaoming Zhu, and Guodong Zhou. "Multi-modal Graph Fusion for Named Entity Recognition with Targeted Visual Guidance." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 16 (May 18, 2021): 14347–55. http://dx.doi.org/10.1609/aaai.v35i16.17687.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Multi-modal named entity recognition (MNER) aims to discover named entities in free text and classify them into pre-defined types with images. However, dominant MNER models do not fully exploit fine-grained semantic correspondences between semantic units of different modalities, which have the potential to refine multi-modal representation learning. To deal with this issue, we propose a unified multi-modal graph fusion (UMGF) approach for MNER. Specifically, we first represent the input sentence and image using a unified multi-modal graph, which captures various semantic relationships between multi-modal semantic units (words and visual objects). Then, we stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations. Finally, we achieve an attention-based multi-modal representation for each word and perform entity labeling with a CRF decoder. Experimentation on the two benchmark datasets demonstrates the superiority of our MNER model.
4

Liu, Hao, Jindong Han, Yanjie Fu, Jingbo Zhou, Xinjiang Lu, and Hui Xiong. "Multi-modal transportation recommendation with unified route representation learning." Proceedings of the VLDB Endowment 14, no. 3 (November 2020): 342–50. http://dx.doi.org/10.14778/3430915.3430924.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Multi-modal transportation recommendation aims to provide the most appropriate travel route with various transportation modes according to certain criteria. After analyzing large-scale navigation data, we find that route representations exhibit two patterns: spatio-temporal autocorrelations within transportation networks and the semantic coherence of route sequences. However, there are few studies that consider both patterns when developing multi-modal transportation systems. To this end, in this paper, we study multi-modal transportation recommendation with unified route representation learning by exploiting both spatio-temporal dependencies in transportation networks and the semantic coherence of historical routes. Specifically, we propose to unify both dynamic graph representation learning and hierarchical multi-task learning for multi-modal transportation recommendations. Along this line, we first transform the multi-modal transportation network into time-dependent multi-view transportation graphs and propose a spatiotemporal graph neural network module to capture the spatial and temporal autocorrelation. Then, we introduce a coherent-aware attentive route representation learning module to project arbitrary-length routes into fixed-length representation vectors, with explicit modeling of route coherence from historical routes. Moreover, we develop a hierarchical multi-task learning module to differentiate route representations for different transport modes, and this is guided by the final recommendation feedback as well as multiple auxiliary tasks equipped in different network layers. Extensive experimental results on two large-scale real-world datasets demonstrate that the proposed system outperforms eight baselines.
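As a rough illustration of one ingredient mentioned above, projecting arbitrary-length routes into fixed-length vectors, here is a hedged PyTorch sketch of attention pooling over route segments. The sizes, the toy routes and the mask construction are assumptions for the example, not the paper's actual module.

```python
import torch
import torch.nn as nn

class AttentiveRoutePooling(nn.Module):
    def __init__(self, d_in=64, d_out=128):
        super().__init__()
        self.proj = nn.Linear(d_in, d_out)
        self.score = nn.Linear(d_out, 1)                 # one attention score per route segment

    def forward(self, route, mask):                      # route: (B, L, d_in), mask: (B, L) bool
        h = torch.tanh(self.proj(route))
        a = self.score(h).squeeze(-1).masked_fill(~mask, float("-inf"))
        w = torch.softmax(a, dim=-1).unsqueeze(-1)       # padded segments receive zero weight
        return (w * h).sum(dim=1)                        # (B, d_out) fixed-length route vector

routes = torch.randn(3, 12, 64)                          # 3 routes, at most 12 segments each
lengths = torch.tensor([12, 7, 5])
mask = torch.arange(12)[None, :] < lengths[:, None]
print(AttentiveRoutePooling()(routes, mask).shape)       # torch.Size([3, 128])
```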
5

Wang, Huansha, Qinrang Liu, Ruiyang Huang, and Jianpeng Zhang. "Multi-Modal Entity Alignment Method Based on Feature Enhancement." Applied Sciences 13, no. 11 (June 1, 2023): 6747. http://dx.doi.org/10.3390/app13116747.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Multi-modal entity alignment refers to identifying equivalent entities between two different multi-modal knowledge graphs that consist of multi-modal information such as structural triples and descriptive images. Most previous multi-modal entity alignment methods have mainly used corresponding encoders of each modality to encode entity information and then perform feature fusion to obtain the multi-modal joint representation. However, this approach does not fully utilize the multi-modal information of aligned entities. To address this issue, we propose MEAFE, a multi-modal entity alignment method based on feature enhancement. The MEAFE adopts the multi-modal pre-trained model, OCR model, and GATv2 network to enhance the model’s ability to extract useful features in entity structure triplet information and image description, respectively, thereby generating more effective multi-modal representations. Secondly, it further adds modal distribution information of the entity to enhance the model’s understanding and modeling ability of the multi-modal information. Experiments on bilingual and cross-graph multi-modal datasets demonstrate that the proposed method outperforms models that use traditional feature extraction methods.
6

Wu, Tianxing, Chaoyu Gao, Lin Li, and Yuxiang Wang. "Leveraging Multi-Modal Information for Cross-Lingual Entity Matching across Knowledge Graphs." Applied Sciences 12, no. 19 (October 8, 2022): 10107. http://dx.doi.org/10.3390/app121910107.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
In recent years, the scale of knowledge graphs and the number of entities have grown rapidly. Entity matching across different knowledge graphs has become an urgent problem to be solved for knowledge fusion. With the importance of entity matching being increasingly evident, the use of representation learning technologies to find matched entities has attracted extensive attention due to the computability of vector representations. However, existing studies on representation learning technologies cannot make full use of knowledge graph relevant multi-modal information. In this paper, we propose a new cross-lingual entity matching method (called CLEM) with knowledge graph representation learning on rich multi-modal information. The core is the multi-view intact space learning method to integrate embeddings of multi-modal information for matching entities. Experimental results on cross-lingual datasets show the superiority and competitiveness of our proposed method.
7

Han, Ning, Jingjing Chen, Hao Zhang, Huanwen Wang, and Hao Chen. "Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval." ACM Transactions on Multimedia Computing, Communications, and Applications 18, no. 2 (May 31, 2022): 1–23. http://dx.doi.org/10.1145/3483381.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Cross-modal retrieval between texts and videos has received consistent research interest in the multimedia community. Existing studies follow a trend of learning a joint embedding space to measure the distance between text and video representations. In common practice, video representation is constructed by feeding clips into 3D convolutional neural networks for a coarse-grained global visual feature extraction. In addition, several studies have attempted to align the local objects of video with the text. However, these representations share a drawback of neglecting rich fine-grained relation features capturing spatial-temporal object interactions that benefits mapping textual entities in the real-world retrieval system. To tackle this problem, we propose an adversarial multi-grained embedding network (AME-Net), a novel cross-modal retrieval framework that adopts both fine-grained local relation and coarse-grained global features in bridging text-video modalities. Additionally, with the newly proposed visual representation, we also integrate an adversarial learning strategy into AME-Net, to further narrow the domain gap between text and video representations. In summary, we contribute AME-Net with an adversarial learning strategy for learning a better joint embedding space, and experimental results on MSR-VTT and YouCook2 datasets demonstrate that our proposed framework consistently outperforms the state-of-the-art method.
8

Ying, Qichao, Xiaoxiao Hu, Yangming Zhou, Zhenxing Qian, Dan Zeng, and Shiming Ge. "Bootstrapping Multi-View Representations for Fake News Detection." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 4 (June 26, 2023): 5384–92. http://dx.doi.org/10.1609/aaai.v37i4.25670.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Previous research on multimedia fake news detection has introduced a series of complex feature extraction and fusion networks to gather useful information from the news. However, how cross-modal consistency relates to the fidelity of news and how features from different modalities affect the decision-making are still open questions. This paper presents a novel scheme of Bootstrapping Multi-view Representations (BMR) for fake news detection. Given a multi-modal news item, we extract representations respectively from the views of the text, the image pattern and the image semantics. Improved Multi-gate Mixture-of-Expert networks (iMMoE) are proposed for feature refinement and fusion. Representations from each view are separately used to coarsely predict the fidelity of the whole news, and the multimodal representations are able to predict the cross-modal consistency. With the prediction scores, we reweigh each view of the representations and bootstrap them for fake news detection. Extensive experiments conducted on typical fake news detection datasets prove that BMR outperforms state-of-the-art schemes.
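The core reweighting step described in this abstract can be pictured with a short, hedged PyTorch sketch: each view produces a coarse fidelity score, and the scores gate how much each view contributes to the final decision. The gating form, sizes and toy inputs are illustrative assumptions rather than the paper's exact iMMoE design.

```python
import torch
import torch.nn as nn

class ViewReweighting(nn.Module):
    def __init__(self, d=256, n_views=3):
        super().__init__()
        self.scorers = nn.ModuleList(nn.Linear(d, 1) for _ in range(n_views))
        self.classifier = nn.Linear(d, 2)                  # real / fake logits

    def forward(self, views):                              # list of (B, d) view representations
        scores = [torch.sigmoid(s(v)) for s, v in zip(self.scorers, views)]
        weights = torch.softmax(torch.cat(scores, dim=-1), dim=-1)     # (B, n_views)
        fused = sum(w.unsqueeze(-1) * v for w, v in zip(weights.unbind(-1), views))
        return self.classifier(fused), scores              # final logits + per-view fidelity

views = [torch.randn(4, 256) for _ in range(3)]            # text, image pattern, image semantics
logits, per_view = ViewReweighting()(views)
print(logits.shape, len(per_view))                          # torch.Size([4, 2]) 3
```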
9

Huang, Yufeng, Jiji Tang, Zhuo Chen, Rongsheng Zhang, Xinfeng Zhang, Weijie Chen, Zeng Zhao, et al. "Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 3 (March 24, 2024): 2417–25. http://dx.doi.org/10.1609/aaai.v38i3.28017.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Large-scale vision-language pre-training has achieved significant performance in multi-modal understanding and generation tasks. However, existing methods often perform poorly on image-text matching tasks that require structured representations, i.e., representations of objects, attributes, and relations. The models cannot make a distinction between "An astronaut rides a horse" and "A horse rides an astronaut". This is because they fail to fully leverage structured knowledge when learning multi-modal representations. In this paper, we present an end-to-end framework, Structure-CLIP, which integrates Scene Graph Knowledge (SGK) to enhance multi-modal structured representations. Firstly, we use scene graphs to guide the construction of semantic negative examples, which results in an increased emphasis on learning structured representations. Moreover, a Knowledge-Enhanced Encoder (KEE) is proposed to leverage SGK as input to further enhance structured representations. To verify the effectiveness of the proposed framework, we pre-train our model with the aforementioned approaches and conduct experiments on downstream tasks. Experimental results demonstrate that Structure-CLIP achieves state-of-the-art (SOTA) performance on the VG-Attribution and VG-Relation datasets, surpassing the multi-modal SOTA model by 12.5% and 4.1%, respectively. Meanwhile, the results on MSCOCO indicate that Structure-CLIP significantly enhances structured representations while maintaining general representation ability. Our code is available at https://github.com/zjukg/Structure-CLIP.
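The scene-graph-guided negative sampling mentioned above can be sketched in a few lines. The snippet below is only an illustration of the idea (swapping subject and object yields a hard negative caption, and a margin loss prefers the true caption); the hash-based toy encoder is a stand-in assumption so the example runs without any pre-trained CLIP weights.

```python
import torch
import torch.nn.functional as F

def swap_subject_object(subject, relation, obj):
    return f"{obj} {relation} {subject}"                  # "a horse rides an astronaut"

def toy_encode(text, dim=64):                             # placeholder for a real text/image encoder
    g = torch.Generator().manual_seed(abs(hash(text)) % (2**31))
    return F.normalize(torch.randn(dim, generator=g), dim=0)

caption = "an astronaut rides a horse"
negative = swap_subject_object("an astronaut", "rides", "a horse")
image_emb = toy_encode("image: astronaut on horseback")   # pretend image embedding

pos_sim = image_emb @ toy_encode(caption)
neg_sim = image_emb @ toy_encode(negative)
loss = F.relu(0.2 - pos_sim + neg_sim)                    # margin ranking on the structured negative
print(negative, float(loss))
```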
10

van Tulder, Gijs, and Marleen de Bruijne. "Learning Cross-Modality Representations From Multi-Modal Images." IEEE Transactions on Medical Imaging 38, no. 2 (February 2019): 638–48. http://dx.doi.org/10.1109/tmi.2018.2868977.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles

Dissertations on the topic "Multi-Modal representations":

1

Gu, Jian. "Multi-modal Neural Representations for Semantic Code Search." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-279101.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
In recent decades, various software systems have gradually become the basis of our society. Programmers search for existing code snippets from time to time in their daily work. It would be beneficial and meaningful to have better solutions for the task of semantic code search, which is to find the most semantically relevant code snippets for a given query. Our approach is to introduce tree representations by multi-modal learning. The core idea is to enrich the semantic information of code snippets by preparing data of different modalities, and meanwhile ignore syntactic information. We design one novel tree structure named Simplified Semantic Tree and then extract RootPath representations from it. We utilize the RootPath representation to complement the conventional sequential representation, namely the token sequence of the code snippet. Our multi-modal model receives a code-query pair as input and computes a similarity score as output, following the pseudo-siamese architecture. For each pair, besides the ready-made code sequence and query sequence, we extract one extra tree sequence from the Simplified Semantic Tree. There are three encoders in our model, and they respectively encode these three sequences as vectors of the same length. Then we combine the code vector with the tree vector into one joint vector, which is still of the same length, as the multi-modal representation for the code snippet. We introduce triplet loss to ensure that the code and query vectors of the same pair are close in the shared vector space. We conduct experiments on one large-scale multi-language corpus, with comparisons against strong baseline models on specified performance metrics. Among the baseline models, the simplest Neural Bag-of-Words model shows the most satisfying performance. It indicates that syntactic information is likely to distract complex models from critical semantic information. Results show that our multi-modal representation approach performs better, surpassing baseline models by far in most cases. The key to our multi-modal model is that it focuses entirely on semantic information, and it learns from data of multiple modalities.
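To make the pseudo-siamese setup above more tangible, here is a hedged PyTorch sketch: three encoders map the code token sequence, the tree sequence and the query to vectors of the same length, the code and tree vectors are fused into one joint vector, and a triplet loss keeps matching code-query pairs close. The bag-of-embeddings encoders, vocabulary size and random token IDs are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeqEncoder(nn.Module):
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab, dim)               # mean of token embeddings

    def forward(self, ids):                                   # (B, L) integer token IDs
        return F.normalize(self.emb(ids), dim=-1)

code_enc, tree_enc, query_enc = SeqEncoder(), SeqEncoder(), SeqEncoder()
fuse = nn.Linear(256, 128)                                    # joint code+tree vector, same length

code = torch.randint(0, 1000, (8, 40))                        # toy code token sequences
tree = torch.randint(0, 1000, (8, 30))                        # toy tree sequences
query = torch.randint(0, 1000, (8, 12))                       # matching queries
neg_query = torch.randint(0, 1000, (8, 12))                   # non-matching queries

joint_code = fuse(torch.cat([code_enc(code), tree_enc(tree)], dim=-1))
loss = nn.TripletMarginLoss(margin=0.5)(joint_code, query_enc(query), query_enc(neg_query))
print(loss.item())
```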
2

Liu, Yahui. "Exploring Multi-Domain and Multi-Modal Representations for Unsupervised Image-to-Image Translation." Doctoral thesis, Università degli studi di Trento, 2022. http://hdl.handle.net/11572/342634.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Unsupervised image-to-image translation (UNIT) is a challenging task in the image manipulation field, where input images in a visual domain are mapped into another domain with desired visual patterns (also called styles). An ideal direction in this field is to build a model that can map an input image in a domain to multiple target domains and generate diverse outputs in each target domain, which is termed as multi-domain and multi-modal unsupervised image-to-image translation (MMUIT). Recent studies have shown remarkable results in UNIT but they suffer from four main limitations: (1) State-of-the-art UNIT methods are either built from several two-domain mappings that are required to be learned independently or they generate low-diversity results, a phenomenon also known as model collapse. (2) Most of the manipulation is with the assistance of visual maps or digital labels without exploring natural languages, which could be more scalable and flexible in practice. (3) In an MMUIT system, the style latent space is usually disentangled between every two image domains. While interpolations within domains are smooth, interpolations between two different domains often result in unrealistic images with artifacts when interpolating between two randomly sampled style representations from two different domains. Improving the smoothness of the style latent space can lead to gradual interpolations between any two style latent representations even between any two domains. (4) It is expensive to train MMUIT models from scratch at high resolution. Interpreting the latent space of pre-trained unconditional GANs can achieve pretty good image translations, especially high-quality synthesized images (e.g., 1024x1024 resolution). However, few works explore building an MMUIT system with such pre-trained GANs. In this thesis, we focus on these vital issues and propose several techniques for building better MMUIT systems. First, we build on the content-style disentangled framework and propose to fit the style latent space with Gaussian Mixture Models (GMMs). It allows a well-trained network using a shared disentangled style latent space to model multi-domain translations. Meanwhile, we can randomly sample different style representations from a Gaussian component or use a reference image for style transfer. Second, we show how the GMM-modeled latent style space can be combined with a language model (e.g., a simple LSTM network) to manipulate multiple styles by using textual commands. Then, we not only propose easy-to-use constraints to improve the smoothness of the style latent space in MMUIT models, but also design a novel metric to quantitatively evaluate the smoothness of the style latent space. Finally, we build a new model to use pretrained unconditional GANs to perform MMUIT tasks.
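A tiny numerical sketch may help with the first contribution above: fit the shared style latent space with a Gaussian Mixture Model, roughly one component per domain, and then sample a style code from the component of the desired target domain. The dimensions, the synthetic style codes and the diagonal covariance are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
styles = np.concatenate([rng.normal(-2.0, 0.5, (200, 8)),   # pretend style codes from domain 0
                         rng.normal(+2.0, 0.5, (200, 8))])  # pretend style codes from domain 1

gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0).fit(styles)

def sample_style(component, n=1):
    mean = gmm.means_[component]
    std = np.sqrt(gmm.covariances_[component])                # diagonal covariance -> per-dim std
    return rng.normal(mean, std, size=(n, mean.shape[0]))     # style codes for that domain

print(sample_style(component=1, n=3).shape)                    # (3, 8)
```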
3

Song, Pingfan. "Multi-modal image processing via joint sparse representations induced by coupled dictionaries." Thesis, University College London (University of London), 2018. http://discovery.ucl.ac.uk/10061963/.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Real-world image processing tasks often involve various image modalities captured by different sensors. However, given that different sensors exhibit different characteristics, such multi-modal images are typically acquired with different resolutions, different blurring kernels, or even noise levels. In view of the fact that images associated with the same scene share some attributes, such as edges, textures or other primitives, it is natural to ask whether one can improve standard image processing tasks by leveraging the availability of multimodal images. This thesis introduces a sparsity-based machine learning framework along with algorithms to address such multimodal image processing problems. In particular, the thesis introduces a new coupled dictionary learning framework that is able to capture complex relationships and disparities between different image types in a learned sparse-representation domain in lieu of the original image domain. The thesis then introduces representative applications of this framework in key multimodal image processing problems. First, the thesis considers multi-modal image super-resolution problems where one wishes to super-resolve a certain low-resolution image modality given the availability of another high-resolution image modality of the same scene. It develops both a coupled dictionary learning algorithm and a coupled super-resolution algorithm to address this task arising in [1,2]. Second, the thesis considers multi-modal image denoising problems where one wishes to denoise a certain noisy image modality given the availability of another less noisy image modality of the same scene. The thesis develops an online coupled dictionary learning algorithm and a coupled sparse denoising algorithm to address this task arising in [3,4]. Finally, the thesis considers emerging medical imaging applications where one wishes to perform multi-contrast MRI reconstruction, including guided reconstruction and joint reconstruction. We propose an iterative framework to implement coupled dictionary learning, coupled sparse denoising and k-space consistency to address this task arising in [5,6]. The proposed framework is capable of capturing complex dependencies, including both similarities and disparities among multi-modal data. This enables transferring appropriate guidance information to the target image without introducing noticeable texture-copying artifacts. Practical experiments on multi-modal images also demonstrate that the proposed framework contributes to significant performance improvement in various image processing tasks, such as multi-modal image super-resolution, denoising and multi-contrast MRI reconstruction.
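The coupled-dictionary idea at the heart of this thesis can be sketched with scikit-learn on synthetic data: learn one dictionary over concatenated modality-A/modality-B features so that both share a sparse code, then at test time sparse-code modality A alone and transfer the code to predict modality B. Everything below (sizes, the synthetic generative model, the OMP settings) is an illustrative assumption, not the thesis's algorithm.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, SparseCoder

rng = np.random.default_rng(0)
Z = rng.laplace(size=(500, 12))                       # shared latent causes of both modalities
A_basis, B_basis = rng.normal(size=(12, 16)), rng.normal(size=(12, 16))
X = np.hstack([Z @ A_basis, Z @ B_basis])             # paired modality-A / modality-B features

dl = DictionaryLearning(n_components=24, transform_algorithm="omp",
                        transform_n_nonzero_coefs=5, max_iter=20, random_state=0).fit(X)
D_a, D_b = dl.components_[:, :16], dl.components_[:, 16:]   # coupled sub-dictionaries

coder = SparseCoder(dictionary=D_a, transform_algorithm="omp", transform_n_nonzero_coefs=5)
codes = coder.transform(Z[:5] @ A_basis)              # sparse-code modality A only
predicted_B = codes @ D_b                              # transfer the shared code to modality B
print(predicted_B.shape)                               # (5, 16)
```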
4

Suthana, Nanthia Ananda. "Investigating human medial temporal representations of episodic information: a multi-modal approach." Diss., Restricted to subscribing institutions, 2009. http://proquest.umi.com/pqdweb?did=1905692921&sid=1&Fmt=2&clientId=1564&RQT=309&VName=PQD.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
5

Tran, Thi Quynh Nhi. "Robust and comprehensive joint image-text representations." Thesis, Paris, CNAM, 2017. http://www.theses.fr/2017CNAM1096/document.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
This thesis investigates the joint modeling of the visual and textual content of multimedia documents to address cross-modal problems. Such tasks require the ability to match information across modalities. A common representation space, obtained for example by Kernel Canonical Correlation Analysis, on which images and text can be both represented and directly compared, is a generally adopted solution. Nevertheless, such a joint space still suffers from several deficiencies that may hinder the performance of cross-modal tasks. An important contribution of this thesis is therefore to identify two major limitations of such a space. The first limitation concerns information that is poorly represented on the common space yet very significant for a retrieval task. The second limitation consists in a separation between modalities on the common space, which leads to coarse cross-modal matching. To deal with the first limitation concerning poorly-represented data, we put forward a model which first identifies such information and then finds ways to combine it with data that is relatively well-represented on the joint space. Evaluations on text illustration tasks show that by appropriately identifying and taking such information into account, the results of cross-modal retrieval can be strongly improved. The major work in this thesis aims to cope with the separation between modalities on the joint space to enhance the performance of cross-modal tasks. We propose two representation methods for bi-modal or uni-modal documents that aggregate information from both the visual and textual modalities projected on the joint space. Specifically, for uni-modal documents we suggest a completion process relying on an auxiliary dataset to find the corresponding information in the absent modality and then use such information to build a final bi-modal representation for a uni-modal document. Evaluations show that our approaches achieve state-of-the-art results on several standard and challenging datasets for cross-modal retrieval and for bi-modal and cross-modal classification.
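As a concrete, hedged illustration of the common-space idea the thesis builds on (and whose limitations it analyses), the sketch below uses plain Canonical Correlation Analysis from scikit-learn on synthetic image and text features: both modalities are projected into a shared space and compared with cosine similarity. The dimensions and the synthetic generative model are assumptions for the example only.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 5))                              # shared semantics
img = latent @ rng.normal(size=(5, 40)) + 0.1 * rng.normal(size=(300, 40))
txt = latent @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(300, 30))

cca = CCA(n_components=5).fit(img[:250], txt[:250])             # learn the joint space
img_c, txt_c = cca.transform(img[250:], txt[250:])              # project both modalities

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# A matching image-text pair should typically score higher than a mismatched one.
print(cosine(img_c[0], txt_c[0]), cosine(img_c[0], txt_c[1]))
```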
6

Tran, Thi Quynh Nhi. "Robust and comprehensive joint image-text representations." Electronic Thesis or Diss., Paris, CNAM, 2017. http://www.theses.fr/2017CNAM1096.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
This thesis investigates the joint modeling of the visual and textual content of multimedia documents to address cross-modal problems. Such tasks require the ability to match information across modalities. A common representation space, obtained for example by Kernel Canonical Correlation Analysis, on which images and text can be both represented and directly compared, is a generally adopted solution. Nevertheless, such a joint space still suffers from several deficiencies that may hinder the performance of cross-modal tasks. An important contribution of this thesis is therefore to identify two major limitations of such a space. The first limitation concerns information that is poorly represented on the common space yet very significant for a retrieval task. The second limitation consists in a separation between modalities on the common space, which leads to coarse cross-modal matching. To deal with the first limitation concerning poorly-represented data, we put forward a model which first identifies such information and then finds ways to combine it with data that is relatively well-represented on the joint space. Evaluations on text illustration tasks show that by appropriately identifying and taking such information into account, the results of cross-modal retrieval can be strongly improved. The major work in this thesis aims to cope with the separation between modalities on the joint space to enhance the performance of cross-modal tasks. We propose two representation methods for bi-modal or uni-modal documents that aggregate information from both the visual and textual modalities projected on the joint space. Specifically, for uni-modal documents we suggest a completion process relying on an auxiliary dataset to find the corresponding information in the absent modality and then use such information to build a final bi-modal representation for a uni-modal document. Evaluations show that our approaches achieve state-of-the-art results on several standard and challenging datasets for cross-modal retrieval and for bi-modal and cross-modal classification.
7

Ben-Younes, Hedi. "Multi-modal representation learning towards visual reasoning." Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS173.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
The quantity of images that populate the Internet is dramatically increasing. It becomes of critical importance to develop the technology for a precise and automatic understanding of visual contents. As image recognition systems are becoming more and more relevant, researchers in artificial intelligence now seek the next generation of vision systems that can perform high-level scene understanding. In this thesis, we are interested in Visual Question Answering (VQA), which consists in building models that answer any natural language question about any image. Because of its nature and complexity, VQA is often considered as a proxy for visual reasoning. Classically, VQA architectures are designed as trainable systems that are provided with images, questions about them and their answers. To tackle this problem, typical approaches involve modern Deep Learning (DL) techniques. In the first part, we focus on developing multi-modal fusion strategies to model the interactions between image and question representations. More specifically, we explore bilinear fusion models and exploit concepts from tensor analysis to provide tractable and expressive factorizations of parameters. These fusion mechanisms are studied under the widely used visual attention framework: the answer to the question is provided by focusing only on the relevant image regions. In the last part, we move away from the attention mechanism and build a more advanced scene understanding architecture where we consider objects and their spatial and semantic relations. All models are thoroughly experimentally evaluated on standard datasets and the results are competitive with the literature.
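The bilinear fusion with tensor factorization discussed above can be illustrated by a small, hedged PyTorch sketch of a low-rank bilinear layer (in the spirit of MLB/MUTAN-style factorizations): question and image features are projected to a rank-R space, multiplied elementwise, and projected to the output. The dimensions and the absence of the attention machinery are simplifying assumptions.

```python
import torch
import torch.nn as nn

class LowRankBilinearFusion(nn.Module):
    def __init__(self, d_q=300, d_v=512, rank=256, d_out=128):
        super().__init__()
        self.Wq, self.Wv = nn.Linear(d_q, rank), nn.Linear(d_v, rank)
        self.Wo = nn.Linear(rank, d_out)

    def forward(self, q, v):                      # question (B, d_q), image feature (B, d_v)
        return self.Wo(torch.tanh(self.Wq(q)) * torch.tanh(self.Wv(v)))   # low-rank bilinear

q, v = torch.randn(4, 300), torch.randn(4, 512)
print(LowRankBilinearFusion()(q, v).shape)         # torch.Size([4, 128])
```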
8

Li, Lin. "Multi-scale spectral embedding representation registration (MSERg) for multi-modal imaging registration." Case Western Reserve University School of Graduate Studies / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=case1467902012.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
9

Gay, Joanna. "Structural representation models for multi-modal image registration in biomedical applications." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-410820.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
In clinical applications it is often beneficial to use multiple imaging technologies to obtain information about different biomedical aspects of the subject under investigation, and to make best use of such sets of images they need to first be registered or aligned. Registration of multi-modal images is a challenging task and is currently the topic of much research, with new methods being published frequently. Structural representation models extract underlying features such as edges from images, distilling them into a common format that can be easily compared across different image modalities. This study compares the performance of two recent structural representation models on the task of aligning multi-modal biomedical images, specifically Second Harmonic Generation and Two Photon Excitation Fluorescence Microscopy images collected from skin samples. Performance is also evaluated on Brightfield Microscopy images. The two models evaluated here are PCANet-based Structural Representations (PSR, Zhu et al. (2018)) and Discriminative Local Derivative Patterns (dLDP, Jiang et al. (2017)). Mutual Information is used to provide a baseline for comparison. Although dLDP in particular gave promising results, worthy of further investigation, neither method outperformed the classic Mutual Information approach, as demonstrated in a series of experiments to register these particularly diverse modalities.
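Since Mutual Information serves as the baseline in this comparison, a small sketch of how MI between two registered images can be computed from their joint intensity histogram may be useful; the random arrays stand in for, say, an SHG and a TPEF patch, and the bin count is an arbitrary assumption.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()                                   # joint intensity distribution
    px, py = pxy.sum(axis=1, keepdims=True), pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
fixed = rng.random((128, 128))
moving = 0.7 * fixed + 0.3 * rng.random((128, 128))             # partially dependent "modality"
unrelated = rng.random((128, 128))
print(mutual_information(fixed, moving), mutual_information(fixed, unrelated))
```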
10

Aissa, Wafa. "Réseaux de modules neuronaux pour un raisonnement visuel compositionnel." Electronic Thesis or Diss., Paris, HESAM, 2023. http://www.theses.fr/2023HESAC033.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
The context of this PhD thesis is compositional visual reasoning. When presented with an image and a question pair, our objective is to have neural network models answer the question by following a reasoning chain defined by a program. We assess the model's reasoning ability through a Visual Question Answering (VQA) setup. Compositional VQA breaks down complex questions into modular, easier sub-problems. These sub-problems include reasoning skills such as object and attribute detection, relation detection, logical operations, counting, and comparisons. Each sub-problem is assigned to a different module. This approach discourages shortcuts, demanding an explicit understanding of the problem. It also promotes transparency and explainability. Neural module networks (NMN) are used to enable compositional reasoning. The approach is based on a generator-executor framework: the generator learns the translation of the question to its function program, and the executor instantiates a neural module network where each function is assigned to a specific module. We also design a neural modules catalog and define the function and the structure of each module. The training and evaluations are conducted using the pre-processed GQA dataset [3], which includes natural language questions, functional programs representing the reasoning chain, images, and corresponding answers. The research contributions revolve around the establishment of an NMN framework for the VQA task. One primary contribution involves the integration of vision and language pre-trained (VLP) representations into modular VQA. This integration serves as a "warm-start" mechanism for initializing the reasoning process. The experiments demonstrate that cross-modal vision and language representations outperform uni-modal ones. This utilization enables the capture of intricate relationships within each individual modality while also facilitating alignment between different modalities, consequently enhancing the overall accuracy of our NMN. Moreover, we explore various training techniques to enhance the learning process and improve cost-efficiency. In addition to optimizing the modules within the reasoning chain to collaboratively produce accurate answers, we introduce a teacher-guidance approach to optimize the intermediate modules in the reasoning chain. This ensures that these modules perform their specific reasoning sub-tasks without taking shortcuts or compromising the reasoning process's integrity. We propose and implement several teacher-guidance techniques, one of which draws inspiration from the teacher-forcing method commonly used in sequential models. Comparative analyses demonstrate the advantages of our teacher-guidance approach for NMNs, as detailed in our paper [1]. We also introduce a novel Curriculum Learning (CL) strategy tailored for NMNs to reorganize the training examples and define a start-small training strategy. We begin by learning simpler programs and progressively increase the complexity of the training programs. We use several difficulty criteria to define the CL approach. Our findings demonstrate that by selecting the appropriate CL method, we can significantly reduce the training cost and required training data, with only a limited impact on the final VQA accuracy. This significant contribution forms the core of our paper [2].
[1] W. Aissa, M. Ferecatu, and M. Crucianu. Curriculum learning for compositional visual reasoning. In Proceedings of VISIGRAPP 2023, Volume 5: VISAPP, 2023.
[2] W. Aissa, M. Ferecatu, and M. Crucianu. Multimodal representations for teacher-guided compositional visual reasoning. In Advanced Concepts for Intelligent Vision Systems, 21st International Conference (ACIVS 2023). Springer International Publishing, 2023.
[3] D. A. Hudson and C. D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. 2019.
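The generator-executor control flow described in the abstract above can be pictured with a toy, purely symbolic sketch: a (here hard-coded) program is executed by chaining one module per function over a small scene. Real neural module networks learn these modules over image features; the scene, the module set and the program below are illustrative assumptions.

```python
scene = [{"name": "horse", "color": "brown"}, {"name": "astronaut", "color": "white"}]

modules = {
    "filter_name":  lambda objs, arg: [o for o in objs if o["name"] == arg],
    "filter_color": lambda objs, arg: [o for o in objs if o["color"] == arg],
    "exist":        lambda objs, arg: "yes" if objs else "no",
    "count":        lambda objs, arg: str(len(objs)),
}

def execute(program, scene):
    state = scene
    for fn, arg in program:              # the output of each module feeds the next one
        state = modules[fn](state, arg)
    return state

program = [("filter_color", "brown"), ("filter_name", "horse"), ("exist", None)]
print(execute(program, scene))           # yes
```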

Books on the topic "Multi-Modal representations":

1

Po, Ming Jack. Multi-scale Representations for Classification of Protein Crystal Images and Multi-Modal Registration of the Lung. [New York, N.Y.?]: [publisher not identified], 2015.

Find the full text of the source
APA, Harvard, Vancouver, ISO, and other styles
2

Ali, Syed A., and Susan McRoy, eds. Representations for Multi-Modal Human-Computer Interaction: Papers from the AAAI Workshop (Technical Reports Vol. WS-98-09). AAAI Press, 1998.

Find the full text of the source
APA, Harvard, Vancouver, ISO, and other styles
3

Case, Julialicia, Eric Freeze, and Salvatore Pane. Story Mode. Bloomsbury Publishing Plc, 2024. http://dx.doi.org/10.5040/9781350301405.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
Against the backdrop of a hyper-competitive AAA industry and the perception that it is a world reserved for top programmers and hard-core ‘gamers’, Story Mode offers an accessible entry-point for all into writing and designing complex and emotionally affecting narrative video games. The first textbook to combine game design with creative writing techniques, this much-needed resource makes the skills necessary to consume and create digital and multi-modal stories attainable and fun. Appealing to the growing calls for greater inclusivity and access to this important contemporary apparatus of expression, this book offers low-cost, accessible tools and instruction that bridge the knowledge gap for creative writers, showing them how they can merge their skill-set with the fundamentals of game creation and empowering them to produce their own games which push stories beyond the page and the written word. Broken down into 4 sections to best orientate writers from any technological background to the strategies of game production, this book offers:
- Contextual and introductory chapters exploring the history and variety of various game genres.
- Discussions of how traditional creative writing approaches to character, plot, world-building and dialogue can be utilised in game writing.
- An in-depth overview of game studies concepts such as game construction, interactivity, audience engagement, empathy, real-world change and representation that orientate writers to approach games from the perspective of a designer.
- A whole section on the practical elements of work-shopping, tools, collaborative writing as well as extended exercises guiding readers through long-term, collaborative, game-centred projects using suites and tools like Twine, Audacity, Bitsy, and GameMaker.
Featuring detailed craft lessons, hands-on exercises and case studies, this is the ultimate guide for creative writers wanting to diversify into writing for interactive, digital and contemporary modes of storytelling. Designed not to lay out a roadmap to a successful career in the games industry but to empower writers to experiment in a medium previously regarded as exclusive, this book demystifies the process behind creating video games, orienting readers to a wide range of new possible forms and inspiring them to challenge mainstream notions of what video games can be and become.

Book chapters on the topic "Multi-Modal representations":

1

Wiesen, Aryeh, and Yaakov HaCohen-Kerner. "Overview of Uni-modal and Multi-modal Representations for Classification Tasks." In Natural Language Processing and Information Systems, 397–404. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-319-91947-8_41.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
2

Li, Cheng, Hui Sun, Zaiyi Liu, Meiyun Wang, Hairong Zheng, and Shanshan Wang. "Learning Cross-Modal Deep Representations for Multi-Modal MR Image Segmentation." In Lecture Notes in Computer Science, 57–65. Cham: Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-32245-8_7.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
3

Luo, Xi, Chunjie Cao, and Longjuan Wang. "Multi-modal Universal Embedding Representations for Language Understanding." In Communications in Computer and Information Science, 103–19. Singapore: Springer Singapore, 2022. http://dx.doi.org/10.1007/978-981-19-0523-0_7.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
4

Zhao, Xiang, Weixin Zeng, and Jiuyang Tang. "Multimodal Entity Alignment." In Entity Alignment, 229–47. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-4250-3_9.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
Abstract:
In various tasks related to artificial intelligence, data is often present in multiple forms or modalities. Recently, it has become a popular approach to combine these different forms of information into a knowledge graph, creating a multi-modal knowledge graph (MMKG). However, multi-modal knowledge graphs (MMKGs) often face issues of insufficient data coverage and incompleteness. In order to address this issue, a possible strategy is to incorporate supplemental information from other multi-modal knowledge graphs (MMKGs). To achieve this goal, current methods for aligning entities could be utilized; however, these approaches work within the Euclidean space, and the resulting entity representations can distort the hierarchical structure of the knowledge graph. Additionally, the potential benefits of visual information have not been fully utilized. To address these concerns, we present a new approach for aligning entities across multiple modalities, which we call hyperbolic multi-modal entity alignment. This method expands upon the conventional Euclidean representation by incorporating a hyperboloid manifold. Initially, we utilize hyperbolic graph convolutional networks to acquire structural representations of entities. In terms of visual data, we create image embeddings using a pre-trained model and subsequently map them into the hyperbolic space. Lastly, we merge the structural and visual representations within the hyperbolic space and utilize the combined embeddings to forecast potential entity alignment outcomes. Through a series of thorough experiments and ablation studies, we validate the efficacy of our proposed model and its individual components.
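To make the hyperboloid-manifold machinery above slightly more concrete, here is a hedged NumPy sketch of the basic operations such a method relies on: the exponential map at the origin places tangent vectors on the Lorentz manifold, and the Lorentzian distance compares the resulting embeddings. The fixed curvature of 1 and the toy vectors are simplifying assumptions, not the chapter's implementation.

```python
import numpy as np

def lorentz_inner(x, y):
    return -x[0] * y[0] + np.dot(x[1:], y[1:])      # Minkowski inner product

def expmap0(v):                                      # tangent vector (0, v) at the origin
    norm = np.linalg.norm(v)
    point = np.zeros(v.shape[0] + 1)
    point[0] = np.cosh(norm)
    if norm > 0:
        point[1:] = np.sinh(norm) * v / norm
    return point                                      # satisfies <p, p>_L = -1

def lorentz_distance(x, y):
    return float(np.arccosh(np.clip(-lorentz_inner(x, y), 1.0, None)))

e1 = expmap0(np.array([0.3, -0.1, 0.2]))
e2 = expmap0(np.array([0.31, -0.12, 0.2]))           # a nearby (well-aligned) entity
e3 = expmap0(np.array([-1.0, 2.0, 0.5]))             # a distant entity
print(lorentz_distance(e1, e2), lorentz_distance(e1, e3))
```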
5

Bae, Inhwan, Jin-Hwi Park, and Hae-Gon Jeon. "Learning Pedestrian Group Representations for Multi-modal Trajectory Prediction." In Lecture Notes in Computer Science, 270–89. Cham: Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-20047-2_16.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
6

Florea, Filip, Alexandrina Rogozan, Eugen Barbu, Abdelaziz Bensrhair, and Stefan Darmoni. "MedIC at ImageCLEF 2006: Automatic Image Categorization and Annotation Using Combined Visual Representations." In Evaluation of Multilingual and Multi-modal Information Retrieval, 670–77. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007. http://dx.doi.org/10.1007/978-3-540-74999-8_82.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
7

Qin, Chen, Bibo Shi, Rui Liao, Tommaso Mansi, Daniel Rueckert, and Ali Kamen. "Unsupervised Deformable Registration for Multi-modal Images via Disentangled Representations." In Lecture Notes in Computer Science, 249–61. Cham: Springer International Publishing, 2019. http://dx.doi.org/10.1007/978-3-030-20351-1_19.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
8

Ge, Hongkun, Guorong Wu, Li Wang, Yaozong Gao, and Dinggang Shen. "Hierarchical Multi-modal Image Registration by Learning Common Feature Representations." In Machine Learning in Medical Imaging, 203–11. Cham: Springer International Publishing, 2015. http://dx.doi.org/10.1007/978-3-319-24888-2_25.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
9

Dorent, Reuben, Nazim Haouchine, Fryderyk Kogl, Samuel Joutard, Parikshit Juvekar, Erickson Torio, Alexandra J. Golby, et al. "Unified Brain MR-Ultrasound Synthesis Using Multi-modal Hierarchical Representations." In Lecture Notes in Computer Science, 448–58. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-43999-5_43.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
10

Kasiri, Keyvan, Paul Fieguth, and David A. Clausi. "Structural Representations for Multi-modal Image Registration Based on Modified Entropy." In Lecture Notes in Computer Science, 82–89. Cham: Springer International Publishing, 2015. http://dx.doi.org/10.1007/978-3-319-20801-5_9.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles

Conference papers on the topic "Multi-Modal representations":

1

Zolfaghari, Mohammadreza, Yi Zhu, Peter Gehler, and Thomas Brox. "CrossCLR: Cross-modal Contrastive Learning For Multi-modal Video Representations." In 2021 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 2021. http://dx.doi.org/10.1109/iccv48922.2021.00148.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
2

Lee, O.-Joun, and Jin-Taek Kim. "Learning Multi-modal Representations of Narrative Multimedia." In RACS '20: International Conference on Research in Adaptive and Convergent Systems. New York, NY, USA: ACM, 2020. http://dx.doi.org/10.1145/3400286.3418216.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
3

Zhou, Xin, Hongyu Zhou, Yong Liu, Zhiwei Zeng, Chunyan Miao, Pengwei Wang, Yuan You, and Feijun Jiang. "Bootstrap Latent Representations for Multi-modal Recommendation." In WWW '23: The ACM Web Conference 2023. New York, NY, USA: ACM, 2023. http://dx.doi.org/10.1145/3543507.3583251.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
4

Vulić, Ivan, Douwe Kiela, Stephen Clark, and Marie-Francine Moens. "Multi-Modal Representations for Improved Bilingual Lexicon Learning." In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Stroudsburg, PA, USA: Association for Computational Linguistics, 2016. http://dx.doi.org/10.18653/v1/p16-2031.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
5

Wang, Kaiye, Wei Wang, and Liang Wang. "Learning unified sparse representations for multi-modal data." In 2015 IEEE International Conference on Image Processing (ICIP). IEEE, 2015. http://dx.doi.org/10.1109/icip.2015.7351464.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
6

Liu, Xinyi, Wanxian Guan, Lianyun Li, Hui Li, Chen Lin, Xubin Li, Si Chen, Jian Xu, Hongbo Deng, and Bo Zheng. "Pretraining Representations of Multi-modal Multi-query E-commerce Search." In KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2022. http://dx.doi.org/10.1145/3534678.3539200.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
7

Parfenova, Iuliia, Desmond Elliott, Raquel Fernández, and Sandro Pezzelle. "Probing Cross-Modal Representations in Multi-Step Relational Reasoning." In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021). Stroudsburg, PA, USA: Association for Computational Linguistics, 2021. http://dx.doi.org/10.18653/v1/2021.repl4nlp-1.16.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
8

Huang, Jia-Hong, Ting-Wei Wu, and Marcel Worring. "Contextualized Keyword Representations for Multi-modal Retinal Image Captioning." In ICMR '21: International Conference on Multimedia Retrieval. New York, NY, USA: ACM, 2021. http://dx.doi.org/10.1145/3460426.3463667.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
9

Grossiord, Eloise, Laurent Risser, Salim Kanoun, Soleakhena Ken, and Francois Malgouyres. "Learning Optimal Shape Representations for Multi-Modal Image Registration." In 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI). IEEE, 2020. http://dx.doi.org/10.1109/isbi45749.2020.9098631.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
10

Lara, Bruno, and Juan M. Rendon. "Prediction of Undesired Situations Based on Multi-Modal Representations." In 2006 Electronics, Robotics and Automotive Mechanics Conference. IEEE, 2006. http://dx.doi.org/10.1109/cerma.2006.75.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
