Thematische Bibliographien / Multimodal embedding space

Auswahl der wissenschaftlichen Literatur zum Thema „Multimodal embedding space“

Autor: Grafiati

Veröffentlicht am 25. Mai 2024

Geben Sie eine Quelle nach APA, MLA, Chicago, Harvard und anderen Zitierweisen an

Wählen Sie eine Art der Quelle aus:

Inhaltsverzeichnis

Zeitschriftenartikel
Dissertationen
Buchteile
Konferenzberichte

Machen Sie sich mit den Listen der aktuellen Artikel, Bücher, Dissertationen, Berichten und anderer wissenschaftlichen Quellen zum Thema "Multimodal embedding space" bekannt.

Neben jedem Werk im Literaturverzeichnis ist die Option "Zur Bibliographie hinzufügen" verfügbar. Nutzen Sie sie, wird Ihre bibliographische Angabe des gewählten Werkes nach der nötigen Zitierweise (APA, MLA, Harvard, Chicago, Vancouver usw.) automatisch gestaltet.

Sie können auch den vollen Text der wissenschaftlichen Publikation im PDF-Format herunterladen und eine Online-Annotation der Arbeit lesen, wenn die relevanten Parameter in den Metadaten verfügbar sind.

Zeitschriftenartikel zum Thema "Multimodal embedding space"

Tyshchuk, Kirill, Polina Karpikova, Andrew Spiridonov, Anastasiia Prutianova, Anton Razzhigaev und Alexander Panchenko. „On Isotropy of Multimodal Embeddings“. Information 14, Nr. 7 (10.07.2023): 392. http://dx.doi.org/10.3390/info14070392.

Der volle Inhalt der Quelle

Annotation:

Embeddings, i.e., vector representations of objects, such as texts, images, or graphs, play a key role in deep learning methodologies nowadays. Prior research has shown the importance of analyzing the isotropy of textual embeddings for transformer-based text encoders, such as the BERT model. Anisotropic word embeddings do not use the entire space, instead concentrating on a narrow cone in such a pretrained vector space, negatively affecting the performance of applications, such as textual semantic similarity. Transforming a vector space to optimize isotropy has been shown to be beneficial for improving performance in text processing tasks. This paper is the first comprehensive investigation of the distribution of multimodal embeddings using the example of OpenAI’s CLIP pretrained model. We aimed to deepen the understanding of the embedding space of multimodal embeddings, which has previously been unexplored in this respect, and study the impact on various end tasks. Our initial efforts were focused on measuring the alignment of image and text embedding distributions, with an emphasis on their isotropic properties. In addition, we evaluated several gradient-free approaches to enhance these properties, establishing their efficiency in improving the isotropy/alignment of the embeddings and, in certain cases, the zero-shot classification accuracy. Significantly, our analysis revealed that both CLIP and BERT models yielded embeddings situated within a cone immediately after initialization and preceding training. However, they were mostly isotropic in the local sense. We further extended our investigation to the structure of multilingual CLIP text embeddings, confirming that the observed characteristics were language-independent. By computing the few-shot classification accuracy and point-cloud metrics, we provide evidence of a strong correlation among multilingual embeddings. Embeddings transformation using the methods described in this article makes it easier to visualize embeddings. At the same time, multiple experiments that we conducted showed that, in regard to the transformed embeddings, the downstream tasks performance does not drop substantially (and sometimes is even improved). This means that one could obtain an easily visualizable embedding space, without substantially losing the quality of downstream tasks.

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Mai, Sijie, Haifeng Hu und Songlong Xing. „Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion“. Proceedings of the AAAI Conference on Artificial Intelligence 34, Nr. 01 (03.04.2020): 164–72. http://dx.doi.org/10.1609/aaai.v34i01.5347.

Der volle Inhalt der Quelle

Annotation:

Learning joint embedding space for various modalities is of vital importance for multimodal fusion. Mainstream modality fusion approaches fail to achieve this goal, leaving a modality gap which heavily affects cross-modal fusion. In this paper, we propose a novel adversarial encoder-decoder-classifier framework to learn a modality-invariant embedding space. Since the distributions of various modalities vary in nature, to reduce the modality gap, we translate the distributions of source modalities into that of target modality via their respective encoders using adversarial training. Furthermore, we exert additional constraints on embedding space by introducing reconstruction loss and classification loss. Then we fuse the encoded representations using hierarchical graph neural network which explicitly explores unimodal, bimodal and trimodal interactions in multi-stage. Our method achieves state-of-the-art performance on multiple datasets. Visualization of the learned embeddings suggests that the joint embedding space learned by our method is discriminative.

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Zhang, Linhai, Deyu Zhou, Yulan He und Zeng Yang. „MERL: Multimodal Event Representation Learning in Heterogeneous Embedding Spaces“. Proceedings of the AAAI Conference on Artificial Intelligence 35, Nr. 16 (18.05.2021): 14420–27. http://dx.doi.org/10.1609/aaai.v35i16.17695.

Der volle Inhalt der Quelle

Annotation:

Previous work has shown the effectiveness of using event representations for tasks such as script event prediction and stock market prediction. It is however still challenging to learn the subtle semantic differences between events based solely on textual descriptions of events often represented as (subject, predicate, object) triples. As an alternative, images offer a more intuitive way of understanding event semantics. We observe that event described in text and in images show different abstraction levels and therefore should be projected onto heterogeneous embedding spaces, as opposed to what have been done in previous approaches which project signals from different modalities onto a homogeneous space. In this paper, we propose a Multimodal Event Representation Learning framework (MERL) to learn event representations based on both text and image modalities simultaneously. Event textual triples are projected as Gaussian density embeddings by a dual-path Gaussian triple encoder, while event images are projected as point embeddings by a visual event component-aware image encoder. Moreover, a novel score function motivated by statistical hypothesis testing is introduced to coordinate two embedding spaces. Experiments are conducted on various multimodal event-related tasks and results show that MERL outperforms a number of unimodal and multimodal baselines, demonstrating the effectiveness of the proposed framework.

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Guo, Zhiqiang, Jianjun Li, Guohui Li, Chaoyang Wang, Si Shi und Bin Ruan. „LGMRec: Local and Global Graph Learning for Multimodal Recommendation“. Proceedings of the AAAI Conference on Artificial Intelligence 38, Nr. 8 (24.03.2024): 8454–62. http://dx.doi.org/10.1609/aaai.v38i8.28688.

Der volle Inhalt der Quelle

Annotation:

The multimodal recommendation has gradually become the infrastructure of online media platforms, enabling them to provide personalized service to users through a joint modeling of user historical behaviors (e.g., purchases, clicks) and item various modalities (e.g., visual and textual). The majority of existing studies typically focus on utilizing modal features or modal-related graph structure to learn user local interests. Nevertheless, these approaches encounter two limitations: (1) Shared updates of user ID embeddings result in the consequential coupling between collaboration and multimodal signals; (2) Lack of exploration into robust global user interests to alleviate the sparse interaction problems faced by local interest modeling. To address these issues, we propose a novel Local and Global Graph Learning-guided Multimodal Recommender (LGMRec), which jointly models local and global user interests. Specifically, we present a local graph embedding module to independently learn collaborative-related and modality-related embeddings of users and items with local topological relations. Moreover, a global hypergraph embedding module is designed to capture global user and item embeddings by modeling insightful global dependency relations. The global embeddings acquired within the hypergraph embedding space can then be combined with two decoupled local embeddings to improve the accuracy and robustness of recommendations. Extensive experiments conducted on three benchmark datasets demonstrate the superiority of our LGMRec over various state-of-the-art recommendation baselines, showcasing its effectiveness in modeling both local and global user interests.

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Moon, Jucheol, Nhat Anh Le, Nelson Hebert Minaya und Sang-Il Choi. „Multimodal Few-Shot Learning for Gait Recognition“. Applied Sciences 10, Nr. 21 (29.10.2020): 7619. http://dx.doi.org/10.3390/app10217619.

Der volle Inhalt der Quelle

Annotation:

A person’s gait is a behavioral trait that is uniquely associated with each individual and can be used to recognize the person. As information about the human gait can be captured by wearable devices, a few studies have led to the proposal of methods to process gait information for identification purposes. Despite recent advances in gait recognition, an open set gait recognition problem presents challenges to current approaches. To address the open set gait recognition problem, a system should be able to deal with unseen subjects who have not included in the training dataset. In this paper, we propose a system that learns a mapping from a multimodal time series collected using insole to a latent (embedding vector) space to address the open set gait recognition problem. The distance between two embedding vectors in the latent space corresponds to the similarity between two multimodal time series. Using the characteristics of the human gait pattern, multimodal time series are sliced into unit steps. The system maps unit steps to embedding vectors using an ensemble consisting of a convolutional neural network and a recurrent neural network. To recognize each individual, the system learns a decision function using a one-class support vector machine from a few embedding vectors of the person in the latent space, then the system determines whether an unknown unit step is recognized as belonging to a known individual. Our experiments demonstrate that the proposed framework recognizes individuals with high accuracy regardless they have been registered or not. If we could have an environment in which all people would be wearing the insole, the framework would be used for user verification widely.

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Zhang, Rongchao, Yiwei Lou, Dexuan Xu, Yongzhi Cao, Hanpin Wang und Yu Huang. „A Learnable Discrete-Prior Fusion Autoencoder with Contrastive Learning for Tabular Data Synthesis“. Proceedings of the AAAI Conference on Artificial Intelligence 38, Nr. 15 (24.03.2024): 16803–11. http://dx.doi.org/10.1609/aaai.v38i15.29621.

Der volle Inhalt der Quelle

Annotation:

The actual collection of tabular data for sharing involves confidentiality and privacy constraints, leaving the potential risks of machine learning for interventional data analysis unsafely averted. Synthetic data has emerged recently as a privacy-protecting solution to address this challenge. However, existing approaches regard discrete and continuous modal features as separate entities, thus falling short in properly capturing their inherent correlations. In this paper, we propose a novel contrastive learning guided Gaussian Transformer autoencoder, termed GTCoder, to synthesize photo-realistic multimodal tabular data for scientific research. Our approach introduces a transformer-based fusion module that seamlessly integrates multimodal features, permitting for mining more informative latent representations. The attention within the fusion module directs the integrated output features to focus on critical components that facilitate the task of generating latent embeddings. Moreover, we formulate a contrastive learning strategy to implicitly constrain the embeddings from discrete features in the latent feature space by encouraging the similar discrete feature distributions closer while pushing the dissimilar further away, in order to better enhance the representation of the latent embedding. Experimental results indicate that GTCoder is effective to generate photo-realistic synthetic data, with interactive interpretation of latent embedding, and performs favorably against some baselines on most real-world and simulated datasets.

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Merkx, Danny, und Stefan L. Frank. „Learning semantic sentence representations from visually grounded language without lexical knowledge“. Natural Language Engineering 25, Nr. 4 (Juli 2019): 451–66. http://dx.doi.org/10.1017/s1351324919000196.

Der volle Inhalt der Quelle

Annotation:

AbstractCurrent approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence level semantics without the need for word embeddings. We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings. Deep Neural Networks are trained to map the two modalities to a common embedding space such that for an image the corresponding caption can be retrieved and vice versa. We show that our model achieves results comparable to the current state of the art on two popular image-caption retrieval benchmark datasets: Microsoft Common Objects in Context (MSCOCO) and Flickr8k. We evaluate the semantic content of the resulting sentence embeddings using the data from the Semantic Textual Similarity (STS) benchmark task and show that the multimodal embeddings correlate well with human semantic similarity judgements. The system achieves state-of-the-art results on several of these benchmarks, which shows that a system trained solely on multimodal data, without assuming any word representations, is able to capture sentence level semantics. Importantly, this result shows that we do not need prior knowledge of lexical level semantics in order to model sentence level semantics. These findings demonstrate the importance of visual information in semantics.

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Fan, Yunpeng, Wenyou Du, Yingwei Zhang und Xiaogang Wang. „Fault Detection for Multimodal Process Using Quality-Relevant Kernel Neighborhood Preserving Embedding“. Mathematical Problems in Engineering 2015 (2015): 1–15. http://dx.doi.org/10.1155/2015/210125.

Der volle Inhalt der Quelle

Annotation:

A new method named quality-relevant kernel neighborhood preserving embedding (QKNPE) has been proposed. Quality variables have been considered for the first time in kernel neighborhood preserving embedding (KNPE) method for monitoring multimodal process. In summary, the whole algorithm is a two-step process: first, to improve manifold structure and to deal with multimodal nonlinearity problem, the neighborhood preserving embedding technique is introduced; and second to monitoring the complete production process, the product quality variables are added in the objective function. Compared with the conventional monitoring method, the proposed method has the following advantages: (1) the hidden manifold which related to the character of industrial process has been embedded to a low dimensional space and the identifying information of the different mode of the monitored system has been extracted; (2) the product quality as an important factor has been considered for the first time in manifold method. In the experiment section, we applied this method to electrofused magnesia furnace (EFMF) process, which is a representative case study. The experimental results show the effectiveness of the proposed method.

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Ota, Kosuke, Keiichiro Shirai, Hidetoshi Miyao und Minoru Maruyama. „Multimodal Analogy-Based Image Retrieval by Improving Semantic Embeddings“. Journal of Advanced Computational Intelligence and Intelligent Informatics 26, Nr. 6 (20.11.2022): 995–1003. http://dx.doi.org/10.20965/jaciii.2022.p0995.

Der volle Inhalt der Quelle

Annotation:

In this work, we study the application of multimodal analogical reasoning to image retrieval. Multimodal analogy questions are given in a form of tuples of words and images, e.g., “cat”:“dog”::[an image of a cat sitting on a bench]:?, to search for an image of a dog sitting on a bench. Retrieving desired images given these tuples can be seen as a task of finding images whose relation between the query image is close to that of query words. One way to achieve the task is building a common vector space that exhibits analogical regularities. To learn such an embedding, we propose a quadruple neural network called multimodal siamese network. The network consists of recurrent neural networks and convolutional neural networks based on the siamese architecture. We also introduce an effective procedure to generate analogy examples from an image-caption dataset for training of our network. In our experiments, we test our model on analogy-based image retrieval tasks. The results show that our method outperforms the previous work in qualitative evaluation.

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Kim, Jongseok, Youngjae Yu, Hoeseong Kim und Gunhee Kim. „Dual Compositional Learning in Interactive Image Retrieval“. Proceedings of the AAAI Conference on Artificial Intelligence 35, Nr. 2 (18.05.2021): 1771–79. http://dx.doi.org/10.1609/aaai.v35i2.16271.

Der volle Inhalt der Quelle

Annotation:

We present an approach named Dual Composition Network (DCNet) for interactive image retrieval that searches for the best target image for a natural language query and a reference image. To accomplish this task, existing methods have focused on learning a composite representation of the reference image and the text query to be as close to the embedding of the target image as possible. We refer this approach as Composition Network. In this work, we propose to close the loop with Correction Network that models the difference between the reference and target image in the embedding space and matches it with the embedding of the text query. That is, we consider two cyclic directional mappings for triplets of (reference image, text query, target image) by using both Composition Network and Correction Network. We also propose a joint training loss that can further improve the robustness of multimodal representation learning. We evaluate the proposed model on three benchmark datasets for multimodal retrieval: Fashion-IQ, Shoes, and Fashion200K. Our experiments show that our DCNet achieves new state-of-the-art performance on all three datasets, and the addition of Correction Network consistently improves multiple existing methods that are solely based on Composition Network. Moreover, an ensemble of our model won the first place in Fashion-IQ 2020 challenge held in a CVPR 2020 workshop.

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Mehr Quellen

Dissertationen zum Thema "Multimodal embedding space"

Couairon, Guillaume. „Text-Based Semantic Image Editing“. Electronic Thesis or Diss., Sorbonne université, 2023. http://www.theses.fr/2023SORUS248.

Der volle Inhalt der Quelle

Annotation:

L’objectif de cette thèse est de proposer des algorithmes pour la tâche d’édition d’images basée sur le texte (TIE), qui consiste à éditer des images numériques selon une instruction formulée en langage naturel. Par exemple, étant donné une image d’un chien et la requête "Changez le chien en un chat", nous voulons produire une nouvelle image où le chien a été remplacé par un chat, en gardant tous les autres aspects de l’image inchangés (couleur et pose de l’animal, arrière- plan). L’objectif de l’étoile du nord est de permettre à tout un chacun de modifier ses images en utilisant uniquement des requêtes en langage naturel. Une des spécificités de l’édition d’images basée sur du texte est qu’il n’y a pratiquement pas de données d’entraînement pour former un algorithme supervisé. Dans cette thèse, nous proposons différentes solutions pour l’édition d’images, basées sur l’adaptation de grands modèles multimodaux entraînés sur d’énormes ensembles de données. Nous étudions tout d’abord une configuration d’édition simplifiée, appelée édition d’image basée sur la recherche, qui ne nécessite pas de modifier directement l’image d’entrée. Au lieu de cela, étant donné l’image et la requête de modification, nous recherchons dans une grande base de données une image qui correspond à la modification demandée. Nous nous appuyons sur des modèles multimodaux d’alignement image/texte entraînés sur des ensembles de données à l’échelle du web (comme CLIP) pour effectuer de telles transformations sans aucun exemple. Nous proposons également le cadre SIMAT pour évaluer l’édition d’images basée sur la recherche. Nous étudions ensuite comment modifier directement l’image d’entrée. Nous proposons FlexIT, une méthode qui modifie itérativement l’image d’entrée jus- qu’à ce qu’elle satisfasse un "objectif d’édition" abstrait défini dans un espace d’intégration multimodal. Nous introduisons des termes de régularisation pour imposer des transformations réalistes. Ensuite, nous nous concentrons sur les modèles de diffusion, qui sont des modèles génératifs puissants capables de synthétiser de nouvelles images conditionnées par une grande variété d’invites textuelles. Nous démontrons leur polyvalence en proposant DiffEdit, un algorithme qui adapte les modèles de diffusion pour l’édition d’images sans réglage fin. Nous proposons une stratégie "zero-shot" pour trouver automatiquement où l’image initiale doit être modifiée pour satisfaire la requête de transformation de texte
The aim of this thesis is to propose algorithms for the task of Text-based Image Editing (TIE), which consists in editing digital images according to an instruction formulated in natural language. For instance, given an image of a dog, and the query "Change the dog into a cat", we want to produce a novel image where the dog has been replaced by a cat, keeping all other image aspects unchanged (animal color and pose, background). The north-star goal is to enable anyone to edit their images using only queries in natural language. One specificity of text-based image editing is that there is practically no training data to train a supervised algorithm. In this thesis, we propose different solutions for editing images, based on the adaptation of large multimodal models trained on huge datasets. We first study a simplified editing setup, named Retrieval-based image edit- ing, which does not require to directly modify the input image. Instead, given the image and modification query, we search in a large database an image that corresponds to the requested edit. We leverage multimodal image/text alignment models trained on web-scale datasets (like CLIP) to perform such transformations without any examples. We also propose the SIMAT framework for evaluating retrieval-based image editing. We then study how to directly modify the input image. We propose FlexIT, a method which iteratively changes the input image until it satisfies an abstract "editing objective" defined in a multimodal embedding space. We introduce a variety of regularization terms to enforce realistic transformations. Next, we focus on diffusion models, which are powerful generative models able to synthetize novel images conditioned on a wide variety of textual prompts. We demonstrate their versatility by proposing DiffEdit, an algorithm which adapts diffusion models for image editing without finetuning. We propose a zero-shot strategy for finding automatically where the initial image should be changed to satisfy the text transformation query. Finally, we study a specific challenge useful in the context of image editing: how to synthetize a novel image by giving as constraint a spatial layout of objects with textual descriptions, a task which is known as Semantic Image Synthesis. We adopt the same strategy, consisting in adapting diffusion models to solve the task without any example. We propose the ZestGuide algorithm, which leverages the spatio-semantic information encoded in the attention layers of diffusion models

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Buchteile zum Thema "Multimodal embedding space"

Zhang, Chao, und Jiawei Han. „Data Mining and Knowledge Discovery“. In Urban Informatics, 797–814. Singapore: Springer Singapore, 2021. http://dx.doi.org/10.1007/978-981-15-8983-6_42.

Der volle Inhalt der Quelle

Annotation:

AbstractOur physical world is being projected into online cyberspace at an unprecedented rate. People nowadays visit different places and leave behind them million-scale digital traces such as tweets, check-ins, Yelp reviews, and Uber trajectories. Such digital data are a result of social sensing: namely people act as human sensors that probe different places in the physical world and share their activities online. The availability of massive social-sensing data provides a unique opportunity for understanding urban space in a data-driven manner and improving many urban computing applications, ranging from urban planning and traffic scheduling to disaster control and trip planning. In this chapter, we present recent developments in data-mining techniques for urban activity modeling, a fundamental task for extracting useful urban knowledge from social-sensing data. We first describe traditional approaches to urban activity modeling, including pattern discovery methods and statistical models. Then, we present the latest developments in multimodal embedding techniques for this task, which learns vector representations for different modalities to model people's spatiotemporal activities. We study the empirical performance of these methods and demonstrate how data-mining techniques can be successfully applied to social-sensing data to extract actionable knowledge and facilitate downstream applications.

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Zhao, Xiang, Weixin Zeng und Jiuyang Tang. „Multimodal Entity Alignment“. In Entity Alignment, 229–47. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-4250-3_9.

Der volle Inhalt der Quelle

Annotation:

AbstractIn various tasks related to artificial intelligence, data is often present in multiple forms or modalities. Recently, it has become a popular approach to combine these different forms of information into a knowledge graph, creating a multi-modal knowledge graph (MMKG). However, multi-modal knowledge graphs (MMKGs) often face issues of insufficient data coverage and incompleteness. In order to address this issue, a possible strategy is to incorporate supplemental information from other multi-modal knowledge graphs (MMKGs). To achieve this goal, current methods for aligning entities could be utilized; however, these approaches work within the Euclidean space, and the resulting entity representations can distort the hierarchical structure of the knowledge graph. Additionally, the potential benefits of visual information have not been fully utilized.To address these concerns, we present a new approach for aligning entities across multiple modalities, which we call hyperbolic multi-modal entity alignment (). This method expands upon the conventional Euclidean representation by incorporating a hyperboloid manifold. Initially, we utilize hyperbolic graph convolutional networks() to acquire structural representations of entities. In terms of visual data, we create image embeddings using the model and subsequently map them into the hyperbolic space utilizing . Lastly, we merge the structural and visual representations within the hyperbolic space and utilize the combined embeddings to forecast potential entity alignment outcomes. Through a series of thorough experiments and ablation studies, we validate the efficacy of our proposed model and its individual components.

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Valles-Perez, Ivan, Grzegorz Beringer, Piotr Bilinski, Gary Cook und Roberto Barra-Chicote. „SCRAPS: Speech Contrastive Representations of Acoustic and Phonetic Spaces“. In Frontiers in Artificial Intelligence and Applications. IOS Press, 2023. http://dx.doi.org/10.3233/faia230540.

Der volle Inhalt der Quelle

Annotation:

Numerous examples in the literature proved that deep learning models have the ability to work well with multimodal data. Recently, CLIP has enabled deep learning systems to learn shared latent spaces between images and text descriptions, with outstanding zero- or few-shot results in downstream tasks. In this paper we explore the same idea proposed by CLIP but applied to the speech domain, where the phonetic and acoustic spaces usually coexist. We train a CLIP-based model with the aim to learn shared representations of phonetic and acoustic spaces. The results show that the proposed model is sensible to phonetic changes, with a 91% of score drops when replacing 20% of the phonemes at random, while providing substantial robustness against different kinds of noise, with a 10% performance drop when mixing the audio with 75% of Gaussian noise. We also provide empirical evidence showing that the resulting embeddings are useful for a variety of downstream applications, such as intelligibility evaluation and the ability to leverage rich pre-trained phonetic embeddings in speech generation task. Finally, we discuss potential applications with interesting implications for the speech generation and recognition fields.

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Konferenzberichte zum Thema "Multimodal embedding space"

Bhattacharya, Indrani, Arkabandhu Chowdhury und Vikas C. Raykar. „Multimodal Dialog for Browsing Large Visual Catalogs using Exploration-Exploitation Paradigm in a Joint Embedding Space“. In ICMR '19: International Conference on Multimedia Retrieval. New York, NY, USA: ACM, 2019. http://dx.doi.org/10.1145/3323873.3325036.

Der volle Inhalt der Quelle

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Rostami, Mohammad, und Aram Galstyan. „Cognitively Inspired Learning of Incremental Drifting Concepts“. In Thirty-Second International Joint Conference on Artificial Intelligence {IJCAI-23}. California: International Joint Conferences on Artificial Intelligence Organization, 2023. http://dx.doi.org/10.24963/ijcai.2023/341.

Der volle Inhalt der Quelle

Annotation:

Humans continually expand their learned knowledge to new domains and learn new concepts without any interference with past learned experiences. In contrast, machine learning models perform poorly in a continual learning setting, where input data distribution changes over time. Inspired by the nervous system learning mechanisms, we develop a computational model that enables a deep neural network to learn new concepts and expand its learned knowledge to new domains incrementally in a continual learning setting. We rely on the Parallel Distributed Processing theory to encode abstract concepts in an embedding space in terms of a multimodal distribution. This embedding space is modeled by internal data representations in a hidden network layer. We also leverage the Complementary Learning Systems theory to equip the model with a memory mechanism to overcome catastrophic forgetting through implementing pseudo-rehearsal. Our model can generate pseudo-data points for experience replay and accumulate new experiences to past learned experiences without causing cross-task interference.

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Gopalakrishnan, Sabarish, Premkumar Udaiyar, Shagan Sah und Raymond Ptucha. „Multi Stage Common Vector Space for Multimodal Embeddings“. In 2019 IEEE Applied Imagery Pattern Recognition Workshop (AIPR). IEEE, 2019. http://dx.doi.org/10.1109/aipr47015.2019.9174583.

Der volle Inhalt der Quelle

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Feng, LiWei, Hao Ai und Yuan Li. „Multimode Process Monitoring Based on Density Space Clustering Locally Linear Embedding Technique“. In 2023 2nd Conference on Fully Actuated System Theory and Applications (CFASTA). IEEE, 2023. http://dx.doi.org/10.1109/cfasta57821.2023.10243375.

Der volle Inhalt der Quelle

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Pasi, Piyush Singh, Karthikeya Battepati, Preethi Jyothi, Ganesh Ramakrishnan, Tanmay Mahapatra und Manoj Singh. „Temporally Aligning Long Audio Interviews with Questions: A Case Study in Multimodal Data Integration“. In Thirty-Second International Joint Conference on Artificial Intelligence {IJCAI-23}. California: International Joint Conferences on Artificial Intelligence Organization, 2023. http://dx.doi.org/10.24963/ijcai.2023/683.

Der volle Inhalt der Quelle

Annotation:

The problem of audio-to-text alignment has seen significant amount of research using complete supervision during training. However, this is typically not in the context of long audio recordings wherein the text being queried does not appear verbatim within the audio file. This work is a collaboration with a non-governmental organization called CARE India that collects long audio health surveys from young mothers residing in rural parts of Bihar, India. Given a question drawn from a questionnaire that is used to guide these surveys, we aim to locate where the question is asked within a long audio recording. This is of great value to African and Asian organizations that would otherwise have to painstakingly go through long and noisy audio recordings to locate questions (and answers) of interest. Our proposed framework, INDENT, uses a cross-attention-based model and prior information on the temporal ordering of sentences to learn speech embeddings that capture the semantics of the underlying spoken text. These learnt embeddings are used to retrieve the corresponding audio segment based on text queries at inference time. We empirically demonstrate the significant effectiveness (improvement in R-avg of about 3%) of our model over those obtained using text-based heuristics. We also show how noisy ASR, generated using state-of-the-art ASR models for Indian languages, yields better results when used in place of speech. INDENT, trained only on Hindi data is able to cater to all languages supported by the (semantically) shared text space. We illustrate this empirically on 11 Indic languages.

APA, Harvard, Vancouver, ISO und andere Zitierweisen

Wir bieten Rabatte auf alle Premium-Pläne für Autoren, deren Werke in thematische Literatursammlungen aufgenommen wurden. Kontaktieren Sie uns, um einen einzigartigen Promo-Code zu erhalten!