Dissertations / Theses on the topic 'Multimodal retrieval'

Consult the top 34 dissertations / theses for your research on the topic 'Multimodal retrieval.'

1

Adebayo, Kolawole John <1986>. "Multimodal Legal Information Retrieval." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2018. http://amsdottorato.unibo.it/8634/1/ADEBAYO-JOHN-tesi.pdf.

Full text
Abstract:
The goal of this thesis is to present a multifaceted way of inducing semantic representations from legal documents as well as accessing information in a precise and timely manner. The thesis explores approaches for semantic information retrieval (IR) in the legal context with a technique that maps specific parts of a text to the relevant concept. This technique relies on text segments: it uses Latent Dirichlet Allocation (LDA), a topic modeling algorithm, to perform text segmentation, expands the concepts using Natural Language Processing techniques, and then associates the text segments with the concepts using a semi-supervised text similarity technique. This addresses two problems, namely user specificity in formulating queries and information overload, since querying a large document collection with a set of concepts is more fine-grained: specific information, rather than full documents, is retrieved. The second part of the thesis describes our Neural Network Relevance Model for e-discovery information retrieval. The algorithm is essentially a feature-rich ensemble system in which different component neural networks extract different relevance signals. The model has been trained and evaluated on the TREC Legal Track 2010 data. The performance of our models across the board shows that they capture the semantics and relatedness between query and document, which is important in the legal information retrieval domain.
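The LDA-based segmentation step described in this abstract can be pictured with a minimal, generic sketch (scikit-learn, toy legal-text segments; variable names and corpus are illustrative assumptions, not the thesis's actual pipeline):

```python
# Minimal sketch: assign LDA topics to text segments (illustrative only,
# not the thesis's pipeline). Requires scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

segments = [
    "the contract may be terminated by either party with written notice",
    "the court held that the defendant was liable for damages",
    "copyright protection extends to original works of authorship",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(segments)          # bag-of-words counts per segment

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)               # segment-by-topic distribution

for seg, dist in zip(segments, doc_topics):
    print(dist.argmax(), seg[:40])              # dominant topic per segment
```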
APA, Harvard, Vancouver, ISO, and other styles
2

Chen, Jianan. "Deep Learning Based Multimodal Retrieval." Electronic Thesis or Diss., Rennes, INSA, 2023. http://www.theses.fr/2023ISAR0019.

Full text
Abstract:
Les tâches multimodales jouent un rôle crucial dans la progression vers l'atteinte de l'intelligence artificielle (IA) générale. L'objectif principal de la recherche multimodale est d'exploiter des algorithmes d'apprentissage automatique pour extraire des informations sémantiques pertinentes, en comblant le fossé entre différentes modalités telles que les images visuelles, le texte linguistique et d'autres sources de données. Il convient de noter que l'entropie de l'information associée à des données hétérogènes pour des sémantiques de haut niveau identiques varie considérablement, ce qui pose un défi important pour les modèles multimodaux. Les modèles de réseau multimodal basés sur l'apprentissage profond offrent une solution efficace pour relever les difficultés découlant des différences substantielles d'entropie de l’information. Ces modèles présentent une précision et une stabilité impressionnantes dans les tâches d'appariement d'informations multimodales à grande échelle, comme la recherche d'images et de textes. De plus, ils démontrent de solides capacités d'apprentissage par transfert, permettant à un modèle bien entraîné sur une tâche multimodale d'être affiné et appliqué à une nouvelle tâche multimodale. Dans nos recherches, nous développons une nouvelle base de données multimodale et multi-vues générative spécifiquement conçue pour la tâche de segmentation référentielle multimodale. De plus, nous établissons une référence de pointe (SOTA) pour les modèles de segmentation d'expressions référentielles dans le domaine multimodal. Les résultats de nos expériences comparatives sont présentés de manière visuelle, offrant des informations claires et complètes
Multimodal tasks play a crucial role in the progression towards achieving general artificial intelligence (AI). The primary goal of multimodal retrieval is to employ machine learning algorithms to extract relevant semantic information, bridging the gap between different modalities such as visual images, linguistic text, and other data sources. It is worth noting that the information entropy associated with heterogeneous data for the same high-level semantics varies significantly, posing a significant challenge for multimodal models. Deep learning-based multimodal network models provide an effective solution to tackle the difficulties arising from substantial differences in information entropy. These models exhibit impressive accuracy and stability in large-scale cross-modal information matching tasks, such as image-text retrieval. Furthermore, they demonstrate strong transfer learning capabilities, enabling a well-trained model from one multimodal task to be fine-tuned and applied to a new multimodal task, even in scenarios involving few-shot or zero-shot learning. In our research, we develop a novel generative multimodal multi-view database specifically designed for the multimodal referential segmentation task. Additionally, we establish a state-of-the-art (SOTA) benchmark and multi-view metric for referring expression segmentation models in the multimodal domain. The results of our comparative experiments are presented visually, providing clear and comprehensive insights
APA, Harvard, Vancouver, ISO, and other styles
3

Böckmann, Christine, Jens Biele, Roland Neuber, and Jenny Niebsch. "Retrieval of multimodal aerosol size distribution by inversion of multiwavelength data." Universität Potsdam, 1997. http://opus.kobv.de/ubp/volltexte/2007/1436/.

Full text
Abstract:
The ill-posed problem of determining the aerosol size distribution from a small number of backscatter and extinction measurements was solved successfully with a mollifier method, which is advantageous since the ill-posed part is performed on exactly given quantities, and the points r where n(r) is evaluated may be freely selected. A new two-dimensional model for the troposphere is proposed.
APA, Harvard, Vancouver, ISO, and other styles
4

Zhu, Meng. "Cross-modal semantic-associative labelling, indexing and retrieval of multimodal data." Thesis, University of Reading, 2010. http://centaur.reading.ac.uk/24828/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Kahn, Itamar. "Remembering the past : multimodal imaging of cortical contributions to episodic retrieval." Thesis, Massachusetts Institute of Technology, 2005. http://hdl.handle.net/1721.1/33171.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Brain and Cognitive Sciences, 2005.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Includes bibliographical references.
What is the nature of the neural processes that allow humans to remember past events? The theoretical framework adopted in this thesis builds upon cognitive models that suggest that episodic retrieval can be decomposed into two classes of computations: (1) recovery processes that serve to reactivate stored memories, making information from a past episode readily available, and (2) control processes that serve to guide the retrieval attempt and monitor/evaluate information arising from the recovery processes. A multimodal imaging approach that combined fMRI and MEG was adopted to gain insight into the spatial and temporal brain mechanisms supporting episodic retrieval. Chapter 1 reviews major findings and theories in the episodic retrieval literature grounding the open questions and controversies within the suggested framework. Chapter 2 describes an fMRI and MEG experiment that identified medial temporal cortical structures that signal item memory strength, thus supporting the perception of item familiarity. Chapter 3 describes an fMRI experiment that demonstrated that retrieval of contextual details involves reactivation of neural patterns engaged at encoding.
Further, leveraging this pattern of reactivation, it was demonstrated that false recognition may be accompanied by recollection. The fMRI experiment reported in Chapter 3, when combined with an MEG experiment reported in Chapter 4, directly addressed questions regarding the control processes engaged during episodic retrieval. In particular, Chapter 3 showed that parietal and prefrontal cortices contribute to controlling the act of arriving at a retrieval decision. Chapter 4 then illuminates the temporal characteristics of parietal activation during episodic retrieval, providing novel evidence about the nature of parietal responses and thus constraints on theories of parietal involvement in episodic retrieval. The conducted research targeted distinct aspects of the multi-faceted act of remembering the past. The obtained data contribute to the building of an anatomical and temporal "blueprint" documenting the cascade of neural events that unfold during attempts to remember, as well as when such attempts are met with success or lead to memory errors. In the course of framing this research within the context of cognitive models of retrieval, the obtained neural data reflect back on and constrain these theories of remembering.
by Itamar Kahn.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
6

Nag Chowdhury, Sreyasi. "Text-image synergy for multimodal retrieval and annotation / Sreyasi Nag Chowdhury." Saarbrücken : Saarländische Universitäts- und Landesbibliothek, 2021. http://d-nb.info/1240674139/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Lolich, María, and Susana Azzollini. "Phenomenological retrieval style of autobiographical memories in a sample of major depressed individuals." Pontificia Universidad Católica del Perú, 2016. http://repositorio.pucp.edu.pe/index/handle/123456789/99894.

Full text
Abstract:
Autobiographical memory retrieval involves different phenomenological features. Given the lack of previous work in Spanish-speaking populations, 34 in-depth interviews were carried out with individuals with and without Major Depressive Disorder in Buenos Aires, Argentina. Phenomenological components during the evocation of autobiographical memories were explored. Data were qualitatively analyzed using Grounded Theory. During the descriptive analyses, seven phenomenological categories were detected as emerging from the discourse. The axial and selective analyses revealed two main discursive axes: rhetoric-propositional and specificity-generality. The impact on affective regulation processes, derived from the assumption of an amodal or multimodal style of processing autobiographical information, merits further attention.
La evocación de recuerdos autobiográficos se caracteriza por presentar distintos componentes fenomenológicos. Dada la ausencia de trabajos previos realizados en poblaciones hispanoparlantes, se realizaron 34 entrevistas en profundidad a individuos con y sin trastorno depresivo mayor de la ciudad de Buenos Aires (Argentina). Fueron explorados los componentes fenomenológicos presentes en la evocación de recuerdos autobiográficos significativos. Los datos fueron analizados cualitativamente por medio de la Teoría Fundamentada en los Hechos. Durante el análisis descriptivo, se detectaron siete categorías fenomenológicas emergentes del discurso. Del análisis axial y selectivo fueron identificados dos ejes discursivos: retórico-proposicional y especificidad-generalidad. Las implicancias, en la regulación afectiva, derivadas de la asunción de un estilo amodal o multimodal de procesamiento de información autobiográfica merecen mayor atención.
A evocação de memórias autobiográficas é caracterizada por diferentes componentes fenomenológicos. Dada a falta de trabalhos prévios sobre o tema em populações de língua espanhola, 34 entrevistas em profundidade foram conduzidas em indivíduos com e sem transtorno depressivo maior na cidade de Buenos Aires (Argentina). Foram explorados os componentes fenomenológicos presentes na evocação de memórias autobiográficas significativas. Os dados foram analisados qualitativamente através da Teoria Fundamentada. Durante a análise descritiva, foram detectadas sete categorias fenomenológicas emergentes no discurso. Das análises axial e seletiva foram identificados dois eixos discursivos: retórico-proposicional e especificidade-generalidade. As implicações, na regulação afetiva, decorrentes da assunção de um estilo amodal ou um estilo multimodal no processamento de informações autobiográficas merecem mais atenção.
APA, Harvard, Vancouver, ISO, and other styles
8

Valero-Mas, Jose J. "Towards Interactive Multimodal Music Transcription." Doctoral thesis, Universidad de Alicante, 2017. http://hdl.handle.net/10045/71275.

Full text
Abstract:
Computer-based music transcription is of vital importance for tasks in the so-called field of Music Information Extraction and Retrieval, owing to its usefulness as a process for obtaining a symbolic abstraction that encodes the musical content of an audio file. This dissertation studies the problem from a perspective different from the one typically considered for such problems: an interactive and multimodal perspective. In this paradigm the user takes on special importance, being an active participant in solving the problem (interactivity); multimodality, in turn, implies that different sources of information extracted from the same signal are combined to help solve the task better.
APA, Harvard, Vancouver, ISO, and other styles
9

Quack, Till. "Large scale mining and retrieval of visual data in a multimodal context." Konstanz Hartung-Gorre, 2009. http://d-nb.info/993614620/04.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Saragiotis, Panagiotis. "Cross-modal classification and retrieval of multimodal data using combinations of neural networks." Thesis, University of Surrey, 2006. http://epubs.surrey.ac.uk/843338/.

Full text
Abstract:
Current neurobiological thinking, supported in part by experimentation, stresses the importance of cross-modality. Uni-modal cognitive tasks, language and vision for example, are performed with the help of many networks working simultaneously or sequentially; and for cross-modal tasks, like picture/object naming and word illustration, the output of these networks is combined to produce higher cognitive behaviour. The notion of multi-net processing is typically used in the pattern recognition literature, where ensemble networks of weak classifiers, typically supervised, appear to outperform strong classifiers. We have built a system, based on combinations of neural networks, that demonstrates how cross-modal classification can be used to retrieve multi-modal data using one of the available modalities of information. Two multi-net systems were used in this work: one comprising Kohonen SOMs that interact with each other via a Hebbian network, and a fuzzy ARTMAP network where the interaction is through the embedded map field. The multi-nets were used for the cross-modal retrieval of images given keywords and for finding the appropriate keywords for an image. The systems were trained on two publicly available image databases that had collateral annotations on the images. The Hemera collection, comprising images of pre-segmented single objects, and the Corel collection, with images of multiple objects, were used for automatically generating various sets of input vectors. We have attempted to develop a method for evaluating the performance of multi-net systems using a monolithic network trained on modally-undifferentiated vectors as an intuitive benchmark. To this end, single SOM and fuzzy ART networks were trained using a concatenated visual/linguistic vector to compare the performance of multi-net systems with typical monolithic systems. Both multi-nets outperform the respective monolithic systems in terms of the information retrieval measures of precision and recall on test images drawn from both datasets; the SOM multi-net outperforms the fuzzy ARTMAP both in terms of convergence and precision-recall. The performance of the SOM-based multi-net in retrieval, classification and auto-annotation is on a par with that of state-of-the-art systems like "ALIP" and "Blobworld". Many of the neural-network-based simulations reported in the literature use supervised learning algorithms. Such algorithms are suited to cases where classes of objects are predefined and the objects themselves are quite unique in terms of their attributes. We have compared the performance of our multi-net systems with that of a multi-layer perceptron (MLP). The MLP does show substantially greater precision and recall on a (fixed) class of objects when compared with our unsupervised systems. However, when 'lesioned' (the network connectivity deliberately 'damaged'), the multi-net systems show a greater degree of robustness. Cross-modal systems appear to hold considerable intellectual and commercial potential, and the multi-net approach facilitates the simulation of such systems.
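The SOM-plus-Hebbian multi-net idea summarised above can be illustrated with a rough, self-contained sketch (pure NumPy, random stand-in features, a simplified 1-D map; an assumption-laden toy, not the system built in the thesis):

```python
# Toy sketch of the multi-net idea: two small SOMs (one per modality) linked by a
# Hebbian co-occurrence matrix. Purely illustrative; dimensions, data and update
# rules are simplified stand-ins, not the thesis's system.
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, n_units=16, epochs=200, lr=0.3, sigma=2.0):
    """1-D SOM: returns a (n_units, dim) weight matrix."""
    w = rng.normal(size=(n_units, data.shape[1]))
    for t in range(epochs):
        decay = np.exp(-t / epochs)
        for x in data:
            bmu = np.argmin(np.linalg.norm(w - x, axis=1))     # best-matching unit
            dist = np.abs(np.arange(n_units) - bmu)
            h = np.exp(-(dist ** 2) / (2 * (sigma * decay) ** 2))
            w += (lr * decay) * h[:, None] * (x - w)
    return w

# Paired visual / textual feature vectors (random stand-ins for real descriptors).
visual = rng.normal(size=(100, 8))
textual = rng.normal(size=(100, 5))

som_v, som_t = train_som(visual), train_som(textual)

# Hebbian link: count co-activations of winning units across the two maps.
hebb = np.zeros((16, 16))
for xv, xt in zip(visual, textual):
    bv = np.argmin(np.linalg.norm(som_v - xv, axis=1))
    bt = np.argmin(np.linalg.norm(som_t - xt, axis=1))
    hebb[bv, bt] += 1

# Cross-modal query: from a visual vector, find the most associated textual unit.
bv = np.argmin(np.linalg.norm(som_v - visual[0], axis=1))
print("textual unit most associated with this image:", hebb[bv].argmax())
```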
APA, Harvard, Vancouver, ISO, and other styles
11

Fedel, Gabriel de Souza. "Busca multimodal para apoio à pesquisa em biodiversidade." [s.n.], 2011. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275751.

Full text
Abstract:
Advisor: Cláudia Maria Bauzer Medeiros
Dissertation (Master's) - Universidade Estadual de Campinas, Instituto de Computação
Resumo: A pesquisa em computação aplicada à biodiversidade apresenta muitos desafios, que vão desde o grande volume de dados altamente heterogêneos até a variedade de tipos de usuários. Isto gera a necessidade de ferramentas versáteis de recuperação. As ferramentas disponíveis ainda são limitadas e normalmente só consideram dados textuais, deixando de explorar a potencialidade da busca por dados de outra natureza, como imagens ou sons. Esta dissertação analisa os problemas de realizar consultas multimodais a partir de predicados que envolvem texto e imagem para o domínio de biodiversidade, especificando e implementando um conjunto de ferramentas para processar tais consultas. As contribuições do trabalho, validado com dados reais, incluem a construção de uma ontologia taxonômica associada a nomes vulgares e a possibilidade de apoiar dois perfis de usuários (especialistas e leigos). Estas características estendem o escopo da consultas atualmente disponíveis em sistemas de biodiversidade. Este trabalho está inserido no projeto Bio-CORE, uma parceria entre pesquisadores de computação e biologia para criar ferramentas computacionais para dar apoio à pesquisa em biodiversidade
Abstract: Research on computing applied to biodiversity presents several challenges, ranging from the massive volumes of highly heterogeneous data to the variety in user profiles. This kind of scenario requires versatile data retrieval and management tools. Available tools are still limited. Most often, they only consider textual data and do not take advantage of the multiple data types available, such as images or sounds. This dissertation discusses issues concerning multimodal queries that involve both text and images as search parameters, for the domain of biodiversity. It presents the specification and implementation of a set of tools to process such queries, which were validated with real data from Unicamp's Zoology Museum. The main contributions also include the construction of a taxonomic ontology that includes species' common names, and support for queries by both researchers and non-experts. Such features extend the scope of queries available in biodiversity information systems. This research is associated with the Bio-CORE project, jointly conducted by researchers in computing and biology, to design and develop computational tools to support research in biodiversity.
Master's program
Databases
Master in Computer Science
APA, Harvard, Vancouver, ISO, and other styles
12

Dyar, Samuel S. "A multimodal speech interface for dynamic creation and retrieval of geographical landmarks on a mobile device." Thesis, Massachusetts Institute of Technology, 2010. http://hdl.handle.net/1721.1/62638.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 140).
As mobile devices become more powerful, researchers look to develop innovative applications that use new and effective means of input. Furthermore, developers must exploit the device's many capabilities (GPS, camera, touch screen, etc) in order to make equally powerful applications. This thesis presents the development of a multimodal system that allows users to create and share informative geographical landmarks using Android-powered smart-phones. The content associated with each landmark is dynamically integrated into the system's vocabulary, which allows users to easily use speech to access landmarks by the information related to them. The initial results of releasing the application on the Android Market have been encouraging, but also suggest that improvements need to be made to the system.
by Samuel S. Dyar.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
13

Calumby, Rodrigo Tripodi 1985. "Recuperação multimodal de imagens com realimentação de relevância baseada em programação genética." [s.n.], 2010. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275814.

Full text
Abstract:
Advisor: Ricardo da Silva Torres
Dissertation (Master's) - Universidade Estadual de Campinas, Instituto de Computação
Resumo: Este trabalho apresenta uma abordagem para recuperação multimodal de imagens com realimentação de relevância baseada em programação genética. Supõe-se que cada imagem da coleção possui informação textual associada (metadado, descrição textual, etc.), além de ter suas propriedades visuais (por exemplo, cor e textura) codificadas em vetores de características. A partir da informação obtida ao longo das iterações de realimentação de relevância, programação genética é utilizada para a criação de funções de combinação de medidas de similaridades eficazes. Com essas novas funções, valores de similaridades diversos são combinados em uma única medida, que mais adequadamente reflete as necessidades do usuário. As principais contribuições deste trabalho consistem na proposta e implementação de dois arcabouços. O primeiro, RFCore, é um arcabouço genérico para atividades de realimentação de relevância para manipulação de objetos digitais. O segundo, MMRFGP, é um arcabouço para recuperação de objetos digitais com realimentação de relevância baseada em programação genética, construído sobre o RFCore. O método proposto de recuperação multimodal de imagens foi validado sobre duas coleções de imagens, uma desenvolvida pela Universidade de Washington e outra da ImageCLEF Photographic Retrieval Task. A abordagem proposta mostrou melhores resultados para recuperação multimodal frente a utilização das modalidades isoladas. Além disso, foram obtidos resultados para recuperação visual e multimodal melhores do que as melhores submissões para a ImageCLEF Photographic Retrieval Task 2008
Abstract: This work presents an approach for multimodal content-based image retrieval with relevance feedback based on genetic programming. We assume that there is textual information (e.g., metadata, textual descriptions) associated with collection images. Furthermore, image content properties (e.g., color and texture) are characterized by image descriptors. Given the information obtained over the relevance feedback iterations, genetic programming is used to create effective combination functions that combine similarities associated with different features. Using these new functions, the different similarities are combined into a unique measure that more properly meets the user's needs. The main contribution of this work is the proposal and implementation of two frameworks. The first one, RFCore, is a generic framework for relevance feedback tasks over digital objects. The second one, MMRF-GP, is a framework for digital object retrieval with relevance feedback based on genetic programming, built on top of RFCore. We have validated the proposed multimodal image retrieval approach on two datasets, one from the University of Washington and another from the ImageCLEF Photographic Retrieval Task. Our approach yielded the best results for multimodal image retrieval when compared with one-modality approaches. Furthermore, it achieved better results for visual and multimodal image retrieval than the best submissions to the ImageCLEF Photographic Retrieval Task 2008.
Master's program
Information Retrieval Systems
Master in Computer Science
APA, Harvard, Vancouver, ISO, and other styles
14

Durak, Nurcan. "Semantic Video Modeling And Retrieval With Visual, Auditory, Textual Sources." Master's thesis, METU, 2004. http://etd.lib.metu.edu.tr/upload/12605438/index.pdf.

Full text
Abstract:
Studies on content-based video indexing and retrieval aim at accessing video content from different aspects more efficiently and effectively. Most studies have concentrated on the visual component of video content when modeling and retrieving it. Besides the visual component, much valuable information is also carried in other media components, such as superimposed text, closed captions, audio, and speech that accompany the pictorial component. In this study, the semantic content of video is modeled using visual, auditory, and textual components. In the visual domain, visual events, visual objects, and spatial characteristics of visual objects are extracted. In the auditory domain, auditory events and auditory objects are extracted. In the textual domain, speech transcripts and visible texts are considered. With our proposed model, users can access video content from different aspects and get the desired information more quickly. Besides multimodality, our model is built on semantic hierarchies that enable querying the video content at different semantic levels. There are sequence-scene hierarchies in the visual domain, background-foreground hierarchies in the auditory domain, and subject hierarchies in the speech domain. The presented model has been implemented, and multimodal content queries, hierarchical queries, fuzzy spatial queries, fuzzy regional queries, fuzzy spatio-temporal queries, and temporal queries have been applied to video content successfully.
APA, Harvard, Vancouver, ISO, and other styles
15

Oztarak, Hakan. "Structural And Event Based Multimodal Video Data Modeling." Master's thesis, METU, 2005. http://etd.lib.metu.edu.tr/upload/12606919/index.pdf.

Full text
Abstract:
Investments in multimedia technology enable us to store many more reflections of the real world in the digital world as videos. By recording videos about real-world entities, we carry a lot of information to the digital world directly. In order to store and efficiently query this information, a video database system (VDBS) is necessary. In this thesis work, we propose a structural, event-based and multimodal (SEBM) video data model for VDBSs. The SEBM video data model supports three different modalities, namely the visual, auditory and textual modalities, and we propose that all three can be handled within a single SEBM video data model. This proposal is supported by the way humans interpret video data. Hence we can answer content-based, spatio-temporal and fuzzy queries of the user more easily, since we store the video data in the way that the user interprets real-world data. We follow a divide-and-conquer technique when answering very complicated queries. We have implemented the SEBM video data model in a Java-based prototype system that uses XML for representing the SEBM data model and Berkeley XML DBMS for storing the data.
APA, Harvard, Vancouver, ISO, and other styles
16

Rubio Romano, Antonio. "Fashion discovery: a computer vision approach." Doctoral thesis, TDX (Tesis Doctorals en Xarxa), 2021. http://hdl.handle.net/10803/672423.

Full text
Abstract:
Performing semantic interpretation of fashion images is undeniably one of the most challenging domains for computer vision. Subtle variations in color and shape might confer different meanings or interpretations to an image. Not only is it a domain tightly coupled with human understanding, but also with scene interpretation and context. Being able to extract fashion-specific information from images and interpret that information in a proper manner can be useful in many situations and help understanding the underlying information in an image. Fashion is also one of the most important businesses around the world, with an estimated value of 3 trillion dollars and a constantly growing online market, which increases the utility of image-based algorithms to search, classify or recommend garments. This doctoral thesis aims to solve specific problems related with the treatment of fashion e-commerce data, from low-level pure pixel information to high-level abstract conclusions of the garments appearing in an image, taking advantage of the multi-modality of the available data for developing some of the solutions. The contributions include: - A new superpixel extraction method focused on improving the annotation process for clothing images. - The construction of an image and text embedding for fashion data. - The application of this embedding space to the task of retrieving the main product in an image showing a complete outfit. In summary, fashion is a complex computer vision and machine learning problem at many levels, and developing specific algorithms that are able to capture essential information from pictures and text is not trivial. In order to solve some of the challenges it proposes, and taking into account that this is an Industrial Ph.D., we contribute with a variety of solutions that can boost the performance of many tasks useful for the fashion e-commerce industry.
La interpretación semántica de imágenes del mundo de la moda es sin duda uno de los dominios más desafiantes para la visión por computador. Leves variaciones en color y forma pueden conferir significados o interpretaciones distintas a una imagen. Es un dominio estrechamente ligado a la comprensión humana subjetiva, pero también a la interpretación y reconocimiento de escenarios y contextos. Ser capaz de extraer información específica sobre moda de imágenes e interpretarla de manera correcta puede ser útil en muchas situaciones y puede ayudar a entender la información subyacente en una imagen. Además, la moda es uno de los negocios más importantes a nivel global, con un valor estimado de tres trillones de dólares y un mercado online en constante crecimiento, lo cual aumenta el interés de los algoritmos basados en imágenes para buscar, clasificar o recomendar prendas. Esta tesis doctoral pretende resolver problemas específicos relacionados con el tratamiento de datos de tiendas virtuales de moda, yendo desde la información más básica a nivel de píxel hasta un entendimiento más abstracto que permita extraer conclusiones sobre las prendas presentes en una imagen, aprovechando para ello la Multi-modalidad de los datos disponibles para desarrollar algunas de las soluciones. Las contribuciones incluyen: - Un nuevo método de extracción de superpíxeles enfocado a mejorar el proceso de anotación de imágenes de moda. - La construcción de un espacio común para representar imágenes y textos referentes a moda. - La aplicación de ese espacio en la tarea de identificar el producto principal dentro de una imagen que muestra un conjunto de prendas. En resumen, la moda es un dominio complejo a muchos niveles en términos de visión por computador y aprendizaje automático, y desarrollar algoritmos específicos capaces de capturar la información esencial a partir de imágenes y textos no es una tarea trivial. Con el fin de resolver algunos de los desafíos que esta plantea, y considerando que este es un doctorado industrial, contribuimos al tema con una variedad de soluciones que pueden mejorar el rendimiento de muchas tareas extremadamente útiles para la industria de la moda online
Automatic Control, Robotics and Vision
APA, Harvard, Vancouver, ISO, and other styles
17

SIMONETTA, FEDERICO. "MUSIC INTERPRETATION ANALYSIS. A MULTIMODAL APPROACH TO SCORE-INFORMED RESYNTHESIS OF PIANO RECORDINGS." Doctoral thesis, Università degli Studi di Milano, 2022. http://hdl.handle.net/2434/918909.

Full text
Abstract:
This Thesis discusses the development of technologies for the automatic resynthesis of music recordings using digital synthesizers. First, the main issue is identified in the understanding of how Music Information Processing (MIP) methods can take into consideration the influence of the acoustic context on the music performance. For this, a novel conceptual and mathematical framework named “Music Interpretation Analysis” (MIA) is presented. In the proposed framework, a distinction is made between the “performance” – the physical action of playing – and the “interpretation” – the action that the performer wishes to achieve. Second, the Thesis describes further works aiming at the democratization of music production tools via automatic resynthesis: 1) it elaborates software and file formats for historical music archiving and multimodal machine-learning datasets; 2) it explores and extends MIP technologies; 3) it presents the mathematical foundations of the MIA framework and shows preliminary evaluations to demonstrate the effectiveness of the approach
APA, Harvard, Vancouver, ISO, and other styles
18

Ismail, Nor Azman. "Flexible photo retrieval (FlexPhoReS): a prototype for multimodal personal digital photo retrieval." Thesis, Loughborough University, 2007. https://dspace.lboro.ac.uk/2134/12924.

Full text
Abstract:
Digital photo technology is developing rapidly and is motivating more people to have large personal collections of digital photos. However, effective and fast retrieval of digital photos is not always easy, especially when the collections grow into thousands. The World Wide Web (WWW) is one of the platforms that allows digital photo users to publish a collection of photos in a centralised and organised way. Users typically find their photos by searching or browsing using a keyboard and mouse. Also in development at the moment are alternative user interfaces such as graphical user interfaces with speech (S/GUI) and other multimodal user interfaces which offer more flexibility to users. The aim of this research was to design and evaluate a flexible user interface for a web-based personal digital photo retrieval system. A model of a flexible photo retrieval system (FlexPhoReS) was developed based on a review of the literature and a small-scale user study. A prototype, based on the model, was built using MATLAB and WWW technology. FlexPhoReS is a web-based personal digital photo retrieval prototype that enables digital photo users to accomplish photo retrieval tasks (browsing, keyword and visual example searching (CBI)) using either mouse and keyboard input modalities or mouse and speech input modalities. An evaluation with 20 digital photo users was conducted using usability testing methods. The results showed that there was a significant difference in search performance between using mouse and keyboard input modalities and using mouse and speech input modalities. On average, the reduction in search performance time due to using mouse and speech input modalities was 37.31%. Participants were also significantly more satisfied with mouse and speech input modalities than with mouse and keyboard input modalities, although they felt that both were complementary. This research demonstrated that the prototype was successful in providing a flexible model of the photo retrieval process by offering alternative input modalities through a multimodal user interface in the World Wide Web environment.
APA, Harvard, Vancouver, ISO, and other styles
19

Bonardi, Fabien. "Localisation visuelle multimodale visible/infrarouge pour la navigation autonome." Thesis, Normandie, 2017. http://www.theses.fr/2017NORMR028/document.

Full text
Abstract:
On regroupe sous l'expression navigation autonome l'ensemble des méthodes visant à automatiser les déplacements d'un robot mobile. Les travaux présentés se concentrent sur la problématique de la localisation en milieu extérieur, urbain et périurbain, et approchent la problématique de la localisation visuelle soumise à la fois à un changement de capteurs (géométrie et modalité) ainsi qu'aux changements de l'environnement à long terme, contraintes combinées encore très peu étudiées dans l'état de l'art. Les recherches menées dans le cadre de cette thèse ont porté sur l'utilisation exclusive de capteurs de vision. La contribution majeure de cette thèse porte sur la phase de description et compression des données issues des images sous la forme d'un histogramme de mots visuels que nous avons nommée PHROG (Plural Histograms of Restricted Oriented Gradients). Les expériences menées ont été réalisées sur plusieurs bases d'images avec différentes modalités visibles et infrarouges. Les résultats obtenus démontrent une amélioration des performances de reconnaissance de scènes comparés aux méthodes de l'état de l'art. Par la suite, nous nous intéresserons à la nature séquentielle des images acquises dans un contexte de navigation afin de filtrer et supprimer des estimations de localisation aberrantes. Les concepts d'un cadre probabiliste Bayésien permettent deux applications de filtrage probabiliste appliquées à notre problématique : une première solution définit un modèle de déplacement simple du robot avec un filtre d'histogrammes et la deuxième met en place un modèle plus évolué faisant appel à l'odométrie visuelle au sein d'un filtre particulaire.
The field of autonomous navigation gathers the set of algorithms which automate the moves of a mobile robot. The case study of this thesis focuses on the outdoor localisation issue with additional constraints: the use of visual sensors only, with variable specifications (geometry, modality, etc.), and long-term appearance changes of the surrounding environment. Both types of constraints are still rarely studied in the state of the art. Our main contribution concerns the description and compression steps of the data extracted from images. We developed a method called PHROG which represents data as a visual-words histogram. Obtained results on several image datasets show an improvement of the scene recognition performance compared to methods from the state of the art. In a context of navigation, acquired images are sequential, such that we can envision a filtering method to avoid faulty localisation estimations. Two probabilistic filtering approaches are proposed: a first one defines a simple movement model with a histogram filter, and a second one sets up a more complex model using visual odometry and a particle filter.
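The visual-words-histogram representation mentioned in both abstracts can be illustrated with a generic bag-of-visual-words sketch (scikit-learn k-means over random stand-in descriptors; it does not reproduce the PHROG descriptor itself):

```python
# Generic bag-of-visual-words sketch: quantise local descriptors with k-means and
# describe each image by a histogram of visual words. Illustrative only; it does
# not reproduce the PHROG descriptor.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in local descriptors (e.g., gradient-based patches), one array per image.
image_descriptors = [rng.normal(size=(rng.integers(40, 80), 32)) for _ in range(5)]

# 1) Learn a visual vocabulary on all descriptors pooled together.
vocab_size = 64
kmeans = KMeans(n_clusters=vocab_size, n_init=10, random_state=0)
kmeans.fit(np.vstack(image_descriptors))

# 2) Each image becomes a normalised histogram of visual-word occurrences.
def bovw_histogram(desc):
    words = kmeans.predict(desc)
    hist = np.bincount(words, minlength=vocab_size).astype(float)
    return hist / hist.sum()

histograms = np.array([bovw_histogram(d) for d in image_descriptors])

# 3) Scene recognition as nearest histogram (here, cosine similarity).
sims = histograms @ histograms[0] / (
    np.linalg.norm(histograms, axis=1) * np.linalg.norm(histograms[0]))
print("closest image to image 0:", np.argsort(-sims)[1])
```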
APA, Harvard, Vancouver, ISO, and other styles
20

Nguyen, Nhu Van. "Représentations visuelles de concepts textuels pour la recherche et l'annotation interactives d'images." Phd thesis, Université de La Rochelle, 2011. http://tel.archives-ouvertes.fr/tel-00730707.

Full text
Abstract:
In image retrieval today, we often handle large volumes of images, which may vary or even arrive continuously. In an image database, we therefore end up with some old images and some new ones, the former already indexed and possibly annotated, the latter awaiting indexing or annotation. Since the database is not uniformly annotated, access through textual queries is difficult. In this work we present different techniques for interacting with, browsing and searching this kind of image database. First, a short-term interaction model is used to improve the precision of the system. Second, based on a long-term interaction model, we propose to associate textual words with visual features for image retrieval by text, by visual content, or by a mixture of text and visual content. This image retrieval model makes it possible to iteratively refine the annotation and the knowledge about the images. We identify four contributions in this work. The first contribution is a multimodal image retrieval system that integrates different sources of data, such as image content and text; it allows querying by image, querying by keyword, or the use of hybrid queries. The second contribution is a new relevance feedback technique combining two classical techniques widely used in information retrieval: query point movement and query expansion. By taking advantage of non-relevant images and of the strengths of these two classical techniques, our method gives very good results for effective interactive image retrieval. The third contribution is a model named "Bags of KVR" (Keyword Visual Representation), which creates links between semantic concepts and visual representations, building on the Bag of Words model. Thanks to an incremental learning strategy, this model provides the association between semantic concepts and visual features, which helps improve the precision of image annotation and the retrieval performance. The fourth contribution is a mechanism for incrementally building knowledge from scratch. We do not separate the annotation and retrieval phases, so the user can issue queries as soon as the system is started, while the system learns progressively as it is used. The above contributions are complemented by an interface allowing visualisation and mixed textual/visual querying. Even though only two types of information are used for now, namely text and visual content, the genericity of the proposed model allows its extension to other types of information external to the image, such as location (GPS) and time.
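The combination of query point movement and query expansion described in the second contribution can be sketched very roughly as follows (NumPy; the weights and the naive expansion rule are assumptions for illustration, not the method proposed in the thesis):

```python
# Toy relevance-feedback sketch combining query-point movement (Rocchio) with a
# simple query expansion step. Illustrative only; weights and the expansion rule
# are assumptions, not the thesis's method.
import numpy as np
from collections import Counter

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector towards relevant examples and away from non-relevant ones."""
    q = alpha * query
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(non_relevant):
        q -= gamma * np.mean(non_relevant, axis=0)
    return q

def expand_query(keywords, relevant_tags, top_k=2):
    """Naive expansion: add the most frequent tags of the relevant images."""
    counts = Counter(t for tags in relevant_tags for t in tags if t not in keywords)
    return keywords + [t for t, _ in counts.most_common(top_k)]

rng = np.random.default_rng(0)
query_vec = rng.normal(size=16)
relevant = rng.normal(size=(3, 16))
non_relevant = rng.normal(size=(5, 16))

new_query_vec = rocchio(query_vec, relevant, non_relevant)
new_keywords = expand_query(["beach"], [["beach", "sunset"], ["beach", "palm"], ["sunset"]])
print(new_keywords)   # e.g. ['beach', 'sunset', 'palm']
```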
APA, Harvard, Vancouver, ISO, and other styles
21

Inagaki, Yasuyoshi, Katsuhiko Toyama, Nobuo Kawaguchi, Shigeki Matsubara, and Satoru Matsunaga. "Sync/Mail : 話し言葉の漸進的変換に基づく即時応答インタフェース [An immediate-response interface based on incremental transformation of spoken language]." Information Processing Society of Japan, 1998. http://hdl.handle.net/2237/15382.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Guillaumin, Matthieu. "Données multimodales pour l'analyse d'image." Phd thesis, Grenoble, 2010. http://tel.archives-ouvertes.fr/tel-00522278/en/.

Full text
Abstract:
This thesis investigates the use of textual metadata for image analysis. We seek to exploit this additional information as weak supervision for learning visual recognition models. There has been a recent and growing interest in methods that can exploit this type of data, because they can potentially remove the need for manual annotation, which is costly in time and resources. We focus on two types of visual data with associated textual information. First, we use news images accompanied by descriptive captions to address several face-related tasks. Among these, face verification is the task of deciding whether two images depict the same person, and face naming seeks to associate the faces in a database with their correct names. Second, we explore models for automatically predicting relevant labels for images, a problem known as automatic image annotation; these models can also be used to perform keyword-based image search. Finally, we study a semi-supervised multimodal learning scenario for image categorisation, in which labels are assumed to be present for the training data, whether manually annotated or not, and absent from the test data. Our work builds on the observation that most of these problems can be solved if perfectly adapted similarity measures are used. We therefore propose new approaches that combine metric learning, nearest-neighbour models and graph-based methods to learn, from visual and textual data, visual similarities that are specific to each problem. For faces, our similarities focus on the identity of the individuals, while for images they concern more general semantic concepts. Experimentally, our approaches achieve state-of-the-art performance on several challenging datasets. For both types of data considered, we clearly show that learning benefits from the additional textual information, resulting in improved performance of visual recognition systems.
APA, Harvard, Vancouver, ISO, and other styles
23

Guillaumin, Matthieu. "Données multimodales pour l'analyse d'image." Phd thesis, Grenoble, 2010. http://www.theses.fr/2010GRENM048.

Full text
Abstract:
La présente thèse s'intéresse à l'utilisation de méta-données textuelles pour l'analyse d'image. Nous cherchons à utiliser ces informations additionelles comme supervision faible pour l'apprentissage de modèles de reconnaissance visuelle. Nous avons observé un récent et grandissant intérêt pour les méthodes capables d'exploiter ce type de données car celles-ci peuvent potentiellement supprimer le besoin d'annotations manuelles, qui sont coûteuses en temps et en ressources. Nous concentrons nos efforts sur deux types de données visuelles associées à des informations textuelles. Tout d'abord, nous utilisons des images de dépêches qui sont accompagnées de légendes descriptives pour s'attaquer à plusieurs problèmes liés à la reconnaissance de visages. Parmi ces problèmes, la vérification de visages est la tâche consistant à décider si deux images représentent la même personne, et le nommage de visages cherche à associer les visages d'une base de données à leur noms corrects. Ensuite, nous explorons des modèles pour prédire automatiquement les labels pertinents pour des images, un problème connu sous le nom d'annotation automatique d'image. Ces modèles peuvent aussi être utilisés pour effectuer des recherches d'images à partir de mots-clés. Nous étudions enfin un scénario d'apprentissage multimodal semi-supervisé pour la catégorisation d'image. Dans ce cadre de travail, les labels sont supposés présents pour les données d'apprentissage, qu'elles soient manuellement annotées ou non, et absentes des données de test. Nos travaux se basent sur l'observation que la plupart de ces problèmes peuvent être résolus si des mesures de similarité parfaitement adaptées sont utilisées. Nous proposons donc de nouvelles approches qui combinent apprentissage de distance, modèles par plus proches voisins et méthodes par graphes pour apprendre, à partir de données visuelles et textuelles, des similarités visuelles spécifiques à chaque problème. Dans le cas des visages, nos similarités se concentrent sur l'identité des individus tandis que, pour les images, elles concernent des concepts sémantiques plus généraux. Expérimentalement, nos approches obtiennent des performances à l'état de l'art sur plusieurs bases de données complexes. Pour les deux types de données considérés, nous montrons clairement que l'apprentissage bénéficie de l'information textuelle supplémentaire résultant en l'amélioration de la performance des systèmes de reconnaissance visuelle
This dissertation delves into the use of textual metadata for image understanding. We seek to exploit this additional textual information as weak supervision to improve the learning of recognition models. There is a recent and growing interest for methods that exploit such data because they can potentially alleviate the need for manual annotation, which is a costly and time-consuming process. We focus on two types of visual data with associated textual information. First, we exploit news images that come with descriptive captions to address several face-related tasks, including face verification, which is the task of deciding whether two images depict the same individual, and face naming, the problem of associating faces in a data set to their correct names. Second, we consider data consisting of images with user tags. We explore models for automatically predicting tags for new images, i.e. image auto-annotation, which can also be used for keyword-based image search. We also study a multimodal semi-supervised learning scenario for image categorisation. In this setting, the tags are assumed to be present in both labelled and unlabelled training data, while they are absent from the test data. Our work builds on the observation that most of these tasks can be solved if perfectly adequate similarity measures are used. We therefore introduce novel approaches that involve metric learning, nearest neighbour models and graph-based methods to learn, from the visual and textual data, task-specific similarities. For faces, our similarities focus on the identities of the individuals while, for images, they address more general semantic visual concepts. Experimentally, our approaches achieve state-of-the-art results on several standard and challenging data sets. On both types of data, we clearly show that learning using additional textual information improves the performance of visual recognition systems.
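A toy baseline for the auto-annotation task discussed in this abstract is nearest-neighbour tag propagation, sketched below (NumPy; the features and tags are synthetic stand-ins, and the thesis's metric-learning and graph-based models are not reproduced):

```python
# Minimal nearest-neighbour tag propagation for auto-annotation: tags are scored by
# how often they occur among the visually closest training images. A toy baseline,
# not the weighted metric-learning models developed in the thesis.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

train_feats = rng.normal(size=(200, 64))                 # visual features of training images
train_tags = [["beach", "sea"] if i % 2 else ["city", "night"] for i in range(200)]

def annotate(query_feat, k=10, n_tags=2):
    dists = np.linalg.norm(train_feats - query_feat, axis=1)
    neighbours = np.argsort(dists)[:k]
    votes = Counter(t for i in neighbours for t in train_tags[i])
    return [t for t, _ in votes.most_common(n_tags)]

print(annotate(rng.normal(size=64)))
```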
APA, Harvard, Vancouver, ISO, and other styles
24

Slizovskaia, Olga. "Audio-visual deep learning methods for musical instrument classification and separation." Doctoral thesis, Universitat Pompeu Fabra, 2020. http://hdl.handle.net/10803/669963.

Full text
Abstract:
In music perception, the information we receive from a visual system and audio system is often complementary. Moreover, visual perception plays an important role in the overall experience of being exposed to a music performance. This fact brings attention to machine learning methods that could combine audio and visual information for automatic music analysis. This thesis addresses two research problems: instrument classification and source separation in the context of music performance videos. A multimodal approach for each task is developed using deep learning techniques to train an encoded representation for each modality. For source separation, we also study two approaches conditioned on instrument labels and examine the influence that two extra sources of information have on separation performance compared with a conventional model. Another important aspect of this work is in the exploration of different fusion methods which allow for better multimodal integration of information sources from associated domains.
En la percepción musical, normalmente recibimos por nuestro sistema visual y por nuestro sistema auditivo informaciones complementarias. Además, la percepción visual juega un papel importante en nuestra experiencia integral ante una interpretación musical. Esta relación entre audio y visión ha incrementado el interés en métodos de aprendizaje automático capaces de combinar ambas modalidades para el análisis musical automático. Esta tesis se centra en dos problemas principales: la clasificación de instrumentos y la separación de fuentes en el contexto de videos musicales. Para cada uno de los problemas, se desarrolla un método multimodal utilizando técnicas de Deep Learning. Esto nos permite obtener -a través del aprendizaje- una representación codificada para cada modalidad. Además, para el problema de la separación de fuentes, también proponemos dos modelos condicionados a las etiquetas de los instrumentos, y examinamos la influencia que tienen dos fuentes de información extra en el rendimiento de la separación -comparándolas contra un modelo convencional-. Otro aspecto importante de este trabajo se basa en la exploración de diferentes modelos de fusión que permiten una mejor integración multimodal de fuentes de información de dominios asociados.
En la percepció visual, és habitual que rebem informacions complementàries des del nostres sistemes visual i auditiu. A més a més, la percepció visual té un paper molt important en la nostra experiència integral davant una interpretació musical. Aquesta relació entre àudio i visió ha fet créixer l'interès en mètodes d’aprenentatge automàtic capaços de combinar ambdues modalitats per l’anàlisi musical automàtic. Aquesta tesi se centra en dos problemes principals: la classificació d'instruments i la separació de fonts en el context dels vídeos musicals. Per a cadascú dels problemes, s'ha desenvolupat un mètode multimodal fent servir tècniques de Deep Learning. Això ens ha permès d'obtenir – gràcies a l’aprenentatge- una representació codificada per a cada modalitat. A més a més, en el cas del problema de separació de fonts, també proposem dos models condicionats a les etiquetes dels instruments, i examinem la influència que tenen dos fonts d’informació extra sobre el rendiment de la separació -tot comparant-les amb un model convencional-. Un altre aspecte d’aquest treball es basa en l’exploració de diferents models de fusió, els quals permeten una millor integració multimodal de fonts d'informació de dominis associats.
APA, Harvard, Vancouver, ISO, and other styles
25

Karlsson, Kristina. "Semantic representations of retrieved memory information depend on cue-modality." Thesis, Stockholms universitet, Psykologiska institutionen, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-58817.

Full text
Abstract:
The semantic content (i.e., meaning of words) is the essence of retrieved autobiographical memories. In comparison to previous research, which has mainly focused on phenomenological experiences and age distribution of memory events, the present study provides a novel view on the retrieval of event information by addressing the semantic representation of memories. In the present study the semantic representation (i.e., word locations represented by vectors in a high dimensional space) of retrieved memory information were investigated, by analyzing the data with an automatic statistical algorithm. The experiment comprised a cued recall task, where participants were presented with unimodal (i.e., one sense modality) or multimodal (i.e., three sense modalities in conjunction) retrieval cues and asked to recall autobiographical memories. The memories were verbally narrated, recorded and transcribed to text. The semantic content of the memory narrations was analyzed with a semantic representation generated by latent semantic analysis (LSA). The results indicated that the semantic representation of visually evoked memories were most similar to the multimodally evoked memories, followed by auditorily and olfactorily evoked memories. By categorizing the semantic content into clusters, the present study also identified unique characteristics in the memory content across modalities.
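The LSA step described above can be pictured with a minimal scikit-learn sketch (toy narrations; the study's actual corpus, preprocessing and clustering are not reproduced):

```python
# Minimal LSA sketch: TF-IDF term-document matrix reduced with truncated SVD, then
# narrative similarity as cosine similarity in the latent space. Toy texts only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

narrations = [
    "I remember the smell of my grandmother's kitchen during the holidays",
    "We heard the band playing while walking along the crowded boulevard",
    "The kitchen was full of the scent of baking bread that winter morning",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(narrations)

lsa = TruncatedSVD(n_components=2, random_state=0)
Z = lsa.fit_transform(X)                    # narrations in the latent semantic space

print(cosine_similarity(Z))                 # pairwise semantic similarity of memories
```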
APA, Harvard, Vancouver, ISO, and other styles
26

Poignant, Johann. "Identification non-supervisée de personnes dans les flux télévisés." Phd thesis, Université de Grenoble, 2013. http://tel.archives-ouvertes.fr/tel-00958774.

Full text
Abstract:
The goal of this thesis is to propose several unsupervised methods for identifying the people present in television broadcasts using the names written on screen. Since the use of biometric models to recognise the people appearing in large video collections is hardly viable without prior knowledge of the people to identify, several state-of-the-art methods propose to use other sources of information to obtain the names of the people present. These methods mainly use pronounced names (names spoken in the audio) as the source of names. However, only limited confidence can be placed in this source, because of transcription or name-detection errors and also because of the difficulty of knowing to whom a pronounced name refers. Names written on screen in television broadcasts had been little used because of the difficulty of extracting them from low-quality video. In recent years, however, both video quality and the quality of on-screen text overlays have improved. In this thesis we therefore re-evaluated the use of this source of names. We first developed LOOV (Lig Overlaid OCR in Video), a tool for extracting the text overlaid on the image in videos. With this tool we obtain a very low character error rate, which allows us to place high confidence in this source of names. We then compared written names and pronounced names in their ability to provide the names of the people appearing in television broadcasts. It turned out that twice as many people can be named from written names as from automatically extracted pronounced names. Another important point is that the association between a name and a person is intrinsically simpler for written names than for pronounced names. This very good source of names therefore allowed us to develop several unsupervised methods for naming the people appearing in television broadcasts. We started with late naming methods, in which names are propagated onto speaker clusters; these methods question, to varying degrees, the choices made when grouping speech turns into speaker clusters. We then proposed two methods (integrated naming and early naming) that increasingly incorporate the information from written names into the clustering process. To identify visible people, we adapted the early naming method to face clusters. Finally, we also showed that this method works for naming multimodal voice-face clusters. With this last method, which names speech turns and faces in a single process, we obtain results comparable to the best systems that competed in the first REPERE evaluation campaign.
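The late-naming idea, propagating on-screen names onto speaker clusters, can be sketched with a toy co-occurrence vote (plain Python; timestamps, cluster ids and names are invented for illustration, and the thesis's integrated and early naming strategies are not reproduced):

```python
# Toy late-naming sketch: each speaker cluster receives the overlaid name it
# co-occurs with most often. All data below is illustrative.
from collections import Counter, defaultdict

# (start, end, speaker_cluster) segments from speaker diarization.
speech_turns = [(0, 10, "S1"), (10, 18, "S2"), (18, 30, "S1"), (30, 40, "S3")]

# (start, end, name) intervals where a name is written on screen (e.g., from OCR).
overlaid_names = [(2, 9, "Alice Martin"), (19, 28, "Alice Martin"), (31, 39, "Paul Durand")]

def overlap(a, b):
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

votes = defaultdict(Counter)
for turn in speech_turns:
    for name_span in overlaid_names:
        ov = overlap(turn, name_span)
        if ov > 0:
            votes[turn[2]][name_span[2]] += ov

naming = {cluster: c.most_common(1)[0][0] for cluster, c in votes.items()}
print(naming)   # {'S1': 'Alice Martin', 'S3': 'Paul Durand'}; S2 stays anonymous
```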
APA, Harvard, Vancouver, ISO, and other styles
27

Tran, Thi Quynh Nhi. "Robust and comprehensive joint image-text representations." Thesis, Paris, CNAM, 2017. http://www.theses.fr/2017CNAM1096/document.

Full text
Abstract:
This thesis investigates the joint modeling of the visual and textual content of multimedia documents to address cross-modal problems. Such tasks require the ability to match information across modalities. A common representation space, obtained for example by Kernel Canonical Correlation Analysis, on which images and text can both be represented and directly compared, is a generally adopted solution. Nevertheless, such a joint space still suffers from several deficiencies that may hinder the performance of cross-modal tasks. An important contribution of this thesis is therefore to identify two major limitations of such a space. The first limitation concerns information that is poorly represented on the common space yet very significant for a retrieval task. The second limitation consists in a separation between modalities on the common space, which leads to coarse cross-modal matching. To deal with the first limitation, concerning poorly-represented data, we put forward a model which first identifies such information and then finds ways to combine it with data that is relatively well-represented on the joint space. Evaluations on text illustration tasks show that by appropriately identifying and taking such information into account, the results of cross-modal retrieval can be strongly improved. The major work in this thesis aims to cope with the separation between modalities on the joint space to enhance the performance of cross-modal tasks. We propose two representation methods for bi-modal or uni-modal documents that aggregate information from both the visual and textual modalities projected on the joint space. Specifically, for uni-modal documents we suggest a completion process relying on an auxiliary dataset to find the corresponding information in the absent modality and then use such information to build a final bi-modal representation for a uni-modal document. Evaluations show that our approaches achieve state-of-the-art results on several standard and challenging datasets for cross-modal retrieval and for bi-modal and cross-modal classification.
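A minimal sketch of the common-space idea discussed above, using plain linear CCA from scikit-learn rather than the kernelised variant named in the abstract, with random feature matrices standing in for real image and text descriptors:

```python
# Hedged sketch: learn a joint image-text space with CCA and run a
# cross-modal (text-to-image) query by cosine similarity. Random features
# are placeholders for real visual and textual descriptors.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
n_pairs = 200
img_feats = rng.normal(size=(n_pairs, 64))    # e.g. visual descriptors
txt_feats = rng.normal(size=(n_pairs, 32))    # e.g. textual descriptors

cca = CCA(n_components=10)
cca.fit(img_feats, txt_feats)

# Project both modalities into the shared space
img_z, txt_z = cca.transform(img_feats, txt_feats)

# Text-to-image retrieval: rank images by similarity to a text query
query_txt = txt_feats[:1]
_, query_z = cca.transform(img_feats[:1], query_txt)  # only the text projection is used
scores = cosine_similarity(query_z, img_z)[0]
print("top-5 images:", np.argsort(-scores)[:5])
```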
APA, Harvard, Vancouver, ISO, and other styles
28

Bursuc, Andrei. "Indexation et recherche de contenus par objet visuel." Phd thesis, Ecole Nationale Supérieure des Mines de Paris, 2012. http://pastel.archives-ouvertes.fr/pastel-00873966.

Full text
Abstract:
Content-based retrieval of video objects is an increasingly difficult problem and is becoming a mandatory component of video search engines. This thesis presents a framework for retrieving user-defined video objects and makes two main contributions. The first contribution, called DOOR (Dynamic Object Oriented Retrieval), is a methodological framework for retrieving instances of video objects selected by a user, while the second contribution concerns the support offered for video search, namely video browsing, the video retrieval system and its interface, with the underlying architecture. In the DOOR framework, the object has a hybrid representation obtained by over-segmenting the frames, consolidated by building adjacency graphs and by aggregating interest points. Identifying object instances across several videos is formulated as an energy-optimisation problem that approximates an NP-hard task. Candidate objects are sub-graphs that yield an optimal energy with respect to the user-defined query. Four optimisation strategies are proposed: greedy, relaxed greedy, simulated annealing and GraphCut. The object representation is further improved by aggregating interest points into the hybrid representation, where the similarity measure relies on a spectral technique integrating several types of descriptors. The DOOR framework scales to large video archives through the use of a bag-of-words representation, enriched with a query-definition and query-expansion algorithm based on a multimodal approach combining text, image and video. The proposed techniques are evaluated on several TRECVID test corpora, which demonstrate their effectiveness. The second contribution, OVIDIUS (On-line VIDeo Indexing Universal System), is an on-line platform for video browsing and retrieval that integrates the DOOR framework. The contributions of this platform concern the support provided to users for video search: browsing, retrieval and the graphical interface. The OVIDIUS platform offers hierarchical navigation functionality that exploits the MPEG-7 standard for the structural description of video content. The major advantage of the proposed architecture is its modular structure, which allows the system to be deployed on different terminals (fixed and mobile), independently of the operating systems involved. The choice of technologies for each of the platform's component modules is justified with respect to other technological options.
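A minimal sketch of the greedy strategy mentioned above, assuming each candidate region of a frame has a descriptor and an adjacency list, and taking the "energy" to be nothing more than the distance between the grown sub-graph's averaged descriptor and the query; this is an illustrative simplification, not DOOR's actual energy model:

```python
# Hedged sketch: greedily grow a sub-graph of adjacent regions whose averaged
# descriptor best matches the query descriptor (toy stand-in for DOOR's energy).
import numpy as np

rng = np.random.default_rng(1)
descriptors = rng.normal(size=(6, 8))        # one descriptor per over-segmented region
adjacency = {0: {1, 2}, 1: {0, 3}, 2: {0, 4}, 3: {1}, 4: {2, 5}, 5: {4}}
query = descriptors[1] * 0.9 + 0.1           # a query close to region 1

def energy(regions):
    return float(np.linalg.norm(descriptors[list(regions)].mean(axis=0) - query))

# Greedy: start from the best single region, add a neighbour while the energy drops
current = {min(range(len(descriptors)), key=lambda r: energy({r}))}
improved = True
while improved:
    improved = False
    frontier = set().union(*(adjacency[r] for r in current)) - current
    for r in sorted(frontier, key=lambda r: energy(current | {r})):
        if energy(current | {r}) < energy(current):
            current.add(r)
            improved = True
            break

print("selected regions:", sorted(current), "energy:", round(energy(current), 3))
```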
APA, Harvard, Vancouver, ISO, and other styles
29

Pinho, Eduardo Miguel Coutinho Gomes de. "Multimodal information retrieval in medical imaging archives." Doctoral thesis, 2019. http://hdl.handle.net/10773/29206.

Full text
Abstract:
The proliferation of digital medical imaging modalities in hospitals and other diagnostic facilities has created huge repositories of valuable data, often not fully explored. Moreover, the past few years show a growing trend of data production. As such, studying new ways to index, process and retrieve medical images becomes an important subject to be addressed by the wider community of radiologists, scientists and engineers. Content-based image retrieval, which encompasses various methods, can exploit the visual information of a medical imaging archive, and is known to be beneficial to practitioners and researchers. However, the integration of the latest systems for medical image retrieval into clinical workflows is still rare, and their effectiveness still shows room for improvement. This thesis proposes solutions and methods for multimodal information retrieval in the context of medical imaging repositories. The major contributions are a search engine for medical imaging studies supporting multimodal queries in an extensible archive; a framework for automated labeling of medical images for content discovery; and an assessment and proposal of feature learning techniques for concept detection from medical images, exhibiting greater potential than the feature extraction algorithms previously relevant in similar tasks. These contributions, each in its own dimension, seek to narrow the scientific and technical gap towards the development and adoption of novel multimodal medical image retrieval systems, so that they ultimately become part of the workflows of medical practitioners, teachers, and researchers in healthcare.
Programa Doutoral em Informática
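A minimal sketch of one way a multimodal query over a medical imaging archive could combine text-metadata matching with visual similarity; the late-fusion weight, the scoring functions and the toy archive are illustrative assumptions, not the engine described in the thesis:

```python
# Hedged sketch: late fusion of a text (metadata) score and a visual score
# for a multimodal query over a small toy image archive.
import numpy as np

archive = {
    "study-001": {"report": "chest x-ray, suspected pneumonia", "feat": np.array([0.9, 0.1])},
    "study-002": {"report": "knee mri, ligament tear",          "feat": np.array([0.1, 0.8])},
    "study-003": {"report": "chest ct, nodule follow-up",       "feat": np.array([0.8, 0.3])},
}

def text_score(query_terms, report):
    terms = report.lower().replace(",", "").split()
    return sum(t in terms for t in query_terms) / len(query_terms)

def visual_score(query_feat, feat):
    return float(feat @ query_feat / (np.linalg.norm(feat) * np.linalg.norm(query_feat)))

def multimodal_query(query_terms, query_feat, alpha=0.5):   # alpha is an assumed weight
    ranked = []
    for sid, item in archive.items():
        s = alpha * text_score(query_terms, item["report"]) \
            + (1 - alpha) * visual_score(query_feat, item["feat"])
        ranked.append((sid, round(s, 3)))
    return sorted(ranked, key=lambda x: -x[1])

print(multimodal_query(["chest", "x-ray"], np.array([1.0, 0.2])))
```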
APA, Harvard, Vancouver, ISO, and other styles
30

Duan, Lingyu. "Multimodal mid-level representations for semantic analysis of broadcast video." Thesis, 2008. http://hdl.handle.net/1959.13/25819.

Full text
Abstract:
Research Doctorate - Doctor of Philosophy (PhD)
This thesis investigates the problem of seeking multimodal mid-level representations for semantic analysis of broadcast video. The problem is of interest as humans tend to use high-level semantic concepts when querying and browsing ever-increasing multimedia databases, yet generic low-level content metadata available from automated processing deals only with representing perceived content, but not its semantics. Multimodal mid-level representations refer to intermediate representations of multimedia signals that make various kinds of knowledge explicit and that expose various kinds of constraints within the context and knowledge assumed by the analysis system. Semantic multimedia analysis tries to establish the links from the feature descriptors and the syntactic elements to the domain semantics. The goal of this thesis is to devise a mid-level representation framework for detecting semantics from broadcast video, using supervised and data-driven approaches to represent domain knowledge in a manner that facilitates inferencing, i.e., answering the questions asked by higher-level analysis. In our framework, we attempt to address three sub-problems: context-dependent feature extraction, semantic video shot classification, and integration of multimodal cues towards semantic analysis. We propose novel models for the representations of low-level multimedia features. We employ dominant modes in the feature space to characterize color and motion in a nonparametric manner. With the combined use of data-driven mode seeking and supervised learning, we are able to capture contextual information of broadcast video and yield semantically meaningful color and motion features. We present the novel concepts of semantic video shot classes towards an effective approach for reverse engineering of the broadcast video capturing and editing processes. Such concepts link the computational representations of low-level multimedia features with video shot size and the main subject within a shot in the broadcast video stream. The linking, subject to the domain constraints, is achieved by statistical learning. We develop solutions for detecting sports events and classifying commercial spots from broadcast video streams. This is realized by integrating multiple modalities, in particular the text-based external resources. The alignment across modalities is based on semantic video shot classes. With multimodal mid-level representations, we are able to automatically extract rich semantics from sports programs and commercial spots, with promising accuracies. These findings demonstrate the potential of our framework of constructing mid-level representations to narrow the semantic gap, and it has a broad outlook for adapting to new content domains.
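A minimal sketch of nonparametric mode seeking for dominant colours, in the spirit of the dominant-mode features described above; the pixel data is synthetic and the bandwidth is an assumed value, so this is only an illustration of the general technique:

```python
# Hedged sketch: mean-shift mode seeking to find dominant colour modes of a
# frame, a nonparametric stand-in for the dominant-mode features above.
import numpy as np
from sklearn.cluster import MeanShift

rng = np.random.default_rng(2)
# Toy "frame": pixels drawn around two dominant colours (green pitch, white kit)
pitch = rng.normal(loc=[40, 160, 60], scale=10, size=(300, 3))
kit = rng.normal(loc=[230, 230, 230], scale=10, size=(100, 3))
pixels = np.clip(np.vstack([pitch, kit]), 0, 255)

ms = MeanShift(bandwidth=30)          # assumed bandwidth
ms.fit(pixels)
for label, centre in enumerate(ms.cluster_centers_):
    share = np.mean(ms.labels_ == label)
    print(f"mode {label}: colour ~= {centre.round(0)}, share = {share:.2f}")
```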
APA, Harvard, Vancouver, ISO, and other styles
31

Duan, Lingyu. "Multimodal mid-level representations for semantic analysis of broadcast video." 2008. http://hdl.handle.net/1959.13/25819.

Full text
Abstract:
Research Doctorate - Doctor of Philosophy (PhD)
This thesis investigates the problem of seeking multimodal mid-level representations for semantic analysis of broadcast video. The problem is of interest as humans tend to use high-level semantic concepts when querying and browsing ever-increasing multimedia databases, yet generic low-level content metadata available from automated processing deals only with representing perceived content, but not its semantics. Multimodal mid-level representations refer to intermediate representations of multimedia signals that make various kinds of knowledge explicit and that expose various kinds of constraints within the context and knowledge assumed by the analysis system. Semantic multimedia analysis tries to establish the links from the feature descriptors and the syntactic elements to the domain semantics. The goal of this thesis is to devise a mid-level representation framework for detecting semantics from broadcast video, using supervised and data-driven approaches to represent domain knowledge in a manner that facilitates inferencing, i.e., answering the questions asked by higher-level analysis. In our framework, we attempt to address three sub-problems: context-dependent feature extraction, semantic video shot classification, and integration of multimodal cues towards semantic analysis. We propose novel models for the representations of low-level multimedia features. We employ dominant modes in the feature space to characterize color and motion in a nonparametric manner. With the combined use of data-driven mode seeking and supervised learning, we are able to capture contextual information of broadcast video and yield semantically meaningful color and motion features. We present the novel concepts of semantic video shot classes towards an effective approach for reverse engineering of the broadcast video capturing and editing processes. Such concepts link the computational representations of low-level multimedia features with video shot size and the main subject within a shot in the broadcast video stream. The linking, subject to the domain constraints, is achieved by statistical learning. We develop solutions for detecting sports events and classifying commercial spots from broadcast video streams. This is realized by integrating multiple modalities, in particular the text-based external resources. The alignment across modalities is based on semantic video shot classes. With multimodal mid-level representations, we are able to automatically extract rich semantics from sports programs and commercial spots, with promising accuracies. These findings demonstrate the potential of our framework of constructing mid-level representations to narrow the semantic gap, and it has a broad outlook for adapting to new content domains.
APA, Harvard, Vancouver, ISO, and other styles
32

Lu, Hung-Tsung, and 盧宏宗. "Semantic Retrieval of Personal Photos Using Multimodal Deep Autoencoder Fusing Visual and Speech Features." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/58fvxy.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Mourão, André Belchior. "Towards an Architecture for Efficient Distributed Search of Multimodal Information." Doctoral thesis, 2018. http://hdl.handle.net/10362/38850.

Full text
Abstract:
The creation of very large-scale multimedia search engines, with more than one billion images and videos, is a pressing need of digital societies where data is generated by multiple connected devices. Distributing search indexes in cloud environments is the inevitable solution to deal with the increasing scale of image and video collections. The distribution of such indexes in this setting raises multiple challenges, such as the even partitioning of the data space, load balancing across index nodes and the fusion of the results computed over multiple nodes. The main question behind this thesis is how to reduce and distribute the computational complexity of multimedia retrieval. This thesis studies the extension of sparse hash inverted indexing to distributed settings. The main goal is to ensure that indexes are uniformly distributed across computing nodes while keeping similar documents on the same nodes. Load balancing is performed at both node and index level, to guarantee that the retrieval process is not delayed by nodes that have to inspect larger subsets of the index. Multimodal search requires the combination of the search results from individual modalities and document features. This thesis studies rank fusion techniques focused on reducing complexity by automatically selecting only the features that improve retrieval effectiveness. The achievements of this thesis span both distributed indexing and rank fusion research. Experiments across multiple datasets show that sparse hashes can be used to distribute documents and queries across index entries in a balanced and redundant manner across nodes. Rank fusion results show that it is possible to reduce retrieval complexity and improve efficiency by searching only a subset of the feature indexes.
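A minimal sketch of rank fusion over per-feature result lists, here using reciprocal rank fusion as a simple stand-in for the selective fusion techniques studied in the thesis; the documents, rankings and the constant k are illustrative:

```python
# Hedged sketch: fuse ranked lists produced by different feature indexes with
# reciprocal rank fusion (RRF), a simple stand-in for the thesis's rank fusion.
from collections import defaultdict

# Hypothetical ranked result lists, one per feature/modality index
rankings = {
    "colour": ["d3", "d1", "d7", "d2"],
    "sift":   ["d1", "d3", "d5", "d9"],
    "text":   ["d1", "d2", "d3", "d8"],
}

def rrf(rank_lists, k=60):           # k=60 is a commonly used RRF constant
    scores = defaultdict(float)
    for ranking in rank_lists.values():
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

print(rrf(rankings)[:5])
# A selective variant could drop feature indexes whose lists do not improve a
# validation metric before fusing, reducing the number of indexes searched.
```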
APA, Harvard, Vancouver, ISO, and other styles
34

Carvalho, José Ricardo de Abreu. "Pesquisa multimodal de imagens em dispositivos móveis." Master's thesis, 2021. http://hdl.handle.net/10400.13/3984.

Full text
Abstract:
Despite the evolution of the field of reverse image search, with algorithms becoming ever more robust and effective, there is still interest in improving search techniques and the user experience of finding the images the user has in mind. The main goal of this work was to develop an application for mobile devices (smartphones) that allows the user to find images through multimodal inputs. Thus, in addition to proposing searches through several modes (keywords, drawing/sketching, and images from the camera or already on the device), this dissertation proposes that the user can create an image from scratch by drawing, or edit/change an existing image, receiving feedback at each change/interaction. Throughout the search experience, the user can reuse the images found (those considered relevant) and progressively refine the search through this editing, converging on what they expect to find. The implementation of this proposal was based on the Google Cloud Vision API, responsible for obtaining results from image input, the Google Custom Search API for obtaining images from text input, and the ATsketchkit framework, which allowed the creation of drawings, on Apple's iOS system. Tests were carried out with a set of users with different levels of experience in image search and drawing ability, allowing us to assess their preference among the different input methods, their satisfaction with the retrieved images, and the usability of the prototype.
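A minimal sketch of the image-to-label-to-search flow described above, written in Python rather than the iOS stack used in the dissertation; the API key, search-engine ID and file name are placeholders, and the calls follow the public Cloud Vision Python client and Custom Search REST API as generally documented:

```python
# Hedged sketch: label an input image with the Cloud Vision API, then feed the
# top label to the Custom Search API to retrieve related images.
import requests
from google.cloud import vision

API_KEY = "YOUR_API_KEY"          # placeholder
SEARCH_ENGINE_ID = "YOUR_CX_ID"   # placeholder

def labels_for_image(path):
    client = vision.ImageAnnotatorClient()
    with open(path, "rb") as f:
        image = vision.Image(content=f.read())
    response = client.label_detection(image=image)
    return [(label.description, label.score) for label in response.label_annotations]

def image_search(query, num=5):
    r = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": SEARCH_ENGINE_ID, "q": query,
                "searchType": "image", "num": num},
    )
    return [item["link"] for item in r.json().get("items", [])]

if __name__ == "__main__":
    labels = labels_for_image("sketch.png")   # e.g. a user drawing exported as PNG
    if labels:
        best_label = labels[0][0]
        print(best_label, image_search(best_label))
```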
APA, Harvard, Vancouver, ISO, and other styles