
Dissertations / Theses on the topic 'Multimodal embedding and retrieval'



Consult the top 48 dissertations / theses for your research on the topic 'Multimodal embedding and retrieval.'

Next to every source in the list of references there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Rubio Romano, Antonio. "Fashion discovery: a computer vision approach." Doctoral thesis, TDX (Tesis Doctorals en Xarxa), 2021. http://hdl.handle.net/10803/672423.

Full text
Abstract:
Performing semantic interpretation of fashion images is undeniably one of the most challenging domains for computer vision. Subtle variations in color and shape can confer different meanings or interpretations on an image. Not only is it a domain tightly coupled with human understanding, but also with scene interpretation and context. Being able to extract fashion-specific information from images and to interpret it properly can be useful in many situations and can help in understanding the underlying information in an image. Fashion is also one of the most important businesses around the world, with an estimated value of 3 trillion dollars and a constantly growing online market, which increases the utility of image-based algorithms for searching, classifying or recommending garments. This doctoral thesis aims to solve specific problems related to the treatment of fashion e-commerce data, from low-level pure pixel information to high-level abstract conclusions about the garments appearing in an image, taking advantage of the multi-modality of the available data in some of the solutions. The contributions include:
- A new superpixel extraction method focused on improving the annotation process for clothing images.
- The construction of a joint image and text embedding for fashion data.
- The application of this embedding space to the task of retrieving the main product in an image showing a complete outfit.
In summary, fashion is a complex computer vision and machine learning problem at many levels, and developing specific algorithms able to capture essential information from pictures and text is not trivial. In order to solve some of the challenges the domain poses, and taking into account that this is an Industrial Ph.D., we contribute a variety of solutions that can boost the performance of many tasks useful for the fashion e-commerce industry.
Automàtica, robòtica i visió
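As a rough illustration of the joint image-text embedding described above, the sketch below pairs an image tower and a text tower in a shared space and trains them with a hinge-based ranking loss over in-batch negatives. The dimensions and the loss form are generic assumptions, not the author's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Two-tower model mapping image and text features into a shared space."""
    def __init__(self, img_dim=2048, txt_dim=300, emb_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)  # e.g. CNN features in
        self.txt_proj = nn.Linear(txt_dim, emb_dim)  # e.g. pooled word vectors in

    def forward(self, img_feat, txt_feat):
        # L2-normalize so cosine similarity reduces to a dot product
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return v, t

def ranking_loss(v, t, margin=0.2):
    """Hinge ranking loss: matching pairs sit on the diagonal of the score matrix."""
    scores = v @ t.t()
    pos = scores.diag().unsqueeze(1)
    cost_txt = (margin + scores - pos).clamp(min=0)      # image vs. wrong captions
    cost_img = (margin + scores - pos.t()).clamp(min=0)  # caption vs. wrong images
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return cost_txt.masked_fill(mask, 0).mean() + cost_img.masked_fill(mask, 0).mean()

model = JointEmbedding()
v, t = model(torch.randn(32, 2048), torch.randn(32, 300))
loss = ranking_loss(v, t)  # at retrieval time, rank one modality by v @ t.t()
```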
APA, Harvard, Vancouver, ISO, and other styles
2

Engilberge, Martin. "Deep Inside Visual-Semantic Embeddings." Electronic Thesis or Diss., Sorbonne université, 2020. http://www.theses.fr/2020SORUS150.

Full text
Abstract:
Nowadays Artificial Intelligence (AI) is omnipresent in our society. The recent development of learning methods based on deep neural networks, also called "deep learning", has led to a significant improvement in visual and textual representation models. In this thesis, we aim to further advance image representation and understanding. Revolving around Visual Semantic Embedding (VSE) approaches, we explore different directions: we present relevant background covering image and text representation and existing multimodal approaches; we propose novel architectures that further improve the retrieval capability of VSE; and we extend VSE models to novel applications, leveraging embedding models to visually ground semantic concepts. Finally, we delve into the learning process, and in particular the loss function, by learning a differentiable approximation of ranking-based metrics.
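The final contribution, a differentiable approximation of rank-based evaluation functions, can be motivated with a simpler hand-crafted smooth rank in which hard rank positions are replaced by sums of sigmoids; the thesis learns such an approximation with a trained network instead, so the sketch below only conveys the underlying idea.

```python
import torch

def soft_rank(scores, tau=0.01):
    """Differentiable rank approximation: rank_i ~ 1 + sum_{j!=i} sigmoid((s_j - s_i)/tau).

    As tau -> 0 this approaches the hard ranks (1 = best); unlike sorting,
    it is differentiable with respect to the scores.
    """
    diff = scores.unsqueeze(0) - scores.unsqueeze(1)   # diff[i, j] = s_j - s_i
    return 0.5 + torch.sigmoid(diff / tau).sum(dim=1)  # self-term sigmoid(0) = 0.5

scores = torch.tensor([0.2, 0.9, 0.5], requires_grad=True)
ranks = soft_rank(scores)                               # ~ [3., 1., 2.]
loss = ((ranks - torch.tensor([3.0, 1.0, 2.0])) ** 2).mean()
loss.backward()                                         # gradients flow to the scores
```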
APA, Harvard, Vancouver, ISO, and other styles
3

Adebayo, Kolawole John. "Multimodal Legal Information Retrieval." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2018. http://amsdottorato.unibo.it/8634/1/ADEBAYO-JOHN-tesi.pdf.

Full text
Abstract:
The goal of this thesis is to present a multifaceted way of inducing semantic representations from legal documents and of accessing information in a precise and timely manner. The thesis explores approaches for semantic information retrieval (IR) in the legal context with a technique that maps specific parts of a text to the relevant concepts. The technique segments the text using Latent Dirichlet Allocation (LDA), a topic modeling algorithm, expands the concepts using Natural Language Processing techniques, and then associates the text segments with the concepts using a semi-supervised text similarity technique. This addresses two problems, user specificity in formulating queries and information overload, since querying a large document collection with a set of concepts is more fine-grained: specific information, rather than full documents, is retrieved. The second part of the thesis describes our Neural Network Relevance Model for E-Discovery Information Retrieval. Our algorithm is essentially a feature-rich ensemble system with different component neural networks extracting different relevance signals. The model has been trained and evaluated on the TREC Legal track 2010 data. The performance of our models across the board shows that they capture the semantics and relatedness between query and document, which is important in the legal information retrieval domain.
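A toy rendering of the segment-to-concept association step: gensim's LDA puts text segments and a keyword-described concept into a common topic space, and the association is scored with cosine similarity. The corpus, the concept keywords and the plain cosine scoring are illustrative stand-ins for the semi-supervised similarity technique the thesis develops.

```python
from gensim import corpora, models
from gensim.matutils import cossim

# Toy corpus: each "segment" is a tokenized passage of a legal document.
segments = [
    "the lessee shall pay the rent monthly in advance".split(),
    "either party may terminate the agreement with written notice".split(),
    "the tenant must maintain the premises in good repair".split(),
]
dictionary = corpora.Dictionary(segments)
bows = [dictionary.doc2bow(s) for s in segments]
lda = models.LdaModel(bows, num_topics=2, id2word=dictionary,
                      passes=10, random_state=0)

# A concept expanded into a few keywords (e.g. via NLP/thesaurus expansion)
concept = dictionary.doc2bow("terminate termination notice agreement".split())
concept_topics = lda[concept]

# Associate each segment with the concept via similarity in topic space
for i, bow in enumerate(bows):
    print(f"segment {i}: {cossim(lda[bow], concept_topics):.3f}")
```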
APA, Harvard, Vancouver, ISO, and other styles
4

Chen, Jianan. "Deep Learning Based Multimodal Retrieval." Electronic Thesis or Diss., Rennes, INSA, 2023. http://www.theses.fr/2023ISAR0019.

Full text
Abstract:
Multimodal tasks play a crucial role in the progression towards achieving general artificial intelligence (AI). The primary goal of multimodal retrieval is to employ machine learning algorithms to extract relevant semantic information, bridging the gap between different modalities such as visual images, linguistic text, and other data sources. It is worth noting that the information entropy associated with heterogeneous data for the same high-level semantics varies significantly, posing a substantial challenge for multimodal models. Deep learning-based multimodal network models provide an effective solution to the difficulties arising from these differences in information entropy. They exhibit impressive accuracy and stability in large-scale cross-modal information matching tasks, such as image-text retrieval, and demonstrate strong transfer learning capabilities, enabling a model well trained on one multimodal task to be fine-tuned and applied to a new multimodal task, even in few-shot or zero-shot scenarios. In our research, we develop a novel generative multimodal multi-view database specifically designed for the multimodal referential segmentation task. Additionally, we establish a state-of-the-art (SOTA) benchmark and a multi-view metric for referring expression segmentation models in the multimodal domain. The results of our comparative experiments are presented visually, providing clear and comprehensive insights.
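The transfer-learning ability described above is what pretrained vision-language models now provide off the shelf. Below is a minimal image-text matching sketch using a pretrained CLIP model through the Hugging Face transformers library; the model name is a public checkpoint, while the query image path is a placeholder assumption.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a dog playing in the snow", "a red car parked on the street"]
image = Image.open("query.jpg")  # placeholder query image

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher logit = better image-text match; softmax turns logits into retrieval scores
print(outputs.logits_per_image.softmax(dim=-1))
```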
APA, Harvard, Vancouver, ISO, and other styles
5

Böckmann, Christine, Jens Biele, Roland Neuber, and Jenny Niebsch. "Retrieval of multimodal aerosol size distribution by inversion of multiwavelength data." Universität Potsdam, 1997. http://opus.kobv.de/ubp/volltexte/2007/1436/.

Full text
Abstract:
The ill-posed problem of determining an aerosol size distribution from a small number of backscatter and extinction measurements was solved successfully with a mollifier method, which is advantageous since the ill-posed part is performed on exactly given quantities, and the points r where n(r) is evaluated may be freely selected. A new two-dimensional model for the troposphere is proposed.
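In the usual formulation of this inverse problem, each optical measurement is a Fredholm integral of the size distribution, and the mollifier method recovers smoothed point values of n(r) directly from those integrals. The notation below is assumed for illustration rather than taken from the paper:

```latex
% Forward model: N backscatter/extinction coefficients, kernels Q_k from Mie theory
\Gamma_k = \int_{r_{\min}}^{r_{\max}} Q_k(r)\, n(r)\, \mathrm{d}r , \qquad k = 1, \dots, N .
% Mollified reconstruction at a freely chosen evaluation point r_0:
n_\gamma(r_0) = \int e_\gamma(r, r_0)\, n(r)\, \mathrm{d}r \approx \sum_{k=1}^{N} c_k(r_0)\, \Gamma_k ,
\quad \text{where } \sum_{k=1}^{N} c_k(r_0)\, Q_k(r) \approx e_\gamma(r, r_0) .
```

Because the coefficients c_k are computed from the exactly known kernels Q_k (the ill-posed part), and only then applied to the measured values, the evaluation points r_0 can indeed be chosen freely.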
APA, Harvard, Vancouver, ISO, and other styles
6

Zhu, Meng. "Cross-modal semantic-associative labelling, indexing and retrieval of multimodal data." Thesis, University of Reading, 2010. http://centaur.reading.ac.uk/24828/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Kahn, Itamar. "Remembering the past : multimodal imaging of cortical contributions to episodic retrieval." Thesis, Massachusetts Institute of Technology, 2005. http://hdl.handle.net/1721.1/33171.

Full text
Abstract:
What is the nature of the neural processes that allow humans to remember past events? The theoretical framework adopted in this thesis builds upon cognitive models that suggest that episodic retrieval can be decomposed into two classes of computations: (1) recovery processes that serve to reactivate stored memories, making information from a past episode readily available, and (2) control processes that serve to guide the retrieval attempt and monitor/evaluate information arising from the recovery processes. A multimodal imaging approach that combined fMRI and MEG was adopted to gain insight into the spatial and temporal brain mechanisms supporting episodic retrieval. Chapter 1 reviews major findings and theories in the episodic retrieval literature, grounding the open questions and controversies within the suggested framework. Chapter 2 describes an fMRI and MEG experiment that identified medial temporal cortical structures that signal item memory strength, thus supporting the perception of item familiarity. Chapter 3 describes an fMRI experiment that demonstrated that retrieval of contextual details involves reactivation of neural patterns engaged at encoding. Further, leveraging this pattern of reactivation, it was demonstrated that false recognition may be accompanied by recollection. The fMRI experiment reported in Chapter 3, when combined with an MEG experiment reported in Chapter 4, directly addressed questions regarding the control processes engaged during episodic retrieval. In particular, Chapter 3 showed that parietal and prefrontal cortices contribute to controlling the act of arriving at a retrieval decision. Chapter 4 then illuminates the temporal characteristics of parietal activation during episodic retrieval, providing novel evidence about the nature of parietal responses and thus constraints on theories of parietal involvement in episodic retrieval. The conducted research targeted distinct aspects of the multi-faceted act of remembering the past. The obtained data contribute to the building of an anatomical and temporal "blueprint" documenting the cascade of neural events that unfold during attempts to remember, as well as when such attempts are met with success or lead to memory errors. In the course of framing this research within the context of cognitive models of retrieval, the obtained neural data reflect back on and constrain these theories of remembering.
APA, Harvard, Vancouver, ISO, and other styles
8

Nag, Chowdhury Sreyasi [Verfasser]. "Text-image synergy for multimodal retrieval and annotation / Sreyasi Nag Chowdhury." Saarbrücken : Saarländische Universitäts- und Landesbibliothek, 2021. http://d-nb.info/1240674139/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Luqman, Muhammad Muzzamil. "Fuzzy multilevel graph embedding for recognition, indexing and retrieval of graphic document images." Thesis, Tours, 2012. http://www.theses.fr/2012TOUR4005/document.

Full text
Abstract:
This thesis addresses the lack of efficient computational tools for graph-based structural pattern recognition and proposes to exploit the computational strength of statistical pattern recognition. It makes two main contributions. The first is a new method of explicit graph embedding. The proposed method performs a multilevel analysis of the graph, extracting graph-level, structural-level and elementary-level information, and embeds this information into a numeric feature vector. It employs fuzzy overlapping trapezoidal intervals to address the noise sensitivity of graph representations and to minimize the information loss when mapping from continuous graph space to discrete vector space. The method has unsupervised learning abilities and is capable of automatically adapting its parameters to the underlying graph dataset. The second contribution is a framework for the automatic indexing of graph repositories for graph retrieval and subgraph spotting. This framework exploits explicit graph embedding for representing the cliques of order 2 by numeric feature vectors, together with classification and clustering tools, for automatically indexing a graph repository. It does not require a labeled learning set and can be easily deployed in a range of application domains, offering the ease of query by example (QBE) and the granularity of focused retrieval.
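A minimal sketch of the multilevel idea: graph-level counts are concatenated with a fuzzy histogram of node degrees, so that a degree near an interval boundary contributes partially to both bins. The networkx graph and the hand-picked trapezoidal intervals are assumptions for illustration; the thesis adapts the intervals to the dataset without supervision and also covers structural and elementary attributes.

```python
import numpy as np
import networkx as nx

def trapezoid(x, a, b, c, d):
    """Fuzzy trapezoidal membership: 1 on [b, c], linear ramps on [a, b] and [c, d]."""
    x = np.asarray(x, dtype=float)
    up = np.clip((x - a) / max(b - a, 1e-9), 0.0, 1.0)
    down = np.clip((d - x) / max(d - c, 1e-9), 0.0, 1.0)
    return np.minimum(up, down)

def embed_graph(g, intervals=((-1, 0, 1, 2), (1, 2, 4, 5), (4, 5, 12, 14))):
    """Embed a graph as [node count, edge count, fuzzy degree-histogram bins]."""
    degrees = np.array([d for _, d in g.degree()])
    fuzzy_hist = [trapezoid(degrees, *iv).sum() for iv in intervals]
    return np.array([g.number_of_nodes(), g.number_of_edges(), *fuzzy_hist])

print(embed_graph(nx.karate_club_graph()))
```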
APA, Harvard, Vancouver, ISO, and other styles
10

Lolich, María, and Susana Azzollini. "Phenomenological retrieval style of autobiographical memories in a sample of major depressed individuals." Pontificia Universidad Católica del Perú, 2016. http://repositorio.pucp.edu.pe/index/handle/123456789/99894.

Full text
Abstract:
Autobiographical memory retrieval involves different phenomenological features. Given the lack of previous work in Spanish-speaking populations, 34 in-depth interviews were carried out with individuals with and without Major Depressive Disorder in Buenos Aires, Argentina. Phenomenological components present during the evocation of autobiographical memories were explored. Data were analyzed qualitatively using Grounded Theory. During the descriptive analysis, seven phenomenological categories were detected as emerging from the discourse. The axial and selective analyses revealed two main discursive axes: rhetorical-propositional and specificity-generality. The implications for affective regulation processes, derived from the assumption of an amodal or multimodal style of processing autobiographical information, merit further attention.
APA, Harvard, Vancouver, ISO, and other styles
11

Valero-Mas, Jose J. "Towards Interactive Multimodal Music Transcription." Doctoral thesis, Universidad de Alicante, 2017. http://hdl.handle.net/10045/71275.

Full text
Abstract:
Computer-based music transcription is of vital importance for tasks in the so-called field of Music Information Retrieval, as a process for obtaining a symbolic abstraction that encodes the musical content of an audio file. This dissertation studies the problem from a perspective different from the one typically considered for these problems: the interactive and multimodal perspective. In this paradigm the user becomes especially important, being an active part of the solution of the problem (interactivity); multimodality, in turn, means that different sources of information extracted from the same signal are combined to help solve the task better.
APA, Harvard, Vancouver, ISO, and other styles
12

Quack, Till. "Large scale mining and retrieval of visual data in a multimodal context." Konstanz: Hartung-Gorre, 2009. http://d-nb.info/993614620/04.

Full text
APA, Harvard, Vancouver, ISO, and other styles
13

Saragiotis, Panagiotis. "Cross-modal classification and retrieval of multimodal data using combinations of neural networks." Thesis, University of Surrey, 2006. http://epubs.surrey.ac.uk/843338/.

Full text
Abstract:
Current neurobiological thinking, supported in part by experimentation, stresses the importance of cross-modality. Uni-modal cognitive tasks, language and vision, for example, are performed with the help of many networks working simultaneously or sequentially; and for cross-modal tasks, like picture/object naming and word illustration, the output of these networks is combined to produce higher cognitive behaviour. The notion of multi-net processing is used typically in the pattern recognition literature, where ensemble networks of weak classifiers - typically supervised - appear to outperform strong classifiers. We have built a system, based on combinations of neural networks, that demonstrates how cross-modal classification can be used to retrieve multi-modal data using one of the available modalities of information. Two multi-net systems were used in this work: one comprising Kohonen SOMs that interact with each other via a Hebbian network, and a fuzzy ARTMAP network where the interaction is through the embedded map field. The multi-nets were used for the cross-modal retrieval of images given keywords and for finding appropriate keywords for an image. The systems were trained on two publicly available image databases that had collateral annotations on the images: the Hemera collection, comprising images of pre-segmented single objects, and the Corel collection, with images of multiple objects; both were used for automatically generating various sets of input vectors. We have attempted to develop a method for evaluating the performance of multi-net systems using a monolithic network trained on modally-undifferentiated vectors as an intuitive benchmark. To this end, single SOM and fuzzy ART networks were trained using a concatenated visual/linguistic vector to compare the performance of multi-net systems with typical monolithic systems. Both multi-nets outperform the respective monolithic systems in terms of the information retrieval measures of precision and recall on test images drawn from both datasets; the SOM multi-net outperforms the fuzzy ARTMAP both in terms of convergence and precision-recall. The performance of the SOM-based multi-net in retrieval, classification and auto-annotation is on a par with that of state-of-the-art systems like ALIP and Blobworld. Much of the neural-network-based simulation reported in the literature uses supervised learning algorithms, which are suited to settings where classes of objects are predefined and objects in themselves are quite unique in terms of their attributes. We have compared the performance of our multi-net systems with that of a multi-layer perceptron (MLP). The MLP does show substantially greater precision and recall on a (fixed) class of objects when compared with our unsupervised systems. However, when 'lesioned' - the network connectivity deliberately damaged - the multi-net systems show a greater degree of robustness. Cross-modal systems appear to hold considerable intellectual and commercial potential, and the multi-net approach facilitates the simulation of such systems.
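The cross-modal core of such a system is the associative link between two unimodal maps. The sketch below compresses the SOM training away and keeps only the Hebbian linkage between co-active cells of two stand-in maps; map sizes and data are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in winner activations of a visual map and a text map (one-hot per example)
N, VIS_CELLS, TXT_CELLS = 500, 16, 12
winners = rng.integers(0, TXT_CELLS, N)
vis = np.eye(VIS_CELLS)[winners % VIS_CELLS]  # correlated modalities
txt = np.eye(TXT_CELLS)[winners]

# Hebbian learning: strengthen links between co-active cells of the two maps
W = np.zeros((VIS_CELLS, TXT_CELLS))
lr = 0.1
for v, t in zip(vis, txt):
    W += lr * np.outer(v, t)

# Cross-modal retrieval: activate a visual cell, read out the associated text cell
query = np.eye(VIS_CELLS)[3]
print((query @ W).argmax())  # best-matching text cell for visual cell 3
```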
APA, Harvard, Vancouver, ISO, and other styles
14

Fedel, Gabriel de Souza. "Busca multimodal para apoio à pesquisa em biodiversidade." [s.n.], 2011. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275751.

Full text
Abstract:
Advisor: Cláudia Maria Bauzer Medeiros
Master's dissertation - Universidade Estadual de Campinas, Instituto de Computação
Research on computing applied to biodiversity presents several challenges, ranging from massive volumes of highly heterogeneous data to the variety of user profiles. This kind of scenario requires versatile data retrieval and management tools. Available tools are still limited; most often, they only consider textual data and do not take advantage of the multiple data types available, such as images or sounds. This dissertation discusses issues concerning multimodal queries that involve both text and images as search parameters, for the domain of biodiversity. It presents the specification and implementation of a set of tools to process such queries, validated with real data from Unicamp's Zoology Museum. The main contributions also include the construction of a taxonomic ontology that includes species' common names, and support for both researchers and non-experts in formulating queries. Such features extend the scope of queries available in biodiversity information systems. This research is associated with the Bio-CORE project, jointly conducted by researchers in computing and biology, to design and develop computational tools to support research in biodiversity.
Master's degree in Computer Science (Databases)
APA, Harvard, Vancouver, ISO, and other styles
15

Dyar, Samuel S. "A multimodal speech interface for dynamic creation and retrieval of geographical landmarks on a mobile device." Thesis, Massachusetts Institute of Technology, 2010. http://hdl.handle.net/1721.1/62638.

Full text
Abstract:
As mobile devices become more powerful, researchers look to develop innovative applications that use new and effective means of input. Furthermore, developers must exploit the device's many capabilities (GPS, camera, touch screen, etc.) in order to make equally powerful applications. This thesis presents the development of a multimodal system that allows users to create and share informative geographical landmarks using Android-powered smartphones. The content associated with each landmark is dynamically integrated into the system's vocabulary, which allows users to easily use speech to access landmarks through the information related to them. The initial results of releasing the application on the Android Market have been encouraging, but also suggest that improvements need to be made to the system.
APA, Harvard, Vancouver, ISO, and other styles
16

Calumby, Rodrigo Tripodi 1985. "Recuperação multimodal de imagens com realimentação de relevância baseada em programação genética." [s.n.], 2010. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275814.

Full text
Abstract:
Advisor: Ricardo da Silva Torres
Master's dissertation - Universidade Estadual de Campinas, Instituto de Computação
This work presents an approach for multimodal content-based image retrieval with relevance feedback based on genetic programming. We assume that there is textual information (e.g., metadata, textual descriptions) associated with collection images, and that image content properties (e.g., color and texture) are characterized by image descriptors. Given the information obtained over the relevance feedback iterations, genetic programming is used to create effective functions that combine the similarities associated with different features. Using these new functions, the different similarities are combined into a unique measure that more properly meets the user's needs. The main contribution of this work is the proposal and implementation of two frameworks. The first one, RFCore, is a generic framework for relevance feedback tasks over digital objects. The second one, MMRF-GP, built on top of RFCore, is a framework for digital object retrieval with relevance feedback based on genetic programming. We validated the proposed multimodal image retrieval approach on two datasets, one from the University of Washington and another from the ImageCLEF Photographic Retrieval Task. Our approach yielded better results for multimodal image retrieval than single-modality approaches, and achieved better visual and multimodal retrieval results than the best submissions to the ImageCLEF Photographic Retrieval Task 2008.
Master's degree in Computer Science (Information Retrieval Systems)
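As a simplified stand-in for the method above, the sketch below evolves a weighted linear combination of similarity measures with a basic genetic algorithm, scoring candidates by precision at 5 on the user's relevance feedback. The thesis evolves full genetic-programming expression trees rather than weight vectors, and the data here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(42)

def fitness(weights, sims, relevant):
    """Precision@5 of the ranking produced by a weighted sum of similarities."""
    top = np.argsort(-(sims @ weights))[:5]
    return np.isin(top, relevant).mean()

# Toy data: 100 docs scored by 3 measures (e.g. colour, texture, text similarity);
# documents 0-9 were marked relevant during relevance feedback.
sims = rng.random((100, 3))
relevant = np.arange(10)
sims[relevant] += 0.3  # relevant docs score slightly higher on average

pop = rng.random((30, 3))  # population of candidate weight vectors
for generation in range(50):
    scores = np.array([fitness(w, sims, relevant) for w in pop])
    parents = pop[np.argsort(-scores)[:10]]                 # selection
    children = parents[rng.integers(0, 10, 20)] \
        + rng.normal(0, 0.1, (20, 3))                       # mutation
    pop = np.vstack([parents, np.clip(children, 0, None)])  # elitism

best = max(pop, key=lambda w: fitness(w, sims, relevant))
print(best)
```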
APA, Harvard, Vancouver, ISO, and other styles
17

Durak, Nurcan. "Semantic Video Modeling And Retrieval With Visual, Auditory, Textual Sources." Master's thesis, METU, 2004. http://etd.lib.metu.edu.tr/upload/12605438/index.pdf.

Full text
Abstract:
Studies on content-based video indexing and retrieval aim at accessing video content from different aspects more efficiently and effectively. Most studies have concentrated on the visual component of video content when modeling and retrieving it. Besides the visual component, much valuable information is also carried in other media components, such as superimposed text, closed captions, audio, and speech that accompany the pictorial component. In this study, the semantic content of video is modeled using visual, auditory, and textual components. In the visual domain, visual events, visual objects, and spatial characteristics of visual objects are extracted. In the auditory domain, auditory events and auditory objects are extracted. In the textual domain, speech transcripts and visible texts are considered. With the proposed model, users can access video content from different aspects and get the desired information more quickly. Besides multimodality, the model is built on semantic hierarchies that enable querying the video content at different semantic levels: sequence-scene hierarchies in the visual domain, background-foreground hierarchies in the auditory domain, and subject hierarchies in the speech domain. The presented model has been implemented, and multimodal content queries, hierarchical queries, fuzzy spatial queries, fuzzy regional queries, fuzzy spatio-temporal queries, and temporal queries have been applied to video content successfully.
APA, Harvard, Vancouver, ISO, and other styles
18

Oztarak, Hakan. "Structural And Event Based Multimodal Video Data Modeling." Master's thesis, METU, 2005. http://etd.lib.metu.edu.tr/upload/12606919/index.pdf.

Full text
Abstract:
Investments in multimedia technology enable us to store many more reflections of the real world in the digital world as videos. By recording videos about real-world entities, we carry a great deal of information into the digital world directly. In order to store and efficiently query this information, a video database system (VDBS) is necessary. In this thesis work, we propose a structural, event-based and multimodal (SEBM) video data model for VDBSs. The SEBM video data model supports three different modalities, visual, auditory and textual, and we propose that all three can be handled within a single SEBM video data model. This proposal is supported by the way humans interpret video data. Hence we can answer content-based, spatio-temporal and fuzzy user queries more easily, since we store the video data the way users interpret real-world data. We follow a divide-and-conquer technique when answering very complicated queries. We implemented the SEBM video data model in a Java-based prototype system that uses XML for representing the SEBM data model and Berkeley XML DBMS for storing the data.
APA, Harvard, Vancouver, ISO, and other styles
19

Matera, Tomáš. "Visipedia - Multi-dimensional Object Embedding Based on Perceptual Similarity." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-236115.

Full text
Abstract:
Problems such as fine-grained categorization and human computation have become increasingly popular in the community in recent years, as evidenced by the considerable number of publications on these topics. While most of these works use "classical" image features extracted by computers, this work focuses primarily on perceptual properties that cannot easily be captured by computers and require involving people in the data collection process. The thesis examines possibilities for cheap and efficient acquisition of perceptual similarities from users, also with respect to scalability. It further evaluates several relevant experiments and presents methods that improve the efficiency of data collection. Methods for learning multidimensional embeddings and for searching the resulting space are also summarized and compared. The obtained results are then used in a comprehensive experiment evaluated on a dataset of food images: the procedure starts with collecting similarities from users, continues with building a multidimensional food space, and ends with searching that space.
APA, Harvard, Vancouver, ISO, and other styles
20

Vukotic, Vedran. "Deep Neural Architectures for Automatic Representation Learning from Multimedia Multimodal Data." Thesis, Rennes, INSA, 2017. http://www.theses.fr/2017ISAR0015/document.

Full text
Abstract:
This dissertation defends the thesis that deep neural networks are suited for the analysis of visual, textual, and fused visual and textual content. The work evaluates the ability of deep neural networks to learn multimodal representations, in either an unsupervised or a supervised manner, and brings the following main contributions:
1) Recurrent neural networks for spoken language understanding (slot filling): different architectures are compared for this task, with the aim of modeling both the input context and the output label dependencies.
2) Action prediction from single images: we propose an architecture that allows us to predict human actions from a single image; the architecture is evaluated on videos, using only one frame as input.
3) Bidirectional multimodal encoders: the main contribution of this thesis is a neural architecture that translates from one modality to the other and back, offering an improved multimodal representation space in which the initially disjoint representations can be translated and fused. This enables improved multimodal fusion of multiple modalities. The architecture was extensively studied and evaluated in international benchmarks within the task of video hyperlinking, where it defined the state of the art.
4) Generative adversarial networks for multimodal fusion: continuing on the topic of multimodal fusion, we evaluate the possibility of using conditional generative adversarial networks to learn multimodal representations; in addition, generative adversarial networks permit the learned model to be visualized directly in the image domain.
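A minimal PyTorch rendering of the bidirectional encoder idea in contribution 3, with invented dimensions and without the tying of the central-layer weights used by the original architecture: each modality is translated into the other through same-size hidden layers whose activations form the joint representation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiDNN(nn.Module):
    """Translate modality A <-> B through a shared-size central layer."""
    def __init__(self, dim_a=128, dim_b=100, hidden=64):
        super().__init__()
        self.enc_a, self.dec_a = nn.Linear(dim_a, hidden), nn.Linear(hidden, dim_a)
        self.enc_b, self.dec_b = nn.Linear(dim_b, hidden), nn.Linear(hidden, dim_b)

    def forward(self, a, b):
        h_ab = torch.tanh(self.enc_a(a))  # A -> central space
        h_ba = torch.tanh(self.enc_b(b))  # B -> central space
        return self.dec_b(h_ab), self.dec_a(h_ba), h_ab, h_ba

model = BiDNN()
a, b = torch.randn(32, 128), torch.randn(32, 100)
b_hat, a_hat, h_ab, h_ba = model(a, b)
# Train each modality to reconstruct the other...
loss = F.mse_loss(b_hat, b) + F.mse_loss(a_hat, a)
# ...then use the central activations as the joint multimodal representation
joint = torch.cat([h_ab, h_ba], dim=1)
```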
APA, Harvard, Vancouver, ISO, and other styles
21

Meseguer, Brocal Gabriel. "Multimodal analysis : informed content estimation and audio source separation." Electronic Thesis or Diss., Sorbonne université, 2020. http://www.theses.fr/2020SORUS111.

Full text
Abstract:
This dissertation proposes the study of multimodal learning in the context of musical signals. Throughout, we focus on the interaction between audio signals and text information. Among the many text sources related to music that can be used (e.g. reviews, metadata, or social network feedback), we concentrate on lyrics. The singing voice directly connects the audio signal and the text information in a unique way, combining melody and lyrics, where a linguistic dimension complements the abstraction of musical instruments. Our study focuses on the audio and lyrics interaction, targeting source separation and informed content estimation. Real-world stimuli are produced by complex phenomena and their constant interaction in various domains. Our understanding learns useful abstractions that fuse different modalities into a joint representation. Multimodal learning describes methods that analyse phenomena from different modalities and their interaction in order to tackle complex tasks. This results in better and richer representations that improve the performance of current machine learning methods. To develop our multimodal analysis, we first need to address the lack of data containing singing voice with aligned lyrics. Such data is mandatory to develop our ideas, so we investigate how to create this kind of dataset automatically, leveraging resources from the World Wide Web. Creating such a dataset is a challenge in itself that raises many research questions, and we are constantly working with the classic "chicken or the egg" problem: acquiring and cleaning the data requires accurate models, but it is difficult to train models without data. We propose to use the teacher-student paradigm to develop a method in which dataset creation and model learning are not seen as independent tasks but rather as complementary efforts. In this process, non-expert karaoke time-aligned lyrics and notes describe the lyrics as a sequence of time-aligned notes with their associated textual information. We then link each annotation to the correct audio and globally align the annotations to it. For this purpose, we use the normalized cross-correlation between the voice annotation sequence and the singing voice probability vector, which is obtained automatically with a deep convolutional neural network. Using the collected data we progressively improve that model; every time we have an improved version, we can in turn correct and enhance the data.
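The global alignment step can be illustrated with plain NumPy: slide the z-normalized annotation sequence along the frame-wise singing-voice probability vector and keep the offset with the highest correlation. Global (rather than per-window) normalization is a simplification, and all lengths are invented.

```python
import numpy as np

def ncc(annotation, voice_prob):
    """Normalized cross-correlation of a voice-activity annotation with a
    frame-wise singing-voice probability vector (globally z-normalized)."""
    a = (annotation - annotation.mean()) / (annotation.std() + 1e-8)
    v = (voice_prob - voice_prob.mean()) / (voice_prob.std() + 1e-8)
    return np.correlate(v, a, mode="valid") / len(a)

# Toy example: locate a 100-frame karaoke annotation within a 1000-frame track
annotation = np.concatenate([np.zeros(20), np.ones(60), np.zeros(20)])
voice_prob = np.zeros(1000)
voice_prob[440:500] = 0.9  # frames where the voice detector fires

print(ncc(annotation, voice_prob).argmax())  # ~420: best starting offset
```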
APA, Harvard, Vancouver, ISO, and other styles
22

Simonetta, Federico. "Music Interpretation Analysis: A Multimodal Approach to Score-Informed Resynthesis of Piano Recordings." Doctoral thesis, Università degli Studi di Milano, 2022. http://hdl.handle.net/2434/918909.

Full text
Abstract:
This thesis discusses the development of technologies for the automatic resynthesis of music recordings using digital synthesizers. First, the main issue is identified as understanding how Music Information Processing (MIP) methods can take into consideration the influence of the acoustic context on the music performance. For this, a novel conceptual and mathematical framework named "Music Interpretation Analysis" (MIA) is presented. In the proposed framework, a distinction is made between the "performance" – the physical action of playing – and the "interpretation" – the action that the performer wishes to achieve. Second, the thesis describes further work aiming at the democratization of music production tools via automatic resynthesis: 1) it elaborates software and file formats for historical music archiving and multimodal machine-learning datasets; 2) it explores and extends MIP technologies; 3) it presents the mathematical foundations of the MIA framework and shows preliminary evaluations demonstrating the effectiveness of the approach.
APA, Harvard, Vancouver, ISO, and other styles
23

Ekeberg, Tomas. "Flash Diffractive Imaging in Three Dimensions." Doctoral thesis, Uppsala universitet, Molekylär biofysik, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-179643.

Full text
Abstract:
During the last few years we have seen the birth of free-electron lasers, a new type of light source ten billion times brighter than synchrotrons and able to produce pulses only a few femtoseconds long. One of the main motivations for building these multi-million-dollar machines was the prospect of imaging biological samples such as proteins and viruses in 3D without the need for crystallization or staining. This thesis contains some of the first biological results from free-electron lasers. These results include 2D images, both of whole cells and of the giant mimivirus, as well as a 3D density map of the mimivirus assembled from diffraction patterns of many virus particles. These are important proof-of-concept experiments, but they also mark the point where free-electron lasers start to produce biologically relevant results. The most noteworthy of these results is the unexpectedly non-uniform density distribution of the internals of the mimivirus. We also present Hawk, the only open-source software toolkit for analysing single-particle diffraction data. The Uppsala-developed program suite supports a wide range of algorithms and takes advantage of Graphics Processing Units, which makes it very computationally efficient. Last, the problem introduced by structural variability in samples is discussed, including a description of the problem and how it can be overcome, and also how it could be turned into an advantage that allows us to image samples in all of their conformational states.
APA, Harvard, Vancouver, ISO, and other styles
24

Bucher, Maxime. "Apprentissage et exploitation de représentations sémantiques pour la classification et la recherche d'images." Thesis, Normandie, 2018. http://www.theses.fr/2018NORMC250/document.

Full text
Abstract:
In this thesis, we examine several practical difficulties of deep learning models. Despite their promising results in computer vision, deploying them in some real use cases raises questions. For example, in classification tasks where thousands of categories have to be recognised, it is sometimes difficult to gather enough training data for each category. We propose two new approaches for this learning scenario, called "zero-shot learning". We use semantic information to model classes, which allows us to define models by description, as opposed to modelling from a set of examples. In the first chapter we propose to optimize a metric in order to transform the distribution of the original data and obtain an optimal attribute distribution. In the following chapter, unlike the standard approaches of the literature that rely on learning a common embedding space, we propose to generate visual features from a conditional generator. The artificial examples can then be used in addition to real data for learning a discriminative classifier. In the second part of this thesis, we address the question of computational intelligibility for computer vision tasks. Because of the many complex transformations of deep learning algorithms, it is difficult for a user to interpret the returned prediction. Our proposition is to introduce what we call a "semantic bottleneck" in the processing pipeline, a crossing point at which the representation of the image is expressed entirely in natural language, while retaining the efficiency of numerical representations. This semantic bottleneck makes it possible to detect failure cases in the prediction process, so that a user can accept or reject the decision depending on their knowledge and experience.
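A hedged sketch of the generator-based approach described above: a generator conditioned on class attributes hallucinates visual features for classes with no training images. The dimensions echo common zero-shot benchmarks but are assumptions, and the training of the generator (adversarial or otherwise), which the approach requires, is omitted.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Map a class-attribute vector plus noise to a synthetic visual feature."""
    def __init__(self, attr_dim=85, noise_dim=32, feat_dim=2048):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(attr_dim + noise_dim, 512),
            nn.ReLU(),
            nn.Linear(512, feat_dim),
        )

    def forward(self, attributes, n_samples=1):
        attributes = attributes.expand(n_samples, -1)  # repeat the class description
        z = torch.randn(n_samples, self.noise_dim)     # noise gives sample diversity
        return self.net(torch.cat([attributes, z], dim=1))

gen = ConditionalGenerator()
unseen_class_attrs = torch.rand(1, 85)  # placeholder attribute vector
fake_feats = gen(unseen_class_attrs, n_samples=200)
# fake_feats can be mixed with real features of seen classes to train an
# ordinary discriminative classifier that also covers the unseen classes.
```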
APA, Harvard, Vancouver, ISO, and other styles
25

Inagaki, Yasuyoshi, Katsuhiko Toyama, Nobuo Kawaguchi, Shigeki Matsubara, and Satoru Matsunaga. "Sync/Mail: 話し言葉の漸進的変換に基づく即時応答インタフェース [An immediate-response interface based on incremental transformation of spoken language]." Information Processing Society of Japan, 1998. http://hdl.handle.net/2237/15382.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

El, Mahdaouy Abdelkader. "Accès à l'information dans les grandes collections textuelles en langue arabe." Thesis, Université Grenoble Alpes (ComUE), 2017. http://www.theses.fr/2017GREAM091/document.

Full text
Abstract:
Face à la quantité d'information textuelle disponible sur le web en langue arabe, le développement de Systèmes de Recherche d'Information (SRI) efficaces est devenu incontournable pour retrouver l'information pertinente. La plupart des SRI actuels de la langue arabe reposent sur la représentation par sac de mots et l'indexation des documents et des requêtes est effectuée souvent par des mots bruts ou des racines, ce qui conduit à plusieurs problèmes tels que l'ambigüité et la disparité des termes. Dans ce travail de thèse, nous nous sommes intéressés à apporter des solutions aux problèmes d'ambigüité et de disparité des termes pour l'amélioration de la représentation des documents et le processus de l'appariement des documents et des requêtes. Nous apportons quatre contributions au niveau des processus de représentation, d'indexation et de recherche d'information en langue arabe. La première contribution consiste à représenter les documents à la fois par des termes simples et des termes complexes. Cela est justifié par le fait que les termes simples, seuls et isolés de leur contexte, sont ambigus et moins précis pour représenter le contenu des documents. Ainsi, nous avons proposé une méthode hybride pour l'extraction de termes complexes en langue arabe, en combinant des propriétés linguistiques et des modèles statistiques. Le filtre linguistique repose à la fois sur l'étiquetage morphosyntaxique et la prise en compte des variations pour sélectionner les termes candidats. Pour sélectionner les termes candidats pertinents, nous avons introduit une mesure d'association permettant de combiner l'information contextuelle avec les degrés de spécificité et d'unité. La deuxième contribution consiste à explorer et évaluer les systèmes de recherche d'information permettant de tenir compte de l'ensemble des éléments d'indexation (termes simples et complexes). Par conséquent, nous étudions plusieurs extensions des modèles existants de RI pour l'intégration des termes complexes. En outre, nous explorons une panoplie de modèles de proximité. Pour la prise en compte des dépendances de termes dans les modèles de RI, nous introduisons une condition caractérisant de tels modèles et leur validation théorique. La troisième contribution permet de pallier le problème de disparité des termes en proposant une méthode pour intégrer la similarité entre les termes dans les modèles de RI en s'appuyant sur les représentations distribuées des mots (RDM). L'idée sous-jacente consiste à permettre aux termes similaires à ceux de la requête de contribuer aux scores des documents. Les extensions des modèles de RI proposées dans le cadre de cette méthode sont validées en utilisant les contraintes heuristiques d'appariement sémantique. La dernière contribution concerne l'amélioration des modèles de rétro-pertinence (Pseudo Relevance Feedback, PRF). Étant basée également sur les RDM, notre méthode permet d'intégrer la similarité entre les termes d'expansion et ceux de la requête dans les modèles standards de PRF. La validation expérimentale de l'ensemble des contributions apportées dans le cadre de cette thèse est effectuée en utilisant la collection standard TREC 2002/2001 de la langue arabe.
Given the amount of Arabic textual information available on the web, developing effective Information Retrieval Systems (IRS) has become essential to retrieve relevant information. Most current Arabic IRSs are based on the bag-of-words representation, where documents are indexed using surface words, roots or stems. Two main drawbacks of the latter representation are the ambiguity of Single Word Terms (SWTs) and term mismatch. The aim of this work is to deal with SWT ambiguity and term mismatch. Accordingly, we propose four contributions to improve Arabic content representation, indexing, and retrieval. The first contribution consists of representing Arabic documents using Multi-Word Terms (MWTs). The latter is motivated by the fact that MWTs are more precise representational units and less ambiguous than isolated SWTs. Hence, we propose a hybrid method to extract Arabic MWTs, which combines linguistic and statistical filtering of MWT candidates. The linguistic filter uses POS tagging to identify MWT candidates that fit a set of syntactic patterns and handles the problem of MWT variation. Then, the statistical filter ranks MWT candidates using our proposed association measure that combines contextual information with both termhood and unithood measures. In the second contribution, we explore and evaluate several IR models for ranking documents using both SWTs and MWTs. Additionally, we investigate a wide range of proximity-based IR models for Arabic IR. Then, we introduce a formal condition that IR models should satisfy to deal adequately with term dependencies. The third contribution consists of a method based on Distributed Representations of Words, namely Word Embeddings (WE), for Arabic IR. It relies on incorporating WE semantic similarities into existing probabilistic IR models in order to deal with term mismatch. The aim is to allow distinct, but semantically similar, terms to contribute to documents scores. The last contribution is a method to incorporate WE similarity into Pseudo-Relevance Feedback (PRF) for Arabic Information Retrieval. The main idea is to select expansion terms using their distribution in the set of top pseudo-relevant documents along with their similarity to the original query terms. The experimental validation of all the proposed contributions is performed using the standard Arabic TREC 2002/2001 collection.
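As an illustration of the third contribution's core idea, the following sketch lets document terms that are close to a query term in word-embedding space contribute to the document score. It is a simplified stand-in, not the exact extended IR models of the thesis; all names and the threshold are assumptions:

```python
import numpy as np

def soft_match_score(query_terms, doc_terms, embeddings, idf, threshold=0.7):
    """Each query term is matched to its most similar document term in
    embedding space; similarities below `threshold` are discarded to
    avoid topic drift."""
    score = 0.0
    for q in query_terms:
        if q not in embeddings:
            continue
        qv = embeddings[q]
        best = 0.0
        for d in doc_terms:
            if d not in embeddings:
                continue
            dv = embeddings[d]
            sim = qv @ dv / (np.linalg.norm(qv) * np.linalg.norm(dv))
            best = max(best, sim)
        if best >= threshold:
            score += idf.get(q, 1.0) * best   # weight matches by query-term IDF
    return score

# Toy usage with 3-d vectors standing in for trained word embeddings.
emb = {"bank": np.array([1., 0., 0.]), "river": np.array([0.9, 0.1, 0.])}
print(soft_match_score(["bank"], ["river"], emb, idf={"bank": 2.0}))
```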
APA, Harvard, Vancouver, ISO, and other styles
27

Couairon, Guillaume. "Text-Based Semantic Image Editing." Electronic Thesis or Diss., Sorbonne université, 2023. http://www.theses.fr/2023SORUS248.

Full text
Abstract:
L'objectif de cette thèse est de proposer des algorithmes pour la tâche d'édition d'images basée sur le texte (TIE), qui consiste à éditer des images numériques selon une instruction formulée en langage naturel. Par exemple, étant donné une image d'un chien et la requête "Changez le chien en un chat", nous voulons produire une nouvelle image où le chien a été remplacé par un chat, en gardant tous les autres aspects de l'image inchangés (couleur et pose de l'animal, arrière-plan). L'objectif à long terme est de permettre à tout un chacun de modifier ses images en utilisant uniquement des requêtes en langage naturel. Une des spécificités de l'édition d'images basée sur du texte est qu'il n'y a pratiquement pas de données d'entraînement pour former un algorithme supervisé. Dans cette thèse, nous proposons différentes solutions pour l'édition d'images, basées sur l'adaptation de grands modèles multimodaux entraînés sur d'énormes ensembles de données. Nous étudions tout d'abord une configuration d'édition simplifiée, appelée édition d'image basée sur la recherche, qui ne nécessite pas de modifier directement l'image d'entrée. Au lieu de cela, étant donné l'image et la requête de modification, nous recherchons dans une grande base de données une image qui correspond à la modification demandée. Nous nous appuyons sur des modèles multimodaux d'alignement image/texte entraînés sur des ensembles de données à l'échelle du web (comme CLIP) pour effectuer de telles transformations sans aucun exemple. Nous proposons également le cadre SIMAT pour évaluer l'édition d'images basée sur la recherche. Nous étudions ensuite comment modifier directement l'image d'entrée. Nous proposons FlexIT, une méthode qui modifie itérativement l'image d'entrée jusqu'à ce qu'elle satisfasse un "objectif d'édition" abstrait défini dans un espace d'intégration multimodal. Nous introduisons des termes de régularisation pour imposer des transformations réalistes. Ensuite, nous nous concentrons sur les modèles de diffusion, qui sont des modèles génératifs puissants capables de synthétiser de nouvelles images conditionnées par une grande variété d'invites textuelles. Nous démontrons leur polyvalence en proposant DiffEdit, un algorithme qui adapte les modèles de diffusion pour l'édition d'images sans réglage fin. Nous proposons une stratégie "zero-shot" pour trouver automatiquement où l'image initiale doit être modifiée pour satisfaire la requête de transformation de texte.
The aim of this thesis is to propose algorithms for the task of Text-based Image Editing (TIE), which consists in editing digital images according to an instruction formulated in natural language. For instance, given an image of a dog, and the query "Change the dog into a cat", we want to produce a novel image where the dog has been replaced by a cat, keeping all other image aspects unchanged (animal color and pose, background). The north-star goal is to enable anyone to edit their images using only queries in natural language. One specificity of text-based image editing is that there is practically no training data to train a supervised algorithm. In this thesis, we propose different solutions for editing images, based on the adaptation of large multimodal models trained on huge datasets. We first study a simplified editing setup, named Retrieval-based image editing, which does not require to directly modify the input image. Instead, given the image and modification query, we search in a large database an image that corresponds to the requested edit. We leverage multimodal image/text alignment models trained on web-scale datasets (like CLIP) to perform such transformations without any examples. We also propose the SIMAT framework for evaluating retrieval-based image editing. We then study how to directly modify the input image. We propose FlexIT, a method which iteratively changes the input image until it satisfies an abstract "editing objective" defined in a multimodal embedding space. We introduce a variety of regularization terms to enforce realistic transformations. Next, we focus on diffusion models, which are powerful generative models able to synthesize novel images conditioned on a wide variety of textual prompts. We demonstrate their versatility by proposing DiffEdit, an algorithm which adapts diffusion models for image editing without finetuning. We propose a zero-shot strategy for finding automatically where the initial image should be changed to satisfy the text transformation query. Finally, we study a specific challenge useful in the context of image editing: how to synthesize a novel image by giving as constraint a spatial layout of objects with textual descriptions, a task which is known as Semantic Image Synthesis. We adopt the same strategy, consisting in adapting diffusion models to solve the task without any example. We propose the ZestGuide algorithm, which leverages the spatio-semantic information encoded in the attention layers of diffusion models.
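The retrieval-based editing setup described above can be sketched with CLIP-style embeddings: shift the image embedding along the text edit direction and retrieve the nearest database image. This is a schematic illustration, assuming pre-computed, L2-normalised embeddings, and is not the thesis code:

```python
import numpy as np

def retrieval_edit(img_emb, src_txt_emb, tgt_txt_emb, db_embs, alpha=0.5):
    """Query = image embedding shifted along the text edit direction.
    All inputs are assumed to be L2-normalised CLIP-style embeddings;
    db_embs is an (n_images, dim) matrix of database image embeddings."""
    direction = tgt_txt_emb - src_txt_emb     # e.g. embed("cat") - embed("dog")
    query = img_emb + alpha * direction
    query /= np.linalg.norm(query)
    scores = db_embs @ query                  # cosine similarities
    return int(np.argmax(scores))             # index of the best-matching image
```

The design choice here mirrors the zero-shot spirit of the approach: no editing model is trained; the shared image/text space does all the work.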
APA, Harvard, Vancouver, ISO, and other styles
28

Slizovskaia, Olga. "Audio-visual deep learning methods for musical instrument classification and separation." Doctoral thesis, Universitat Pompeu Fabra, 2020. http://hdl.handle.net/10803/669963.

Full text
Abstract:
In music perception, the information we receive from a visual system and audio system is often complementary. Moreover, visual perception plays an important role in the overall experience of being exposed to a music performance. This fact brings attention to machine learning methods that could combine audio and visual information for automatic music analysis. This thesis addresses two research problems: instrument classification and source separation in the context of music performance videos. A multimodal approach for each task is developed using deep learning techniques to train an encoded representation for each modality. For source separation, we also study two approaches conditioned on instrument labels and examine the influence that two extra sources of information have on separation performance compared with a conventional model. Another important aspect of this work is in the exploration of different fusion methods which allow for better multimodal integration of information sources from associated domains.
En la percepción musical, normalmente recibimos por nuestro sistema visual y por nuestro sistema auditivo informaciones complementarias. Además, la percepción visual juega un papel importante en nuestra experiencia integral ante una interpretación musical. Esta relación entre audio y visión ha incrementado el interés en métodos de aprendizaje automático capaces de combinar ambas modalidades para el análisis musical automático. Esta tesis se centra en dos problemas principales: la clasificación de instrumentos y la separación de fuentes en el contexto de videos musicales. Para cada uno de los problemas, se desarrolla un método multimodal utilizando técnicas de Deep Learning. Esto nos permite obtener -a través del aprendizaje- una representación codificada para cada modalidad. Además, para el problema de la separación de fuentes, también proponemos dos modelos condicionados a las etiquetas de los instrumentos, y examinamos la influencia que tienen dos fuentes de información extra en el rendimiento de la separación -comparándolas contra un modelo convencional-. Otro aspecto importante de este trabajo se basa en la exploración de diferentes modelos de fusión que permiten una mejor integración multimodal de fuentes de información de dominios asociados.
En la percepció visual, és habitual que rebem informacions complementàries des del nostres sistemes visual i auditiu. A més a més, la percepció visual té un paper molt important en la nostra experiència integral davant una interpretació musical. Aquesta relació entre àudio i visió ha fet créixer l'interès en mètodes d’aprenentatge automàtic capaços de combinar ambdues modalitats per l’anàlisi musical automàtic. Aquesta tesi se centra en dos problemes principals: la classificació d'instruments i la separació de fonts en el context dels vídeos musicals. Per a cadascú dels problemes, s'ha desenvolupat un mètode multimodal fent servir tècniques de Deep Learning. Això ens ha permès d'obtenir – gràcies a l’aprenentatge- una representació codificada per a cada modalitat. A més a més, en el cas del problema de separació de fonts, també proposem dos models condicionats a les etiquetes dels instruments, i examinem la influència que tenen dos fonts d’informació extra sobre el rendiment de la separació -tot comparant-les amb un model convencional-. Un altre aspecte d’aquest treball es basa en l’exploració de diferents models de fusió, els quals permeten una millor integració multimodal de fonts d'informació de dominis associats.
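A common way to condition a separation network on instrument labels, in the spirit of the label-conditioned models described in the English abstract above, is feature-wise modulation (FiLM). The sketch below is a generic illustration, not the thesis architecture:

```python
import torch
import torch.nn as nn

class FiLMConditioning(nn.Module):
    """Scales and shifts intermediate feature maps according to a one-hot
    instrument label, so a single network can separate different instruments."""
    def __init__(self, n_instruments, n_channels):
        super().__init__()
        self.gamma = nn.Linear(n_instruments, n_channels)
        self.beta = nn.Linear(n_instruments, n_channels)

    def forward(self, features, label):
        # features: (batch, channels, time, freq); label: (batch, n_instruments)
        g = self.gamma(label).unsqueeze(-1).unsqueeze(-1)
        b = self.beta(label).unsqueeze(-1).unsqueeze(-1)
        return g * features + b

# Toy usage: modulate a feature map for a 4-instrument vocabulary.
film = FiLMConditioning(n_instruments=4, n_channels=16)
feats = torch.randn(2, 16, 128, 64)
label = torch.eye(4)[[0, 2]]              # "instrument 0" and "instrument 2"
out = film(feats, label)
```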
APA, Harvard, Vancouver, ISO, and other styles
29

Nguyen, Nhu Van. "Représentations visuelles de concepts textuels pour la recherche et l'annotation interactives d'images." Phd thesis, Université de La Rochelle, 2011. http://tel.archives-ouvertes.fr/tel-00730707.

Full text
Abstract:
In image retrieval today, we often handle large volumes of images, which may vary or even arrive continuously. An image collection thus contains both older and newer images, the former already indexed and possibly annotated, the latter awaiting indexing or annotation. Since the collection is not uniformly annotated, access through textual queries is difficult. In this work we present different techniques for interacting with, browsing and searching this type of image collection. First, a short-term interaction model is used to improve the precision of the system. Second, building on a long-term interaction model, we propose to associate textual words with visual features so that images can be retrieved by text, by visual content, or by a mixed text/visual query. This image retrieval model makes it possible to iteratively refine the annotation and the knowledge about the images. We identify four contributions in this work. The first contribution is a multimodal image retrieval system that integrates different data sources, such as image content and text; it allows querying by image, querying by keyword, and the use of hybrid queries. The second contribution is a new relevance-feedback technique combining two classical techniques widely used in information retrieval: query-point movement and query expansion. By taking advantage of non-relevant images and of the strengths of these two classical techniques, our method gives very good results for effective interactive image retrieval. The third contribution is a model named "Bags of KVR" (Keyword Visual Representation) that creates links between semantic concepts and visual representations, building on the Bag-of-Words model. Thanks to an incremental learning strategy, this model provides the association between semantic concepts and visual features, which helps improve the precision of image annotation and retrieval performance. The fourth contribution is a mechanism for building knowledge incrementally from scratch. We do not separate the annotation and retrieval phases, so the user can issue queries as soon as the system is started, while the system keeps learning as it is used. These contributions are complemented by an interface for visualisation and mixed textual/visual querying. Even though only two types of information are used for now, namely text and visual content, the genericity of the proposed model allows its extension to other types of information external to the image, such as location (GPS) and time.
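The relevance-feedback combination in the second contribution builds on two classical ingredients; the query-point-movement part can be sketched with a Rocchio-style update (our own illustration with conventional weights, applicable to visual feature vectors as well as text):

```python
import numpy as np

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Move the query vector towards the centroid of relevant examples and
    away from the centroid of non-relevant ones."""
    q = alpha * np.asarray(query, dtype=float)
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(non_relevant):
        q -= gamma * np.mean(non_relevant, axis=0)
    return q

# Toy usage on 2-d "feature vectors".
q = rocchio([1.0, 0.0], relevant=[[0.8, 0.6]], non_relevant=[[0.0, 1.0]])
print(q)
```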
APA, Harvard, Vancouver, ISO, and other styles
30

Bonardi, Fabien. "Localisation visuelle multimodale visible/infrarouge pour la navigation autonome." Thesis, Normandie, 2017. http://www.theses.fr/2017NORMR028/document.

Full text
Abstract:
On regroupe sous l'expression navigation autonome l'ensemble des méthodes visant à automatiser les déplacements d'un robot mobile. Les travaux présentés se concentrent sur la problématique de la localisation en milieu extérieur, urbain et périurbain, et approchent la problématique de la localisation visuelle soumise à la fois à un changement de capteurs (géométrie et modalité) ainsi qu'aux changements de l'environnement à long terme, contraintes combinées encore très peu étudiées dans l'état de l'art. Les recherches menées dans le cadre de cette thèse ont porté sur l'utilisation exclusive de capteurs de vision. La contribution majeure de cette thèse porte sur la phase de description et compression des données issues des images sous la forme d'un histogramme de mots visuels que nous avons nommée PHROG (Plural Histograms of Restricted Oriented Gradients). Les expériences menées ont été réalisées sur plusieurs bases d'images avec différentes modalités visibles et infrarouges. Les résultats obtenus démontrent une amélioration des performances de reconnaissance de scènes comparée aux méthodes de l'état de l'art. Par la suite, nous nous intéresserons à la nature séquentielle des images acquises dans un contexte de navigation afin de filtrer et supprimer des estimations de localisation aberrantes. Les concepts d'un cadre probabiliste bayésien permettent deux applications de filtrage probabiliste appliquées à notre problématique : une première solution définit un modèle de déplacement simple du robot avec un filtre d'histogrammes et la deuxième met en place un modèle plus évolué faisant appel à l'odométrie visuelle au sein d'un filtre particulaire.
Autonomous navigation gathers the set of algorithms which automate the moves of a mobile robot. The case study of this thesis focuses on the outdoor localisation issue with additional constraints: the use of visual sensors only, with variable specifications (geometry, modality, etc.), and long-term appearance changes of the surrounding environment. Both types of constraints are still rarely studied in the state of the art. Our main contribution concerns the description and compression steps of the data extracted from images. We developed a method called PHROG which represents data as a visual-words histogram. Obtained results on several image datasets show an improvement of the scene recognition performance compared to methods from the state of the art. In a context of navigation, acquired images are sequential, such that we can envision a filtering method to avoid faulty localisation estimations. Two probabilistic filtering approaches are proposed: a first one defines a simple movement model with a histograms filter and a second one sets up a more complex model using visual odometry and a particle filter.
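The first filtering solution can be illustrated with a discrete Bayes (histogram) filter over candidate places along a route. This is a schematic sketch under a simple forward-motion model, not the thesis implementation:

```python
import numpy as np

def predict(belief, p_move=0.8):
    """Motion model: with probability p_move the robot advanced one place."""
    return p_move * np.roll(belief, 1) + (1 - p_move) * belief

def update(belief, likelihood):
    """Measurement update with per-place image-matching likelihoods
    (e.g., similarities between PHROG-style descriptors)."""
    posterior = belief * likelihood
    return posterior / posterior.sum()

belief = np.full(100, 1 / 100)        # uniform prior over 100 route places
likelihood = np.random.rand(100)      # stand-in for descriptor matching scores
belief = update(predict(belief), likelihood)
print(belief.argmax())                # most probable current place
```

Outlier localisation estimates are naturally damped because a single spurious high likelihood cannot override the propagated belief.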
APA, Harvard, Vancouver, ISO, and other styles
31

ur Réhman, Shafiq. "Expressing emotions through vibration for perception and control." Doctoral thesis, Umeå universitet, Institutionen för tillämpad fysik och elektronik, 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-32990.

Full text
Abstract:
This thesis addresses a challenging problem: "how to let the visually impaired 'see' others' emotions". We, human beings, are heavily dependent on facial expressions to express ourselves. A smile shows that the person you are talking to is pleased, amused, relieved, etc. People use emotional information from facial expressions to switch between conversation topics and to determine attitudes of individuals. Missing the emotional information carried by facial expressions and head gestures makes it extremely difficult for the visually impaired to interact with others in social events. To enhance the social interaction abilities of the visually impaired, this thesis has focused on the scientific topic of expressing human emotions through vibrotactile patterns. It is quite challenging to deliver human emotions through touch, since our touch channel is very limited. We first investigated how to render emotions through a vibrator. We developed a real-time "lipless" tracking system to extract dynamic emotions from the mouth and employed mobile phones as a platform for the visually impaired to perceive primary emotion types. Later on, we extended the system to render more general dynamic media signals, for example rendering live football games through vibration in the mobile phone to improve the communication and entertainment experience of mobile users. To display more natural emotions (i.e. emotion type plus emotion intensity), we developed the technology to enable the visually impaired to directly interpret human emotions. This was achieved by use of machine vision techniques and a vibrotactile display. The display is comprised of a 'vibration actuators matrix' mounted on the back of a chair, and the actuators are sequentially activated to provide dynamic emotional information. The research focus has been on finding a global, analytical, and semantic representation for facial expressions to replace the state-of-the-art Facial Action Coding System (FACS) approach. We proposed to use the manifold of facial expressions to characterize dynamic emotions. The basic emotional expressions with increasing intensity become curves on the manifold extended from the center. The blends of emotions lie between those curves, and can be defined analytically by the positions of the main curves. The manifold is the "Braille code" of emotions. The developed methodology and technology have been extended to build assistive wheelchair systems that aid a specific group of disabled people, cerebral palsy or stroke patients (i.e. those lacking fine motor control skills), who cannot access and control a wheelchair through conventional means such as a joystick or chin stick. The solution is to extract the manifold of head or tongue gestures for controlling the wheelchair. The manifold is rendered by a 2D vibration array to provide the user of the wheelchair with action information from gestures and system status information, which is very important in enhancing the usability of such an assistive system. The current research work not only provides a foundation stone for vibrotactile rendering systems based on object localization but also a concrete step towards a new dimension of human-machine interaction.
Taktil Video
APA, Harvard, Vancouver, ISO, and other styles
32

Yesiler, M. Furkan. "Data-driven musical version identification: accuracy, scalability and bias perspectives." Doctoral thesis, Universitat Pompeu Fabra, 2022. http://hdl.handle.net/10803/673264.

Full text
Abstract:
This dissertation aims at developing audio-based musical version identification (VI) systems for industry-scale corpora. To be employed in industrial use cases, such systems must demonstrate high performance on large-scale corpora while not favoring certain musicians or tracks above others. Therefore, the three main aspects we address in this dissertation are accuracy, scalability, and algorithmic bias of VI systems. We propose a data-driven model that incorporates domain knowledge in its network architecture and training strategy. We then take two main directions to further improve our model. Firstly, we experiment with data-driven fusion methods to combine information from models that process harmonic and melodic information, which greatly enhances identification accuracy. Secondly, we investigate embedding distillation techniques to reduce the size of the embeddings produced by our model, which reduces the requirements for data storage and, more importantly, retrieval time. Lastly, we analyze the algorithmic biases of our systems.
En esta tesis se desarrollan sistemas de identificación de versiones musicales basados en audio y aplicables en un entorno industrial. Por lo tanto, los tres aspectos que se abordan en esta tesis son el desempeño, escalabilidad, y los sesgos algorítmicos en los sistemas de identificación de versiones. Se propone un modelo dirigido por datos que incorpora conocimiento musical en su arquitectura de red y estrategia de entrenamiento, para lo cual se experimenta con dos enfoques. Primero, se experimenta con métodos de fusión dirigidos por datos para combinar la información de los modelos que procesan información melódica y armónica, logrando un importante incremento en la exactitud de la identificación. Segundo, se investigan técnicas para la destilación de embeddings para reducir su tamaño, lo cual reduce los requerimientos de almacenamiento de datos, y lo que es más importante, del tiempo de búsqueda. Por último, se analizan los sesgos algorítmicos de nuestros sistemas.
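The embedding distillation mentioned above can be sketched as training a small projection so that pairwise similarities among compact embeddings match those of the large teacher embeddings. This is a generic illustration, not the dissertation's systems:

```python
import torch
import torch.nn as nn

teacher_dim, student_dim = 2048, 256
proj = nn.Linear(teacher_dim, student_dim)   # compact "student" embedding
opt = torch.optim.Adam(proj.parameters(), lr=1e-3)

def distill_step(teacher_batch):
    """Match pairwise cosine similarities before and after compression,
    so retrieval rankings are approximately preserved at 1/8 the size."""
    t = nn.functional.normalize(teacher_batch, dim=1)
    s = nn.functional.normalize(proj(teacher_batch), dim=1)
    loss = ((t @ t.T) - (s @ s.T)).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

for _ in range(10):                          # toy loop on random "embeddings"
    distill_step(torch.randn(64, teacher_dim))
```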
APA, Harvard, Vancouver, ISO, and other styles
33

Lerner, Paul. "Répondre aux questions visuelles à propos d'entités nommées." Electronic Thesis or Diss., université Paris-Saclay, 2023. http://www.theses.fr/2023UPASG074.

Full text
Abstract:
Cette thèse se positionne à l'intersection de plusieurs domaines de recherche, le traitement automatique des langues, la Recherche d'Information (RI) et la vision par ordinateur, qui se sont unifiés autour des méthodes d'apprentissage de représentation et de pré-entraînement. Dans ce contexte, nous avons défini et étudié une nouvelle tâche multimodale : répondre aux questions visuelles à propos d'entités nommées (KVQAE). Dans ce cadre, nous nous sommes particulièrement intéressés aux interactions cross-modales et aux différentes façons de représenter les entités nommées. Nous avons également été attentifs aux données utilisées pour entraîner mais surtout évaluer les systèmes de question-réponse à travers différentes métriques. Plus précisément, nous avons proposé à cet effet un jeu de données, le premier de KVQAE comprenant divers types d'entités. Nous avons également défini un cadre expérimental pour traiter la KVQAE en deux étapes grâce à une base de connaissances non-structurée et avons identifié la RI comme principal verrou de la KVQAE, en particulier pour les questions à propos d'entités non-personnes. Afin d'améliorer l'étape de RI, nous avons étudié différentes méthodes de fusion multimodale, lesquelles sont pré-entraînées à travers une tâche originale : l'Inverse Cloze Task multimodale. Nous avons trouvé que ces modèles exploitaient une interaction cross-modale que nous n'avions pas considéré à l'origine, et qui permettrait de traiter l'hétérogénéité des représentations visuelles des entités nommées. Ces résultats ont été renforcés par une étude du modèle CLIP qui permet de modéliser cette interaction cross-modale directement. Ces expériences ont été menées tout en restant attentif aux biais présents dans le jeu de données ou les métriques d'évaluation, notamment les biais textuels qui affectent toute tâche multimodale
This thesis is positioned at the intersection of several research fields, Natural Language Processing, Information Retrieval (IR) and Computer Vision, which have unified around representation learning and pre-training methods. In this context, we defined and studied a new multimodal task: Knowledge-based Visual Question Answering about Named Entities (KVQAE). In this framework, we were particularly interested in cross-modal interactions and different ways of representing named entities. We also focused on the data used to train and, more importantly, evaluate Question Answering systems through different metrics. More specifically, we proposed a dataset for this purpose, the first in KVQAE comprising various types of entities. We also defined an experimental framework for dealing with KVQAE in two stages through an unstructured knowledge base and identified IR as the main bottleneck of KVQAE, especially for questions about non-person entities. To improve the IR stage, we studied different multimodal fusion methods, which are pre-trained through an original task: the Multimodal Inverse Cloze Task. We found that these models leveraged a cross-modal interaction that we had not originally considered, and which may address the heterogeneity of visual representations of named entities. These results were strengthened by a study of the CLIP model, which allows this cross-modal interaction to be modeled directly. These experiments were carried out while staying aware of biases present in the dataset or evaluation metrics, especially textual biases, which affect any multimodal task.
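The role a model like CLIP can play in the IR stage can be sketched as scoring knowledge-base passages by mixing textual similarity with cross-modal visual similarity to the question image. This is an illustration with assumed pre-computed, aligned embeddings, not the thesis system:

```python
import numpy as np

def rank_passages(q_text_emb, q_img_emb, p_text_embs, p_img_embs, w=0.5):
    """All embeddings are assumed L2-normalised and CLIP-aligned, so image
    and text vectors live in one space and dot products are comparable.
    p_*_embs are (n_passages, dim) matrices."""
    text_scores = p_text_embs @ q_text_emb
    img_scores = p_img_embs @ q_img_emb   # cross-modal: entity appearance
    scores = w * text_scores + (1 - w) * img_scores
    return np.argsort(-scores)            # best passages first
```

The weight w is a tunable assumption; a value below 1 lets the entity's visual appearance compensate for ambiguous question text.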
APA, Harvard, Vancouver, ISO, and other styles
34

Bahceci, Oktay. "Deep Neural Networks for Context Aware Personalized Music Recommendation : A Vector of Curation." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-210252.

Full text
Abstract:
Information Filtering and Recommender Systems have been used and implemented in various ways by various entities since the dawn of the Internet, and state-of-the-art approaches rely on Machine Learning and Deep Learning in order to create accurate and personalized recommendations for users in a given context. These models require big amounts of data with a variety of features, such as time, location and user data, in order to find correlations and patterns that classical models such as matrix factorization and collaborative filtering cannot. This thesis researches, implements and compares a variety of models, with a primary focus on Machine Learning and Deep Learning, for the task of music recommendation, and does so successfully by representing the task of recommendation as a multi-class extreme classification task with 100 000 distinct labels. Comparing fourteen different experiments, all implemented models successfully learn features such as time, location, user features and previous listening history in order to create context-aware personalized music predictions, and solve the cold-start problem by using user demographic information. The best model captures the intended label in its top-100 list of recommended items for more than 1/3 of the unseen data in an offline evaluation on randomly selected examples from the unseen following week.
Informationsfiltrering och rekommendationssystem har använts och implementerats på flera olika sätt från olika enheter sedan gryningen av Internet, och moderna tillvägagångssätt beror på Maskininlärning samt Djupinlärning för att kunna skapa precisa och personliga rekommendationer för användare i en given kontext. Dessa modeller kräver data i stora mängder med en varians av kännetecken såsom tid, plats och användardata för att kunna hitta korrelationer samt mönster som klassiska modeller såsom matrisfaktorisering samt samverkande filtrering inte kan. Detta examensarbete forskar, implementerar och jämför en mängd av modeller med fokus på Maskininlärning samt Djupinlärning för musikrekommendation och gör det med succé genom att representera rekommendationsproblemet som ett extremt multi-klass klassifikationsproblem med 100 000 unika klasser att välja utav. Genom att jämföra fjorton olika experiment, så lär alla modeller sig kännetecken såsom tid, plats, användarkännetecken och lyssningshistorik för att kunna skapa kontextberoende personaliserade musikprediktioner, och löser kallstartsproblemet genom användning av användares demografiska kännetecken, där den bästa modellen klarar av att fånga målklassen i sin rekommendationslista med längd 100 för mer än 1/3 av det osedda datat under en offline-evaluering, när slumpmässigt valda exempel från den osedda kommande veckan evalueras.
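The offline metric described above, whether the true track appears in a model's top-100 recommendations, can be sketched as follows (our own illustration with toy shapes; the real label space has 100 000 classes):

```python
import numpy as np

def hit_at_k(logits, true_labels, k=100):
    """logits: (n_examples, n_labels) scores over the label vocabulary;
    returns the fraction of examples whose true label is in the top k."""
    topk = np.argpartition(-logits, k, axis=1)[:, :k]
    hits = (topk == true_labels[:, None]).any(axis=1)
    return hits.mean()

logits = np.random.randn(8, 1000)        # toy: 1 000 labels instead of 100 000
labels = np.random.randint(0, 1000, 8)
print(hit_at_k(logits, labels, k=100))
```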
APA, Harvard, Vancouver, ISO, and other styles
35

Htait, Amal. "Sentiment analysis at the service of book search." Electronic Thesis or Diss., Aix-Marseille, 2019. http://www.theses.fr/2019AIXM0260.

Full text
Abstract:
Le Web est en croissance continue, et une quantité énorme de données est générée par les réseaux sociaux, permettant aux utilisateurs d'échanger une grande diversité d'informations. En outre, les textes au sein des réseaux sociaux sont souvent subjectifs. L'exploitation de cette subjectivité présente au sein des textes peut être un facteur important lors d'une recherche d'information. En particulier, cette thèse est réalisée pour répondre aux besoins de la plate-forme Books de Open Edition en matière d'amélioration de la recherche et la recommandation de livres, en plusieurs langues. La plateforme offre des informations générées par des utilisateurs, riches en sentiments. Par conséquent, l'analyse précédente, concernant l'exploitation de sentiment en recherche d'information, joue un rôle important dans cette thèse et peut servir l'objectif d'une amélioration de qualité de la recherche de livres en utilisant les informations générées par les utilisateurs. Par conséquent, nous avons choisi de suivre une voie principale dans cette thèse consistant à combiner les domaines analyse de sentiment (AS) et recherche d'information (RI), dans le but d'améliorer les suggestions de la recherche de livres. Nos objectifs peuvent être résumés en plusieurs points: • Une approche d'analyse de sentiment, facilement applicable sur différentes langues, peu coûteuse en temps et en données annotées. • De nouvelles approches pour l'amélioration de la qualité lors de la recherche de livres, basées sur l'utilisation de l'analyse de sentiment dans le filtrage, l'extraction et la classification des informations
Web technology is continuously growing, and a huge volume of data is generated on the social web, where users exchange a variety of information. In addition to the fact that social-web text may be rich in information, writers are often guided by sentiments that are reflected in their writings. Based on that concept, locating sentiment in a text can play an important role in information extraction. The purpose of this thesis is to improve the book search and recommendation quality of Open Edition's multilingual Books platform. The Books platform also offers additional information through user-generated content (e.g. book reviews) connected to the books and rich in emotions expressed in the users' writings. Therefore, the previous analysis, concerning locating sentiment in a text for information extraction, plays an important role in this thesis and can serve the purpose of quality improvement in book search, using the shared user-generated information. Accordingly, we choose to follow a main path in this thesis: combining the sentiment analysis (SA) and information retrieval (IR) fields in order to improve the quality of book search. Two objectives, serving the thesis's main purpose of improving IR quality using SA, are summarised in the following: • An approach for SA prediction, easily applicable to different languages, with low cost in time and annotated data. • New approaches for book-search quality improvement, based on employing SA in information filtering, retrieval and classification.
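Employing SA in retrieval, as the second objective proposes, can be illustrated by re-ranking retrieved books with a sentiment score computed over their reviews. The lexicon and weighting below are toy assumptions, not the thesis's predictor:

```python
POSITIVE = {"great", "moving", "brilliant"}
NEGATIVE = {"boring", "weak", "dull"}

def review_sentiment(text):
    """Crude lexicon score in [-1, 1]; a stand-in for a trained SA model."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / max(pos + neg, 1)

def rerank(results, reviews, weight=0.3):
    """results: list of (book_id, ir_score) pairs; boost books whose
    user reviews carry positive sentiment."""
    def boosted(item):
        book_id, ir_score = item
        sent = sum(review_sentiment(r) for r in reviews.get(book_id, []))
        return ir_score + weight * sent
    return sorted(results, key=boosted, reverse=True)

print(rerank([("b1", 1.0), ("b2", 0.9)],
             {"b2": ["a brilliant and moving story"]}))
```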
APA, Harvard, Vancouver, ISO, and other styles
36

Guillaumin, Matthieu. "Données multimodales pour l'analyse d'image." Phd thesis, Grenoble, 2010. http://tel.archives-ouvertes.fr/tel-00522278/en/.

Full text
Abstract:
La présente thèse s'intéresse à l'utilisation de méta-données textuelles pour l'analyse d'image. Nous cherchons à utiliser ces informations additionelles comme supervision faible pour l'apprentissage de modèles de reconnaissance visuelle. Nous avons observé un récent et grandissant intérêt pour les méthodes capables d'exploiter ce type de données car celles-ci peuvent potentiellement supprimer le besoin d'annotations manuelles, qui sont coûteuses en temps et en ressources. Nous concentrons nos efforts sur deux types de données visuelles associées à des informations textuelles. Tout d'abord, nous utilisons des images de dépêches qui sont accompagnées de légendes descriptives pour s'attaquer à plusieurs problèmes liés à la reconnaissance de visages. Parmi ces problèmes, la vérification de visages est la tâche consistant à décider si deux images représentent la même personne, et le nommage de visages cherche à associer les visages d'une base de données à leur noms corrects. Ensuite, nous explorons des modèles pour prédire automatiquement les labels pertinents pour des images, un problème connu sous le nom d'annotation automatique d'image. Ces modèles peuvent aussi être utilisés pour effectuer des recherches d'images à partir de mots-clés. Nous étudions enfin un scénario d'apprentissage multimodal semi-supervisé pour la catégorisation d'image. Dans ce cadre de travail, les labels sont supposés présents pour les données d'apprentissage, qu'elles soient manuellement annotées ou non, et absentes des données de test. Nos travaux se basent sur l'observation que la plupart de ces problèmes peuvent être résolus si des mesures de similarité parfaitement adaptées sont utilisées. Nous proposons donc de nouvelles approches qui combinent apprentissage de distance, modèles par plus proches voisins et méthodes par graphes pour apprendre, à partir de données visuelles et textuelles, des similarités visuelles spécifiques à chaque problème. Dans le cas des visages, nos similarités se concentrent sur l'identité des individus tandis que, pour les images, elles concernent des concepts sémantiques plus généraux. Expérimentalement, nos approches obtiennent des performances à l'état de l'art sur plusieurs bases de données complexes. Pour les deux types de données considérés, nous montrons clairement que l'apprentissage bénéficie de l'information textuelle supplémentaire résultant en l'amélioration de la performance des systèmes de reconnaissance visuelle.
APA, Harvard, Vancouver, ISO, and other styles
37

Guillaumin, Matthieu. "Données multimodales pour l'analyse d'image." Phd thesis, Grenoble, 2010. http://www.theses.fr/2010GRENM048.

Full text
Abstract:
La présente thèse s'intéresse à l'utilisation de méta-données textuelles pour l'analyse d'image. Nous cherchons à utiliser ces informations additionelles comme supervision faible pour l'apprentissage de modèles de reconnaissance visuelle. Nous avons observé un récent et grandissant intérêt pour les méthodes capables d'exploiter ce type de données car celles-ci peuvent potentiellement supprimer le besoin d'annotations manuelles, qui sont coûteuses en temps et en ressources. Nous concentrons nos efforts sur deux types de données visuelles associées à des informations textuelles. Tout d'abord, nous utilisons des images de dépêches qui sont accompagnées de légendes descriptives pour s'attaquer à plusieurs problèmes liés à la reconnaissance de visages. Parmi ces problèmes, la vérification de visages est la tâche consistant à décider si deux images représentent la même personne, et le nommage de visages cherche à associer les visages d'une base de données à leur noms corrects. Ensuite, nous explorons des modèles pour prédire automatiquement les labels pertinents pour des images, un problème connu sous le nom d'annotation automatique d'image. Ces modèles peuvent aussi être utilisés pour effectuer des recherches d'images à partir de mots-clés. Nous étudions enfin un scénario d'apprentissage multimodal semi-supervisé pour la catégorisation d'image. Dans ce cadre de travail, les labels sont supposés présents pour les données d'apprentissage, qu'elles soient manuellement annotées ou non, et absentes des données de test. Nos travaux se basent sur l'observation que la plupart de ces problèmes peuvent être résolus si des mesures de similarité parfaitement adaptées sont utilisées. Nous proposons donc de nouvelles approches qui combinent apprentissage de distance, modèles par plus proches voisins et méthodes par graphes pour apprendre, à partir de données visuelles et textuelles, des similarités visuelles spécifiques à chaque problème. Dans le cas des visages, nos similarités se concentrent sur l'identité des individus tandis que, pour les images, elles concernent des concepts sémantiques plus généraux. Expérimentalement, nos approches obtiennent des performances à l'état de l'art sur plusieurs bases de données complexes. Pour les deux types de données considérés, nous montrons clairement que l'apprentissage bénéficie de l'information textuelle supplémentaire résultant en l'amélioration de la performance des systèmes de reconnaissance visuelle
This dissertation delves into the use of textual metadata for image understanding. We seek to exploit this additional textual information as weak supervision to improve the learning of recognition models. There is a recent and growing interest in methods that exploit such data because they can potentially alleviate the need for manual annotation, which is a costly and time-consuming process. We focus on two types of visual data with associated textual information. First, we exploit news images that come with descriptive captions to address several face-related tasks, including face verification, which is the task of deciding whether two images depict the same individual, and face naming, the problem of associating faces in a data set to their correct names. Second, we consider data consisting of images with user tags. We explore models for automatically predicting tags for new images, i.e. image auto-annotation, which can also be used for keyword-based image search. We also study a multimodal semi-supervised learning scenario for image categorisation. In this setting, the tags are assumed to be present in both labelled and unlabelled training data, while they are absent from the test data. Our work builds on the observation that most of these tasks can be solved if perfectly adequate similarity measures are used. We therefore introduce novel approaches that involve metric learning, nearest-neighbour models and graph-based methods to learn, from the visual and textual data, task-specific similarities. For faces, our similarities focus on the identities of the individuals while, for images, they address more general semantic visual concepts. Experimentally, our approaches achieve state-of-the-art results on several standard and challenging data sets. On both types of data, we clearly show that learning using additional textual information improves the performance of visual recognition systems.
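The nearest-neighbour side of these approaches can be illustrated with a weighted kNN tag-propagation sketch for image auto-annotation. This is a simplified illustration of the general idea, not the exact model of the thesis:

```python
import numpy as np

def predict_tags(query_feat, train_feats, train_tags, k=5):
    """Predict tag scores for a new image as a distance-weighted vote of its
    k visually nearest training images. train_tags is a binary matrix of
    shape (n_images, n_tags)."""
    d = np.linalg.norm(train_feats - query_feat, axis=1)
    nn = np.argsort(d)[:k]                 # indices of k nearest neighbours
    w = np.exp(-d[nn])
    w /= w.sum()                           # normalised neighbour weights
    return w @ train_tags[nn]              # per-tag relevance scores

# Toy usage: 10 training images, 4 tags, 8-d features.
feats = np.random.rand(10, 8)
tags = np.random.randint(0, 2, (10, 4))
print(predict_tags(np.random.rand(8), feats, tags))
```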
APA, Harvard, Vancouver, ISO, and other styles
38

Gong, Rong. "Automatic assessment of singing voice pronunciation: a case study with Jingju music." Doctoral thesis, Universitat Pompeu Fabra, 2018. http://hdl.handle.net/10803/664421.

Full text
Abstract:
Online learning has altered music education remarkably in the last decade. A large and increasing number of music-performance learners participate in online music learning courses due to their easy accessibility and freedom from time-space constraints. Singing can be considered the most basic form of music performance. Automatic singing voice assessment, an important task in Music Information Retrieval (MIR), aims to extract musically meaningful information and measure the quality of learners' singing voice. Singing correctness and quality are culture-specific and their assessment requires culture-aware methodologies. Jingju (also known as Beijing opera) music is one of the representative music traditions in China and has spread to many places in the world where there are Chinese communities. Our goal is to tackle unexplored automatic singing voice pronunciation assessment problems in jingju music, to make current eurogeneric assessment approaches more culture-aware, and in return, to develop new assessment approaches which can be generalized to other musical traditions.
El aprendizaje en línea ha cambiado notablemente la educación musical en la pasada década. Una cada vez mayor cantidad de estudiantes de interpretación musical participan en cursos de aprendizaje musical en línea por su fácil accesibilidad y no estar limitada por restricciones de tiempo y espacio. Puede considerarse el canto como la forma más básica de interpretación. La evaluación automática de la voz cantada, como tarea importante en la disciplina de Recuperación de Información Musical (MIR por sus siglas en inglés) tiene como objetivo la extracción de información musicalmente significativa y la medición de la calidad de la voz cantada del estudiante. La corrección y calidad del canto son específicas a cada cultura y su evaluación requiere metodologías con especificidad cultural. La música del jingju (también conocido como ópera de Beijing) es una de las tradiciones musicales más representativas de China y se ha difundido a muchos lugares del mundo donde existen comunidades chinas. Nuestro objetivo es abordar problemas aún no explorados sobre la evaluación automática de la voz cantada en la música del jingju, hacer que las propuestas eurogenéticas actuales sobre evaluación sean más específicas culturalmente, y al mismo tiempo, desarrollar nuevas propuestas sobre evaluación que puedan ser generalizables para otras tradiciones musicales.
APA, Harvard, Vancouver, ISO, and other styles
39

Pinho, Eduardo Miguel Coutinho Gomes de. "Multimodal information retrieval in medical imaging archives." Doctoral thesis, 2019. http://hdl.handle.net/10773/29206.

Full text
Abstract:
The proliferation of digital medical imaging modalities in hospitals and other diagnostic facilities has created huge repositories of valuable data, often not fully explored. Moreover, the past few years show a growing trend of data production. As such, studying new ways to index, process and retrieve medical images becomes an important subject to be addressed by the wider community of radiologists, scientists and engineers. Content-based image retrieval, which encompasses various methods, can exploit the visual information of a medical imaging archive, and is known to be beneficial to practitioners and researchers. However, the integration of the latest systems for medical image retrieval into clinical workflows is still rare, and their effectiveness still shows room for improvement. This thesis proposes solutions and methods for multimodal information retrieval, in the context of medical imaging repositories. The major contributions are a search engine for medical imaging studies supporting multimodal queries in an extensible archive; a framework for automated labeling of medical images for content discovery; and an assessment and proposal of feature learning techniques for concept detection from medical images, exhibiting greater potential than feature extraction algorithms that were pertinently used in similar tasks. These contributions, each in their own dimension, seek to narrow the scientific and technical gap towards the development and adoption of novel multimodal medical image retrieval systems, to ultimately become part of the workflows of medical practitioners, teachers, and researchers in healthcare.
A proliferação de modalidades de imagem médica digital, em hospitais, clínicas e outros centros de diagnóstico, levou à criação de enormes repositórios de dados, frequentemente não explorados na sua totalidade. Além disso, os últimos anos revelam, claramente, uma tendência para o crescimento da produção de dados. Portanto, torna-se importante estudar novas maneiras de indexar, processar e recuperar imagens médicas, por parte da comunidade alargada de radiologistas, cientistas e engenheiros. A recuperação de imagens baseada em conteúdo, que envolve uma grande variedade de métodos, permite a exploração da informação visual num arquivo de imagem médica, o que traz benefícios para os médicos e investigadores. Contudo, a integração destas soluções nos fluxos de trabalho é ainda rara e a eficácia dos mais recentes sistemas de recuperação de imagem médica pode ser melhorada. A presente tese propõe soluções e métodos para recuperação de informação multimodal, no contexto de repositórios de imagem médica. As contribuições principais são as seguintes: um motor de pesquisa para estudos de imagem médica com suporte a pesquisas multimodais num arquivo extensível; uma estrutura para a anotação automática de imagens; e uma avaliação e proposta de técnicas de representation learning para deteção automática de conceitos em imagens médicas, exibindo maior potencial do que as técnicas de extração de features visuais outrora pertinentes em tarefas semelhantes. Estas contribuições procuram reduzir as dificuldades técnicas e científicas para o desenvolvimento e adoção de sistemas modernos de recuperação de imagem médica multimodal, de modo a que estes façam finalmente parte das ferramentas típicas dos profissionais, professores e investigadores da área da saúde.
Programa Doutoral em Informática
APA, Harvard, Vancouver, ISO, and other styles
40

Hong, Wei-Jhe (洪煒哲). "An XML-based Metadata Embedding System for Context-aware Image Retrieval." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/y6am66.

Full text
Abstract:
Master's thesis
National Formosa University
Institute of Electro-Optical and Materials Science
Academic year 97 (2008)
A wide variety of digital electronic devices are used in modern society, producing a rapidly growing amount of multimedia content. This growth has led to many image-management programs, each of which defines its own structure for descriptive information and indexes in order to improve search and management efficiency. However, there is no standard specification shared between programs, so descriptive information cannot be exchanged between them. This thesis presents a method for embedding image metadata in JPEG image files. The method uses XML image metadata and builds an index over that metadata to accelerate access and search. The JPEG standard is chosen as the carrier format because it provides application segments that can be extended for different applications; the metadata is embedded in a specific application segment so that it can always be retrieved from a known location. The metadata is created according to the XML standard, with an XML Schema describing its structure. Because the metadata is embedded in each picture, capturing information and searching would otherwise require a large amount of file-access time per image. To solve this problem, we propose a metadata index and an index thumbnail. The metadata index groups descriptions of the same category into dictionaries, and dictionary lookups are accelerated with a hash function; the index supports consistent metadata updates and quick search. The index thumbnail is created by combining individual thumbnails, reducing the file size to about 20% of the original. To provide more precise search, the embedded metadata is combined with keywords, MPEG-7 image features and user-defined information. Keywords: context-aware, XML, embedded metadata, dictionary index, image search
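The embedding step itself can be sketched at the byte level: a JPEG file starts with the SOI marker (0xFFD8), and an application segment can be spliced in right after it. The choice of APP11 below is illustrative only, and the payload must stay under the 64 KB segment limit:

```python
def embed_xml_in_jpeg(jpeg_bytes: bytes, xml_metadata: str) -> bytes:
    """Insert XML metadata as an APP11 segment right after the SOI marker."""
    assert jpeg_bytes[:2] == b"\xff\xd8", "not a JPEG file"
    payload = xml_metadata.encode("utf-8")
    assert len(payload) <= 65533, "JPEG segments are limited to 64 KB"
    length = len(payload) + 2                  # length field counts itself
    segment = b"\xff\xeb" + length.to_bytes(2, "big") + payload
    return jpeg_bytes[:2] + segment + jpeg_bytes[2:]

# Usage sketch:
# with open("photo.jpg", "rb") as f:
#     data = embed_xml_in_jpeg(f.read(),
#                              "<metadata><tag>beach</tag></metadata>")
```

Because the segment sits at a fixed, well-defined position, a reader can recover the XML without decoding the image data at all.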
APA, Harvard, Vancouver, ISO, and other styles
41

"Video2Vec: Learning Semantic Spatio-Temporal Embedding for Video Representations." Master's thesis, 2016. http://hdl.handle.net/2286/R.I.40765.

Full text
Abstract:
High-level inference tasks in video applications such as recognition, video retrieval, and zero-shot classification have become an active research area in recent years. One fundamental requirement for such applications is to extract high-quality features that maintain high-level information in the videos. Many video feature extraction algorithms have been proposed, such as STIP, HOG3D, and Dense Trajectories. These algorithms are often referred to as "handcrafted" features, as they were deliberately designed based on some reasonable considerations. However, these algorithms may fail when dealing with high-level tasks or complex scene videos. Due to the success of using deep convolutional neural networks (CNNs) to extract global representations for static images, researchers have been using similar techniques to tackle video contents. Typical techniques first extract spatial features by processing raw images using deep convolutional architectures designed for static image classification. Then simple average, concatenation or classifier-based fusion/pooling methods are applied to the extracted features. I argue that features extracted in such ways do not acquire enough representative information since videos, unlike images, should be characterized as a temporal sequence of semantically coherent visual contents and thus need to be represented in a manner considering both semantic and spatio-temporal information. In this thesis, I propose a novel architecture to learn semantic spatio-temporal embedding for videos to support high-level video analysis. The proposed method encodes video spatial and temporal information separately by employing a deep architecture consisting of two channels of convolutional neural networks (capturing appearance and local motion) followed by their corresponding Fully Connected Gated Recurrent Unit (FC-GRU) encoders for capturing the longer-term temporal structure of the CNN features. The resultant spatio-temporal representation (a vector) is used to learn a mapping via a Fully Connected Multilayer Perceptron (FC-MLP) to the word2vec semantic embedding space, leading to a semantic interpretation of the video vector that supports high-level analysis. I evaluate the usefulness and effectiveness of this new video representation by conducting experiments on action recognition, zero-shot video classification, and semantic (word-to-video) video retrieval, using the UCF101 action recognition dataset.
Dissertation/Thesis
Masters Thesis Computer Science 2016
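The described architecture can be sketched structurally as two GRU encoders over per-frame CNN features followed by an MLP into a word2vec-sized space. Dimensions and names below are hypothetical, not the thesis code:

```python
import torch
import torch.nn as nn

class Video2VecSketch(nn.Module):
    """Appearance and motion CNN features -> GRU encoders -> MLP -> 300-d
    word2vec-style semantic space."""
    def __init__(self, feat_dim=2048, hidden=512, sem_dim=300):
        super().__init__()
        self.gru_app = nn.GRU(feat_dim, hidden, batch_first=True)
        self.gru_mot = nn.GRU(feat_dim, hidden, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, 512), nn.ReLU(), nn.Linear(512, sem_dim))

    def forward(self, app_seq, mot_seq):
        # app_seq, mot_seq: (batch, n_frames, feat_dim) per-frame CNN features
        _, h_app = self.gru_app(app_seq)
        _, h_mot = self.gru_mot(mot_seq)
        video_vec = torch.cat([h_app[-1], h_mot[-1]], dim=1)
        return self.mlp(video_vec)    # compare to label word2vec vectors

model = Video2VecSketch()
out = model(torch.randn(2, 16, 2048), torch.randn(2, 16, 2048))
print(out.shape)                      # torch.Size([2, 300])
```

Zero-shot classification then reduces to a nearest-neighbour search between this output vector and the word2vec vectors of candidate class names.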
APA, Harvard, Vancouver, ISO, and other styles
42

Duan, Lingyu. "Multimodal mid-level representations for semantic analysis of broadcast video." Thesis, 2008. http://hdl.handle.net/1959.13/25819.

Full text
Abstract:
Research Doctorate - Doctor of Philosophy (PhD)
This thesis investigates the problem of seeking multimodal mid-level representations for semantic analysis of broadcast video. The problem is of interest as humans tend to use high-level semantic concepts when querying and browsing ever-increasing multimedia databases, yet generic low-level content metadata available from automated processing deals only with representing perceived content, but not its semantics. Multimodal mid-level representations refer to intermediate representations of multimedia signals that make various kinds of knowledge explicit and that expose various kinds of constraints within the context and knowledge assumed by the analysis system. Semantic multimedia analysis tries to establish the links from the feature descriptors and the syntactic elements to the domain semantics. The goal of this thesis is to devise a mid-level representation framework for detecting semantics from broadcast video, using supervised and data-driven approaches to represent domain knowledge in a manner that facilitates inferencing, i.e., answering the questions asked by higher-level analysis. In our framework, we attempt to address three sub-problems: context-dependent feature extraction, semantic video shot classification, and integration of multimodal cues towards semantic analysis. We propose novel models for the representations of low-level multimedia features. We employ dominant modes in the feature space to characterize color and motion in a nonparametric manner. With the combined use of data-driven mode seeking and supervised learning, we are able to capture contextual information of broadcast video and yield semantically meaningful color and motion features. We present the novel concepts of semantic video shot classes towards an effective approach for reverse engineering of the broadcast video capturing and editing processes. Such concepts link the computational representations of low-level multimedia features with video shot size and the main subject within a shot in the broadcast video stream. The linking, subject to the domain constraints, is achieved by statistical learning. We develop solutions for detecting sports events and classifying commercial spots from broadcast video streams. This is realized by integrating multiple modalities, in particular the text-based external resources. The alignment across modalities is based on semantic video shot classes. With multimodal mid-level representations, we are able to automatically extract rich semantics from sports programs and commercial spots, with promising accuracies. These findings demonstrate the potential of our framework of constructing mid-level representations to narrow the semantic gap, and it shows broad promise in adapting to new content domains.
APA, Harvard, Vancouver, ISO, and other styles
43

Lu, Hung-Tsung, and 盧宏宗. "Semantic Retrieval of Personal Photos Using Multimodal Deep Autoencoder Fusing Visual and Speech Features." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/58fvxy.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Mourão, André Belchior. "Towards an Architecture for Efficient Distributed Search of Multimodal Information." Doctoral thesis, 2018. http://hdl.handle.net/10362/38850.

Full text
Abstract:
The creation of very large-scale multimedia search engines, with more than one billion images and videos, is a pressing need of digital societies where data is generated by multiple connected devices. Distributing search indexes across cloud environments is the inevitable solution to the increasing scale of image and video collections. The distribution of such indexes in this setting raises multiple challenges, such as the even partitioning of the data space, load balancing across index nodes, and the fusion of results computed over multiple nodes. The main question behind this thesis is: how can the computational complexity of multimedia retrieval be reduced and distributed? This thesis studies the extension of sparse hash inverted indexing to distributed settings. The main goal is to ensure that indexes are uniformly distributed across computing nodes while keeping similar documents on the same nodes. Load balancing is performed at both the node and index level, to guarantee that the retrieval process is not delayed by nodes that have to inspect larger subsets of the index. Multimodal search requires combining the search results from individual modalities and document features. This thesis studies rank fusion techniques focused on reducing complexity by automatically selecting only the features that improve retrieval effectiveness. The achievements of this thesis span both distributed indexing and rank fusion research. Experiments across multiple datasets show that sparse hashes can be used to distribute documents and queries over index nodes in a balanced and redundant manner. Rank fusion results show that it is possible to reduce retrieval complexity and improve efficiency by searching only a subset of the feature indexes.
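A toy sketch of the core indexing idea, sparse hash codes routed to posting lists spread over nodes, may help; the hash construction (top-k dimensions of a random projection) and the dimension-to-node routing below are illustrative assumptions, not the thesis's actual scheme.

```python
# Sketch: each document gets a sparse binary code; each active code dimension
# owns a posting list; dimensions are assigned to nodes. Similar documents
# share active dimensions, so they tend to land in the same posting lists,
# and a query only inspects its own K posting lists.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
D, BITS, K, NODES = 128, 512, 8, 4       # feature dim, code dim, active bits, nodes
P = rng.standard_normal((D, BITS))       # shared random projection

def sparse_hash(x):
    # active dimensions = indices of the K largest projections
    return np.argsort(x @ P)[-K:]

nodes = [defaultdict(list) for _ in range(NODES)]  # one inverted index per node

def index_doc(doc_id, x):
    for dim in sparse_hash(x):
        nodes[dim % NODES][dim].append(doc_id)     # dimension -> node routing

def search(q):
    votes = defaultdict(int)
    for dim in sparse_hash(q):                     # only K posting lists inspected
        for doc_id in nodes[dim % NODES][dim]:
            votes[doc_id] += 1
    return sorted(votes, key=votes.get, reverse=True)

docs = rng.standard_normal((1000, D))
for i, d in enumerate(docs):
    index_doc(i, d)
print(search(docs[42] + 0.05 * rng.standard_normal(D))[:5])  # 42 should rank high
```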
APA, Harvard, Vancouver, ISO, and other styles
45

Carvalho, José Ricardo de Abreu. "Pesquisa multimodal de imagens em dispositivos móveis." Master's thesis, 2021. http://hdl.handle.net/10400.13/3984.

Full text
Abstract:
Despite advances in the field of reverse image search, with increasingly robust and effective algorithms, there is still interest in improving search techniques and the user's experience of finding the images they have in mind. The main goal of this work was to develop an application for mobile devices (smartphones) that allows the user to find images through multimodal inputs. Thus, in addition to supporting several search modes (keywords, drawing/sketching, and images from the camera or already on the device), this dissertation proposes that the user can create an image from scratch by drawing, or edit/change an existing image, receiving feedback at each change/interaction. Throughout the search experience, the user can take the images found (those deemed relevant) and iteratively refine the search by editing them, converging on what they expect to find. The implementation of this proposal was based on Google's Cloud Vision API for obtaining results from image input, the Google Custom Search API for obtaining images from text input, and the ATsketchkit framework, which enabled drawing, on Apple's iOS system. Tests were carried out with a set of users with varying levels of experience in image search and drawing ability, assessing their preference among the different input methods, their satisfaction with the images retrieved, and the usability of the prototype.
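As an illustration of the kind of reverse-image-search call the prototype builds on, the following sketch queries the Cloud Vision REST endpoint's WEB_DETECTION feature from Python; treat the request shape and result fields as a sketch of standard API usage, not the dissertation's actual iOS client code.

```python
# Sketch: send an image (a sketch or camera photo) to the Cloud Vision
# images:annotate endpoint and collect visually similar image URLs.
import base64
import requests

API_KEY = "YOUR_API_KEY"  # placeholder credential
ENDPOINT = f"https://vision.googleapis.com/v1/images:annotate?key={API_KEY}"

def similar_images(image_path, max_results=10):
    with open(image_path, "rb") as f:
        content = base64.b64encode(f.read()).decode("utf-8")
    body = {"requests": [{
        "image": {"content": content},
        "features": [{"type": "WEB_DETECTION", "maxResults": max_results}],
    }]}
    resp = requests.post(ENDPOINT, json=body, timeout=30)
    resp.raise_for_status()
    web = resp.json()["responses"][0].get("webDetection", {})
    return [img["url"] for img in web.get("visuallySimilarImages", [])]

# In an interactive app, the current canvas or photo would be re-sent after
# each edit so the result list refreshes with every interaction.
print(similar_images("sketch.png"))
```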
APA, Harvard, Vancouver, ISO, and other styles
46

"Representation, Exploration, and Recommendation of Music Playlists." Master's thesis, 2019. http://hdl.handle.net/2286/R.I.54843.

Full text
Abstract:
abstract: Playlists have become a significant part of the music listening experience today because of digital cloud-based services such as Spotify, Pandora, and Apple Music. Owing to the meteoric rise in the usage of playlists, recommending playlists is crucial to music services today. Although much work has been done on playlist prediction, the area of playlist representation has not received the same level of attention. Over the last few years, sequence-to-sequence models, especially in the field of natural language processing, have shown the effectiveness of learned embeddings in capturing the semantic characteristics of sequences. Similar concepts can be applied to music to learn fixed-length representations for playlists, and the learned representations can then be used for downstream tasks such as playlist comparison and recommendation. In this thesis, the problem of learning a fixed-length playlist representation is formulated in an unsupervised manner using Neural Machine Translation (NMT), where playlists are interpreted as sentences and songs as words. This approach is compared with other encoding architectures and evaluated using the suite of tasks commonly used for evaluating sentence embeddings, along with a few additional tasks pertaining to music. The aim of the evaluation is to study the traits captured by the playlist embeddings so that they can be leveraged for music recommendation. This work lays the foundation for analyzing music playlists and learning the patterns that exist in them in an end-to-end manner. The thesis concludes with a discussion of future directions for this research and its potential impact in the domain of Music Information Retrieval.
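A minimal sketch of the "playlists as sentences, songs as words" idea: an embedding layer plus a GRU yields a fixed-length playlist vector, the component an NMT-style encoder-decoder would learn. Vocabulary size and dimensions are illustrative assumptions.

```python
# Sketch: encode a playlist (sequence of song IDs) into a fixed-length vector.
import torch
import torch.nn as nn

class PlaylistEncoder(nn.Module):
    def __init__(self, n_songs=50000, song_dim=128, hid=256):
        super().__init__()
        self.song_emb = nn.Embedding(n_songs, song_dim)   # song ID -> "word" vector
        self.gru = nn.GRU(song_dim, hid, batch_first=True)

    def forward(self, song_ids):
        # song_ids: (batch, playlist_len) integer song indices
        _, h = self.gru(self.song_emb(song_ids))
        return h[-1]                                      # fixed-length playlist vector

enc = PlaylistEncoder()
playlists = torch.randint(0, 50000, (8, 30))              # 8 playlists of 30 songs
vecs = enc(playlists)                                     # (8, 256)

# Downstream use: compare playlists by cosine similarity for recommendation.
sim = nn.functional.cosine_similarity(vecs[0:1], vecs[1:2])
print(vecs.shape, sim.item())
```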
Dissertation/Thesis
Masters Thesis Computer Science 2019
APA, Harvard, Vancouver, ISO, and other styles
47

He, Kun. "Learning deep embeddings by learning to rank." Thesis, 2018. https://hdl.handle.net/2144/34773.

Full text
Abstract:
We study the problem of embedding high-dimensional visual data into low-dimensional vector representations. This is an important component in many computer vision applications involving nearest-neighbor retrieval, as embedding techniques not only perform dimensionality reduction but can also capture task-specific semantic similarities. In this thesis, we use deep neural networks to learn vector embeddings, and develop a gradient-based optimization framework that is capable of optimizing ranking-based retrieval performance metrics, such as the widely used Average Precision (AP) and Normalized Discounted Cumulative Gain (NDCG). Our framework is applied in three applications. First, we study Supervised Hashing, which is concerned with learning compact binary vector embeddings for fast retrieval, and propose two novel solutions. The first solution optimizes Mutual Information as a surrogate ranking objective, while the other directly optimizes AP and NDCG, based on the discovery of their closed-form expressions for discrete Hamming distances. These optimization problems are NP-hard, so we derive continuous relaxations to enable gradient-based optimization with neural networks. Our solutions establish the state of the art on several image retrieval benchmarks. Next, we learn deep neural networks to extract local feature descriptors from image patches. Local features are used universally in low-level computer vision tasks that involve sparse feature matching, such as image registration and 3D reconstruction, and their matching is a nearest-neighbor retrieval problem. We leverage our AP optimization technique to learn both binary and real-valued descriptors for local image patches. Compared to competing approaches, our solution eliminates complex heuristics and performs more accurately in the tasks of patch verification, patch retrieval, and image matching. Lastly, we tackle Deep Metric Learning, the general problem of learning real-valued vector embeddings using deep neural networks. We propose a learning-to-rank solution based on optimizing a novel quantization-based approximation of AP. For downstream tasks such as retrieval and clustering, we demonstrate promising results on standard benchmarks, especially in the few-shot learning scenario, where the number of labeled examples per class is limited.
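To convey how a ranking metric like AP can be made amenable to gradient descent, here is a generic sigmoid-based soft-ranking relaxation of AP in PyTorch. This simple surrogate is an assumption for illustration; the thesis itself derives tie-aware closed forms and a quantization-based approximation rather than this particular relaxation.

```python
# Sketch: replace the hard rank comparisons inside AP with sigmoids so that
# gradients can flow back to the embedding network producing the distances.
import torch

def soft_ap(d, y, tau=0.05):
    # d: (N,) distances from one query to the database; y: (N,) binary relevance
    M = d.unsqueeze(1) - d.unsqueeze(0)              # M[i, j] = d_i - d_j
    S = torch.sigmoid(M / tau)                       # soft indicator "j outranks i"
    S = S - torch.diag(torch.diag(S))                # drop self-comparisons
    rank_all = 1 + S.sum(dim=1)                      # soft rank among all items
    rank_pos = 1 + (S * y.unsqueeze(0)).sum(dim=1)   # soft rank among positives only
    return ((rank_pos / rank_all) * y).sum() / y.sum()

d = torch.rand(100, requires_grad=True)              # stand-in for learned distances
y = (torch.rand(100) < 0.2).float()                  # stand-in relevance labels
loss = 1 - soft_ap(d, y)                             # maximizing AP = minimizing loss
loss.backward()
print(loss.item(), d.grad.shape)
```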
APA, Harvard, Vancouver, ISO, and other styles