Dissertations on the topic "Multimodal retrieval"
Browse the top 34 dissertations for research on the topic "Multimodal retrieval".
Adebayo, Kolawole John <1986>. „Multimodal Legal Information Retrieval“. Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2018. http://amsdottorato.unibo.it/8634/1/ADEBAYO-JOHN-tesi.pdf.
Chen, Jianan. „Deep Learning Based Multimodal Retrieval“. Electronic Thesis or Diss., Rennes, INSA, 2023. http://www.theses.fr/2023ISAR0019.
Multimodal tasks play a crucial role in the progression towards achieving general artificial intelligence (AI). The primary goal of multimodal retrieval is to employ machine learning algorithms to extract relevant semantic information, bridging the gap between different modalities such as visual images, linguistic text, and other data sources. It is worth noting that the information entropy associated with heterogeneous data for the same high-level semantics varies significantly, posing a significant challenge for multimodal models. Deep learning-based multimodal network models provide an effective solution to tackle the difficulties arising from substantial differences in information entropy. These models exhibit impressive accuracy and stability in large-scale cross-modal information matching tasks, such as image-text retrieval. Furthermore, they demonstrate strong transfer learning capabilities, enabling a well-trained model from one multimodal task to be fine-tuned and applied to a new multimodal task, even in scenarios involving few-shot or zero-shot learning. In our research, we develop a novel generative multimodal multi-view database specifically designed for the multimodal referential segmentation task. Additionally, we establish a state-of-the-art (SOTA) benchmark and multi-view metric for referring expression segmentation models in the multimodal domain. The results of our comparative experiments are presented visually, providing clear and comprehensive insights.
Böckmann, Christine, Jens Biele, Roland Neuber and Jenny Niebsch. „Retrieval of multimodal aerosol size distribution by inversion of multiwavelength data“. Universität Potsdam, 1997. http://opus.kobv.de/ubp/volltexte/2007/1436/.
Zhu, Meng. „Cross-modal semantic-associative labelling, indexing and retrieval of multimodal data“. Thesis, University of Reading, 2010. http://centaur.reading.ac.uk/24828/.
Kahn, Itamar. „Remembering the past : multimodal imaging of cortical contributions to episodic retrieval“. Thesis, Massachusetts Institute of Technology, 2005. http://hdl.handle.net/1721.1/33171.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Includes bibliographical references.
What is the nature of the neural processes that allow humans to remember past events? The theoretical framework adopted in this thesis builds upon cognitive models that suggest that episodic retrieval can be decomposed into two classes of computations: (1) recovery processes that serve to reactivate stored memories, making information from a past episode readily available, and (2) control processes that serve to guide the retrieval attempt and monitor/evaluate information arising from the recovery processes. A multimodal imaging approach that combined fMRI and MEG was adopted to gain insight into the spatial and temporal brain mechanisms supporting episodic retrieval. Chapter 1 reviews major findings and theories in the episodic retrieval literature grounding the open questions and controversies within the suggested framework. Chapter 2 describes an fMRI and MEG experiment that identified medial temporal cortical structures that signal item memory strength, thus supporting the perception of item familiarity. Chapter 3 describes an fMRI experiment that demonstrated that retrieval of contextual details involves reactivation of neural patterns engaged at encoding.
Further, leveraging this pattern of reactivation, it was demonstrated that false recognition may be accompanied by recollection. The fMRI experiment reported in Chapter 3, when combined with an MEG experiment reported in Chapter 4, directly addressed questions regarding the control processes engaged during episodic retrieval. In particular, Chapter 3 showed that parietal and prefrontal cortices contribute to controlling the act of arriving at a retrieval decision. Chapter 4 then illuminates the temporal characteristics of parietal activation during episodic retrieval, providing novel evidence about the nature of parietal responses and thus constraints on theories of parietal involvement in episodic retrieval. The conducted research targeted distinct aspects of the multi-faceted act of remembering the past. The obtained data contribute to the building of an anatomical and temporal "blueprint" documenting the cascade of neural events that unfold during attempts to remember, as well as when such attempts are met with success or lead to memory errors. In the course of framing this research within the context of cognitive models of retrieval, the obtained neural data reflect back on and constrain these theories of remembering.
by Itamar Kahn.
Ph.D.
Nag, Chowdhury Sreyasi [author]. „Text-image synergy for multimodal retrieval and annotation / Sreyasi Nag Chowdhury“. Saarbrücken : Saarländische Universitäts- und Landesbibliothek, 2021. http://d-nb.info/1240674139/34.
Lolich, María, and Susana Azzollini. „Phenomenological retrieval style of autobiographical memories in a sample of major depressed individuals“. Pontificia Universidad Católica del Perú, 2016. http://repositorio.pucp.edu.pe/index/handle/123456789/99894.
The retrieval of autobiographical memories is characterised by distinct phenomenological components. Given the absence of previous studies in Spanish-speaking populations, 34 in-depth interviews were conducted with individuals with and without major depressive disorder in the city of Buenos Aires (Argentina). The phenomenological components present in the retrieval of significant autobiographical memories were explored. The data were analysed qualitatively using Grounded Theory. During the descriptive analysis, seven phenomenological categories emerging from the discourse were detected. The axial and selective analyses identified two discursive axes: rhetorical-propositional and specificity-generality. The implications for affective regulation of adopting an amodal or a multimodal style of processing autobiographical information deserve further attention.
Valero-Mas, Jose J. „Towards Interactive Multimodal Music Transcription“. Doctoral thesis, Universidad de Alicante, 2017. http://hdl.handle.net/10045/71275.
Quack, Till. „Large scale mining and retrieval of visual data in a multimodal context“. Konstanz Hartung-Gorre, 2009. http://d-nb.info/993614620/04.
Saragiotis, Panagiotis. „Cross-modal classification and retrieval of multimodal data using combinations of neural networks“. Thesis, University of Surrey, 2006. http://epubs.surrey.ac.uk/843338/.
Fedel, Gabriel de Souza. „Busca multimodal para apoio à pesquisa em biodiversidade“. [s.n.], 2011. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275751.
Dissertation (master's) - Universidade Estadual de Campinas, Instituto de Computação
Resumo: A pesquisa em computação aplicada à biodiversidade apresenta muitos desafios, que vão desde o grande volume de dados altamente heterogêneos até a variedade de tipos de usuários. Isto gera a necessidade de ferramentas versáteis de recuperação. As ferramentas disponíveis ainda são limitadas e normalmente só consideram dados textuais, deixando de explorar a potencialidade da busca por dados de outra natureza, como imagens ou sons. Esta dissertação analisa os problemas de realizar consultas multimodais a partir de predicados que envolvem texto e imagem para o domínio de biodiversidade, especificando e implementando um conjunto de ferramentas para processar tais consultas. As contribuições do trabalho, validado com dados reais, incluem a construção de uma ontologia taxonômica associada a nomes vulgares e a possibilidade de apoiar dois perfis de usuários (especialistas e leigos). Estas características estendem o escopo da consultas atualmente disponíveis em sistemas de biodiversidade. Este trabalho está inserido no projeto Bio-CORE, uma parceria entre pesquisadores de computação e biologia para criar ferramentas computacionais para dar apoio à pesquisa em biodiversidade
Abstract: Research on computing applied to biodiversity presents several challenges, ranging from the massive volumes of highly heterogeneous data to the variety in user profiles. This kind of scenario requires versatile data retrieval and management tools. Available tools are still limited. Most often, they only consider textual data and do not take advantage of the multiple data types available, such as images or sounds. This dissertation discusses issues concerning multimodal queries that involve both text and images as search parameters, for the domain of biodiversity. It presents the specification and implementation of a set of tools to process such queries, which were validated with real data from Unicamp's Zoology Museum. The main contributions also include the construction of a taxonomic ontology that includes species common names, and support for both researchers and non-experts in queries. Such features extend the scope of queries available in biodiversity information systems. This research is associated with the Biocore project, jointly conducted by researchers in computing and biology, to design and develop computational tools to support research in biodiversity.
Master's degree
Databases
Master of Computer Science
Dyar, Samuel S. „A multimodal speech interface for dynamic creation and retrieval of geographical landmarks on a mobile device“. Thesis, Massachusetts Institute of Technology, 2010. http://hdl.handle.net/1721.1/62638.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 140).
As mobile devices become more powerful, researchers look to develop innovative applications that use new and effective means of input. Furthermore, developers must exploit the device's many capabilities (GPS, camera, touch screen, etc) in order to make equally powerful applications. This thesis presents the development of a multimodal system that allows users to create and share informative geographical landmarks using Android-powered smart-phones. The content associated with each landmark is dynamically integrated into the system's vocabulary, which allows users to easily use speech to access landmarks by the information related to them. The initial results of releasing the application on the Android Market have been encouraging, but also suggest that improvements need to be made to the system.
by Samuel S. Dyar.
M.Eng.
Calumby, Rodrigo Tripodi 1985. „Recuperação multimodal de imagens com realimentação de relevância baseada em programação genética“. [s.n.], 2010. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275814.
Dissertation (master's) - Universidade Estadual de Campinas, Instituto de Computação
Resumo: Este trabalho apresenta uma abordagem para recuperação multimodal de imagens com realimentação de relevância baseada em programação genética. Supõe-se que cada imagem da coleção possui informação textual associada (metadado, descrição textual, etc.), além de ter suas propriedades visuais (por exemplo, cor e textura) codificadas em vetores de características. A partir da informação obtida ao longo das iterações de realimentação de relevância, programação genética é utilizada para a criação de funções de combinação de medidas de similaridades eficazes. Com essas novas funções, valores de similaridades diversos são combinados em uma única medida, que mais adequadamente reflete as necessidades do usuário. As principais contribuições deste trabalho consistem na proposta e implementação de dois arcabouços. O primeiro, RFCore, é um arcabouço genérico para atividades de realimentação de relevância para manipulação de objetos digitais. O segundo, MMRFGP, é um arcabouço para recuperação de objetos digitais com realimentação de relevância baseada em programação genética, construído sobre o RFCore. O método proposto de recuperação multimodal de imagens foi validado sobre duas coleções de imagens, uma desenvolvida pela Universidade de Washington e outra da ImageCLEF Photographic Retrieval Task. A abordagem proposta mostrou melhores resultados para recuperação multimodal frente a utilização das modalidades isoladas. Além disso, foram obtidos resultados para recuperação visual e multimodal melhores do que as melhores submissões para a ImageCLEF Photographic Retrieval Task 2008
Abstract: This work presents an approach for multimodal content-based image retrieval with relevance feedback based on genetic programming. We assume that there is textual information (e.g., metadata, textual descriptions) associated with collection images. Furthermore, image content properties (e.g., color and texture) are characterized by image descriptors. Given the information obtained over the relevance feedback iterations, genetic programming is used to create effective combination functions that combine similarities associated with different features. Hence, using these new functions, the different similarities are combined into a single measure that more properly meets the user's needs. The main contribution of this work is the proposal and implementation of two frameworks. The first one, RFCore, is a generic framework for relevance feedback tasks over digital objects. The second one, MMRF-GP, is a framework for digital object retrieval with relevance feedback based on genetic programming, and it was built on top of RFCore. We have validated the proposed multimodal image retrieval approach over two datasets, one from the University of Washington and another from the ImageCLEF Photographic Retrieval Task. Our approach has yielded the best results for multimodal image retrieval when compared with one-modality approaches. Furthermore, it has achieved better results for visual and multimodal image retrieval than the best submissions for ImageCLEF Photographic Retrieval Task 2008.
Master's degree
Information Retrieval Systems
Master of Computer Science
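The Calumby abstract above describes combining several similarity values (textual and visual) into a single ranking measure, with candidate combination functions judged against the user's relevance feedback. The abstract does not give the evolved functions or the MMRF-GP interfaces, so the following is only a minimal sketch under stated assumptions: `combine` is one hand-written candidate of the kind a genetic programming search might explore, and the score names, image identifiers and the precision-based `fitness` are all hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical per-modality similarity scores in [0, 1] for one image.
Scores = Dict[str, float]


def combine(sim: Scores) -> float:
    """One hand-written candidate combination function (an assumption,
    not a function taken from the thesis)."""
    return (sim["text"] * sim["color"]) ** 0.5 + 0.5 * sim["texture"]


def rank(collection: Dict[str, Scores],
         combiner: Callable[[Scores], float]) -> List[Tuple[str, float]]:
    """Score every image with the combiner and sort best-first."""
    scored = [(image_id, combiner(sim)) for image_id, sim in collection.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def fitness(ranking: List[Tuple[str, float]], relevant: Set[str], k: int = 2) -> float:
    """Precision at k against the images the user marked as relevant."""
    top = [image_id for image_id, _ in ranking[:k]]
    return sum(1 for image_id in top if image_id in relevant) / k


collection = {
    "img_a": {"text": 0.9, "color": 0.4, "texture": 0.2},
    "img_b": {"text": 0.1, "color": 0.9, "texture": 0.6},
    "img_c": {"text": 0.8, "color": 0.8, "texture": 0.6},
}
ranking = rank(collection, combine)
feedback_fitness = fitness(ranking, relevant={"img_c", "img_b"})
```

In a genetic programming loop, many such candidate functions would be generated and mutated, and `fitness` would decide which survive to the next feedback iteration; only the scoring and evaluation step is sketched here.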
Durak, Nurcan. „Semantic Video Modeling And Retrieval With Visual, Auditory, Textual Sources“. Master's thesis, METU, 2004. http://etd.lib.metu.edu.tr/upload/12605438/index.pdf.
Oztarak, Hakan. „Structural And Event Based Multimodal Video Data Modeling“. Master's thesis, METU, 2005. http://etd.lib.metu.edu.tr/upload/12606919/index.pdf.
Rubio, Romano Antonio. „Fashion discovery : a computer vision approach“. Doctoral thesis, TDX (Tesis Doctorals en Xarxa), 2021. http://hdl.handle.net/10803/672423.
The semantic interpretation of fashion images is undoubtedly one of the most challenging domains for computer vision. Slight variations in colour and shape can give an image different meanings or interpretations. It is a domain closely tied to subjective human understanding, but also to the interpretation and recognition of scenes and contexts. Being able to extract fashion-specific information from images and interpret it correctly can be useful in many situations and can help in understanding the information underlying an image. Moreover, fashion is one of the most important businesses worldwide, with an estimated value of three trillion dollars and a constantly growing online market, which increases the interest in image-based algorithms for searching, classifying or recommending garments. This doctoral thesis aims to solve specific problems related to processing data from online fashion stores, ranging from the most basic pixel-level information to a more abstract understanding that allows conclusions to be drawn about the garments present in an image, taking advantage of the multimodality of the available data to develop some of the solutions. The contributions include: a new superpixel extraction method aimed at improving the annotation process for fashion images; the construction of a common space for representing fashion-related images and texts; and the application of that space to the task of identifying the main product in an image that shows a set of garments. In summary, fashion is a complex domain at many levels in terms of computer vision and machine learning, and developing specific algorithms capable of capturing the essential information from images and texts is not a trivial task. To address some of the challenges it poses, and considering that this is an industrial doctorate, we contribute to the topic with a variety of solutions that can improve the performance of many tasks that are extremely useful for the online fashion industry.
Automatic control, robotics and vision
SIMONETTA, FEDERICO. „MUSIC INTERPRETATION ANALYSIS. A MULTIMODAL APPROACH TO SCORE-INFORMED RESYNTHESIS OF PIANO RECORDINGS“. Doctoral thesis, Università degli Studi di Milano, 2022. http://hdl.handle.net/2434/918909.
Ismail, Nor Azman. „Flexible photo retrieval (FlexPhoReS) : a prototype for multimodel personal digital photo retrieval“. Thesis, Loughborough University, 2007. https://dspace.lboro.ac.uk/2134/12924.
Bonardi, Fabien. „Localisation visuelle multimodale visible/infrarouge pour la navigation autonome“. Thesis, Normandie, 2017. http://www.theses.fr/2017NORMR028/document.
The field of autonomous navigation gathers the set of algorithms that automate the movements of a mobile robot. The case study of this thesis focuses on the outdoor localisation problem with additional constraints: the use of visual sensors only, with variable specifications (geometry, modality, etc.), and long-term appearance changes of the surrounding environment. Both types of constraints are still rarely studied in the state of the art. Our main contribution concerns the description and compression steps of the data extracted from images. We developed a method called PHROG which represents data as a visual-words histogram. Results obtained on several image datasets show an improvement in scene recognition performance compared to state-of-the-art methods. In a navigation context, acquired images are sequential, so we can envision a filtering method to avoid faulty localisation estimates. Two probabilistic filtering approaches are proposed: the first defines a simple movement model with a histogram filter, and the second sets up a more complex model using visual odometry and a particle filter.
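The Bonardi abstract mentions comparing images through visual-word histograms and smoothing sequential localisation with a histogram filter. The thesis's PHROG descriptor and its motion model are not specified in the abstract, so the sketch below is an illustration under assumptions: histograms are taken as given (L1-normalised), similarity is plain histogram intersection, the route is assumed to be a closed loop, and the robot is assumed either to stay at a place or to advance by one place per step.

```python
from typing import List


def histogram_intersection(h1: List[float], h2: List[float]) -> float:
    """Similarity of two L1-normalised visual-word histograms, in [0, 1]."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def predict(belief: List[float], stay: float = 0.5) -> List[float]:
    """Motion step (assumed model): stay put with probability `stay`,
    otherwise advance one place along a closed-loop route."""
    n = len(belief)
    return [stay * belief[i] + (1.0 - stay) * belief[(i - 1) % n] for i in range(n)]


def update(belief: List[float], likelihoods: List[float]) -> List[float]:
    """Measurement step: reweight each place by its image similarity and renormalise."""
    posterior = [b * l for b, l in zip(belief, likelihoods)]
    total = sum(posterior)
    if total == 0.0:
        # No place explains the observation: fall back to a uniform belief.
        return [1.0 / len(belief)] * len(belief)
    return [p / total for p in posterior]


# Hypothetical reference histograms for three mapped places (3 visual words).
reference_histograms = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
]
query_histogram = [0.1, 0.7, 0.2]

belief = [1.0 / 3.0] * 3
belief = predict(belief)
likelihoods = [histogram_intersection(query_histogram, ref) for ref in reference_histograms]
belief = update(belief, likelihoods)
best_place = max(range(len(belief)), key=lambda i: belief[i])
```

Repeating `predict` and `update` for every new image lets earlier observations damp isolated mismatches, which is the role the abstract assigns to filtering.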
Nguyen, Nhu Van. „Représentations visuelles de concepts textuels pour la recherche et l'annotation interactives d'images“. Phd thesis, Université de La Rochelle, 2011. http://tel.archives-ouvertes.fr/tel-00730707.
Inagaki, Yasuyoshi, Katsuhiko Toyama, Nobuo Kawaguchi, Shigeki Matsubara, Satoru Matsunaga, 康善 稲垣, 勝彦 外山, 信夫 河口, 茂樹 松原 and 悟. 松永. „Sync/Mail : 話し言葉の漸進的変換に基づく即時応答インタフェース“. 一般社団法人情報処理学会, 1998. http://hdl.handle.net/2237/15382.
Guillaumin, Matthieu. „Données multimodales pour l'analyse d'image“. Phd thesis, Grenoble, 2010. http://tel.archives-ouvertes.fr/tel-00522278/en/.
Guillaumin, Matthieu. „Données multimodales pour l'analyse d'image“. Phd thesis, Grenoble, 2010. http://www.theses.fr/2010GRENM048.
This dissertation delves into the use of textual metadata for image understanding. We seek to exploit this additional textual information as weak supervision to improve the learning of recognition models. There is a recent and growing interest in methods that exploit such data because they can potentially alleviate the need for manual annotation, which is a costly and time-consuming process. We focus on two types of visual data with associated textual information. First, we exploit news images that come with descriptive captions to address several face-related tasks, including face verification, which is the task of deciding whether two images depict the same individual, and face naming, the problem of associating faces in a data set with their correct names. Second, we consider data consisting of images with user tags. We explore models for automatically predicting tags for new images, i.e., image auto-annotation, which can also be used for keyword-based image search. We also study a multimodal semi-supervised learning scenario for image categorisation. In this setting, the tags are assumed to be present in both labelled and unlabelled training data, while they are absent from the test data. Our work builds on the observation that most of these tasks can be solved if perfectly adequate similarity measures are used. We therefore introduce novel approaches that involve metric learning, nearest neighbour models and graph-based methods to learn, from the visual and textual data, task-specific similarities. For faces, our similarities focus on the identities of the individuals while, for images, they address more general semantic visual concepts. Experimentally, our approaches achieve state-of-the-art results on several standard and challenging data sets. On both types of data, we clearly show that learning using additional textual information improves the performance of visual recognition systems.
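The Guillaumin abstract refers to nearest neighbour models for predicting tags of new images. The thesis learns task-specific similarities, which the abstract does not detail, so this sketch swaps in a plain Euclidean distance with exponential weighting as a stand-in; the feature vectors, tags and the choice of `k` are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# A training example: (visual feature vector, set of user tags).
Example = Tuple[List[float], Set[str]]


def predict_tags(query: List[float], training: List[Example], k: int = 2) -> Dict[str, float]:
    """Score each tag by the distance-weighted vote of the k nearest training images.

    Euclidean distance and exp(-d) weights are assumptions standing in
    for a learned metric.
    """
    neighbours = sorted(training, key=lambda item: math.dist(query, item[0]))[:k]
    weights = [math.exp(-math.dist(query, features)) for features, _ in neighbours]
    total = sum(weights)
    scores: Dict[str, float] = defaultdict(float)
    for (_, tags), weight in zip(neighbours, weights):
        for tag in tags:
            scores[tag] += weight / total
    return dict(scores)


training_set: List[Example] = [
    ([0.0, 0.0], {"beach", "sea"}),
    ([0.1, 0.0], {"sea", "boat"}),
    ([5.0, 5.0], {"city"}),
]
tag_scores = predict_tags([0.0, 0.1], training_set, k=2)
```

Tags shared by all close neighbours receive a score near 1, while tags seen only on the farther neighbour score lower; sorting `tag_scores` by value gives a ranked annotation, and the same scores could back keyword-based search.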
Slizovskaia, Olga. „Audio-visual deep learning methods for musical instrument classification and separation“. Doctoral thesis, Universitat Pompeu Fabra, 2020. http://hdl.handle.net/10803/669963.
In music perception, we usually receive complementary information through our visual and auditory systems. Moreover, visual perception plays an important role in our overall experience of a musical performance. This relationship between audio and vision has increased interest in machine learning methods capable of combining both modalities for automatic music analysis. This thesis focuses on two main problems: instrument classification and source separation in the context of music videos. For each problem, a multimodal method is developed using deep learning techniques, which allows us to learn an encoded representation for each modality. In addition, for the source separation problem, we propose two models conditioned on instrument labels and examine the influence of two extra sources of information on separation performance, comparing them against a conventional model. Another important aspect of this work is the exploration of different fusion models that allow a better multimodal integration of information sources from associated domains.
Karlsson, Kristina. „Semantic represenations of retrieved memory information depend on cue-modality“. Thesis, Stockholms universitet, Psykologiska institutionen, 2011. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-58817.
Poignant, Johann. „Identification non-supervisée de personnes dans les flux télévisés“. Phd thesis, Université de Grenoble, 2013. http://tel.archives-ouvertes.fr/tel-00958774.
Tran, Thi Quynh Nhi. „Robust and comprehensive joint image-text representations“. Thesis, Paris, CNAM, 2017. http://www.theses.fr/2017CNAM1096/document.
This thesis investigates the joint modeling of visual and textual content of multimedia documents to address cross-modal problems. Such tasks require the ability to match information across modalities. A common representation space, obtained, e.g., by Kernel Canonical Correlation Analysis, on which images and text can both be represented and directly compared, is a generally adopted solution. Nevertheless, such a joint space still suffers from several deficiencies that may hinder the performance of cross-modal tasks. An important contribution of this thesis is therefore to identify two major limitations of such a space. The first limitation concerns information that is poorly represented on the common space yet very significant for a retrieval task. The second limitation consists in a separation between modalities on the common space, which leads to coarse cross-modal matching. To deal with the first limitation concerning poorly represented data, we put forward a model which first identifies such information and then finds ways to combine it with data that is relatively well represented on the joint space. Evaluations on text illustration tasks show that by appropriately identifying and taking such information into account, the results of cross-modal retrieval can be strongly improved. The major work in this thesis aims to cope with the separation between modalities on the joint space to enhance the performance of cross-modal tasks. We propose two representation methods for bi-modal or uni-modal documents that aggregate information from both the visual and textual modalities projected on the joint space. Specifically, for uni-modal documents we suggest a completion process relying on an auxiliary dataset to find the corresponding information in the absent modality and then use such information to build a final bi-modal representation for a uni-modal document. Evaluations show that our approaches achieve state-of-the-art results on several standard and challenging datasets for cross-modal retrieval or bi-modal and cross-modal classification.
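The Tran abstract relies on a common representation space in which images and text can be compared directly. The sketch below shows only that final comparison step, assuming the projections into the joint space have already been computed (for instance by Kernel Canonical Correlation Analysis, which is not implemented here). The vectors, file names and the use of cosine similarity are illustrative assumptions.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors in the joint space (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec: List[float], candidates: Dict[str, List[float]], k: int = 1) -> List[str]:
    """Return the k candidates of the other modality closest to the query."""
    ranked = sorted(candidates,
                    key=lambda name: cosine(query_vec, candidates[name]),
                    reverse=True)
    return ranked[:k]


# Hypothetical image projections in a 3-dimensional joint space.
image_vectors = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.2],
    "city.jpg": [0.0, 0.2, 0.9],
}
# Hypothetical projection of a text query into the same space.
text_query_vector = [0.8, 0.2, 0.1]

best_images = retrieve(text_query_vector, image_vectors, k=1)
```

The two limitations named in the abstract both show up at this step: information lost in the projection cannot influence `cosine`, and a systematic offset between text and image projections makes cross-modal scores coarser than within-modality ones.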
Bursuc, Andrei. „Indexation et recherche de contenus par objet visuel“. Phd thesis, Ecole Nationale Supérieure des Mines de Paris, 2012. http://pastel.archives-ouvertes.fr/pastel-00873966.
Pinho, Eduardo Miguel Coutinho Gomes de. „Multimodal information retrieval in medical imaging archives“. Doctoral thesis, 2019. http://hdl.handle.net/10773/29206.
The proliferation of digital medical imaging modalities in hospitals, clinics and other diagnostic centres has led to the creation of enormous data repositories, which are often not fully explored. Moreover, recent years clearly show a trend towards growth in data production. It therefore becomes important for the wider community of radiologists, scientists and engineers to study new ways of indexing, processing and retrieving medical images. Content-based image retrieval, which encompasses a wide variety of methods, enables the exploration of the visual information in a medical imaging archive, which benefits physicians and researchers. However, the integration of these solutions into workflows is still rare, and the effectiveness of the most recent medical image retrieval systems can be improved. This thesis proposes solutions and methods for multimodal information retrieval in the context of medical imaging repositories. The main contributions are the following: a search engine for medical imaging studies supporting multimodal queries in an extensible archive; a framework for the automatic annotation of images; and an evaluation and proposal of representation learning techniques for automatic concept detection in medical images, showing greater potential than the visual feature extraction techniques formerly pertinent in similar tasks. These contributions seek to reduce the technical and scientific difficulties in developing and adopting modern multimodal medical image retrieval systems, so that they finally become part of the typical tools of health professionals, teachers and researchers.
Doctoral Programme in Informatics
Duan, Lingyu. „Multimodal mid-level representations for semantic analysis of broadcast video“. Thesis, 2008. http://hdl.handle.net/1959.13/25819.
This thesis investigates the problem of seeking multimodal mid-level representations for semantic analysis of broadcast video. The problem is of interest as humans tend to use high-level semantic concepts when querying and browsing ever increasing multimedia databases, yet generic low-level content metadata available from automated processing deals only with representing perceived content, but not its semantics. Multimodal mid-level representations refer to intermediate representations of multimedia signals that make various kinds of knowledge explicit and that expose various kinds of constraints within the context and knowledge assumed by the analysis system. Semantic multimedia analysis tries to establish the links from the feature descriptors and the syntactic elements to the domain semantics. The goal of this thesis is to devise a mid-level representation framework for detecting semantics from broadcast video, using supervised and data-driven approaches to represent domain knowledge in a manner that facilitates inferencing, i.e., answering the questions asked by higher-level analysis. In our framework, we attempt to address three sub-problems: context-dependent feature extraction, semantic video shot classification, and integration of multimodal cues towards semantic analysis. We propose novel models for the representations of low-level multimedia features. We employ dominant modes in the feature space to characterize color and motion in a nonparametric manner. With the combined use of data-driven mode seeking and supervised learning, we are able to capture contextual information of broadcast video and yield semantically meaningful color and motion features. We present the novel concepts of semantic video shot classes towards an effective approach for reverse engineering of the broadcast video capturing and editing processes. Such concepts link the computational representations of low-level multimedia features with video shot size and the main subject within a shot in the broadcast video stream. The linking, subject to the domain constraints, is achieved by statistical learning. We develop solutions for detecting sports events and classifying commercial spots from broadcast video streams. This is realized by integrating multiple modalities, in particular the text-based external resources. The alignment across modalities is based on semantic video shot classes. With multimodal mid-level representations, we are able to automatically extract rich semantics from sports programs and commercial spots, with promising accuracies. These findings demonstrate the potential of our framework of constructing mid-level representations to narrow the semantic gap, and it has a broad outlook in adapting to new content domains.
Lu, Hung-Tsung, and 盧宏宗. „Semantic Retrieval of Personal Photos Using Multimodal Deep Autoencoder Fusing Visual and Speech Features“. Thesis, 2017. http://ndltd.ncl.edu.tw/handle/58fvxy.
Mourão, André Belchior. „Towards an Architecture for Efficient Distributed Search of Multimodal Information“. Doctoral thesis, 2018. http://hdl.handle.net/10362/38850.
Carvalho, José Ricardo de Abreu. „Pesquisa multimodal de imagens em dispositivos móveis“. Master's thesis, 2021. http://hdl.handle.net/10400.13/3984.
Despite the evolution of the field of reverse image search, with algorithms becoming more robust and effective, there is still interest in improving search techniques and the user experience when searching for the images the user has in mind. The main goal of this work was to develop an application for mobile devices (smartphones) that allows the user to find images through multimodal inputs. Thus, this dissertation, in addition to proposing image search through different input modes (keywords, drawing/sketching, and camera or device images), proposes that users can create an image themselves by drawing or by editing an existing image, receiving feedback at the time of each change or interaction. Throughout the search experience, users can take the images found (those they consider relevant) and refine the search by editing them, moving closer to the image they have in mind. The implementation of this proposal was based on the Google Cloud Vision API, responsible for obtaining the results, and on the ATsketchkit framework, which allowed the creation of drawings on Apple's iOS system. Tests were carried out with a set of users with different levels of experience in image search and different drawing abilities, making it possible to assess their preferences among the input methods, their satisfaction with the images retrieved, and the usability of the prototype.