Accedi

Bibliografie tematiche / Multi-Modal representations / Tesi

Tesi sul tema "Multi-Modal representations"

Segui questo link per vedere altri tipi di pubblicazioni sul tema: Multi-Modal representations.

Autore: Grafiati

Pubblicato: 7 luglio 2024

Ultima modifica: 7 luglio 2024

Cita una fonte nei formati APA, MLA, Chicago, Harvard e in molti altri stili

Scegli il tipo di fonte:

Vedi i top-20 saggi (tesi di laurea o di dottorato) per l'attività di ricerca sul tema "Multi-Modal representations".

Accanto a ogni fonte nell'elenco di riferimenti c'è un pulsante "Aggiungi alla bibliografia". Premilo e genereremo automaticamente la citazione bibliografica dell'opera scelta nello stile citazionale di cui hai bisogno: APA, MLA, Harvard, Chicago, Vancouver ecc.

Puoi anche scaricare il testo completo della pubblicazione scientifica nel formato .pdf e leggere online l'abstract (il sommario) dell'opera se è presente nei metadati.

Vedi le tesi di molte aree scientifiche e compila una bibliografia corretta.

1

Gu, Jian. "Multi-modal Neural Representations for Semantic Code Search". Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-279101.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

Abstract (sommario):

In recent decades, various software systems have gradually become the basis of our society. Programmers search existing code snippets from time to time in their daily life. It would be beneficial and meaningful to have better solutions for the task of semantic code search, which is to find the most semantically relevant code snippets for a given query. Our approach is to introduce tree representations by multi-modal learning. The core idea is to enrich semantic information for code snippets by preparing data of different modalities, and meanwhile ignore syntactic information. We design one novel tree structure named Simplified Semantic Tree and then extract RootPath representations from that. We utilize RootPath representation to complement the conventional sequential representation, namely the token sequence of the code snippet. Our multi-modal model receives code-query pair as input and computes similarity score as output, following the pseudo-siamese architecture. For each pair, besides the ready-made code sequence and query sequence, we extra one extra tree sequence from Simplified Semantic Tree. There are three encoders in our model, and they respectively encode these three sequences as vectors of the same length. Then we combine the code vector with the tree vector for one joint vector, which is still of the same length, as the multi-modal representation for the code snippet. We introduce triplet loss to ensure vectors of code and query in the same pair be close at the shared vector space. We conduct experiments in one large-scale multi-language corpus, with comparisons of strong baseline models by specified performance metrics. Among baseline models, the simplest Neural Bag-of-Words model is with the most satisfying performance. It indicates that syntactic information is likely to distract complex models from critical semantic information. Results show that our multi-modal representation approach performs better because it surpasses baseline models by far in most cases. The key to our multi-modal model is that it is totally about semantic information, and it learns from data of multiple modalities.
Under de senaste decennierna har olika programvarusystem gradvis blivit basen i vårt samhälle. Programmerare söker i befintliga kodavsnitt från tid till annan i deras dagliga liv. Det skulle vara fördelaktigt och meningsfullt att ha bättre lösningar för uppgiften att semantisk kodsökning, vilket är att hitta de mest semantiskt relevanta kodavsnitten för en given fråga. Vår metod är att introducera trädrepresentationer genom multimodal inlärning. Grundidén är att berika semantisk information för kodavsnitt genom att förbereda data med olika modaliteter och samtidigt ignorera syntaktisk information. Vi designar en ny trädstruktur med namnet Simplified Semantic Tree och extraherar sedan RootPath-representationer från det. Vi använder RootPath-representation för att komplettera den konventionella sekvensrepresentationen, nämligen kodsekvensens symbolsekvens. Vår multimodala modell får kodfrågeställningar som inmatning och beräknar likhetspoäng som utgång efter den pseudo-siamesiska arkitekturen. För varje par, förutom den färdiga kodsekvensen och frågesekvensen, extrager vi en extra trädsekvens från Simplified Semantic Tree. Det finns tre kodare i vår modell, och de kodar respektive tre sekvenser som vektorer av samma längd. Sedan kombinerar vi kodvektorn med trädvektorn för en gemensam vektor, som fortfarande är av samma längd som den multimodala representationen för kodavsnittet. Vi introducerar tripletförlust för att säkerställa att vektorer av kod och fråga i samma par är nära det delade vektorn. Vi genomför experiment i ett storskaligt flerspråkigt korpus, med jämförelser av starka baslinjemodeller med specificerade prestandametriker. Bland baslinjemodellerna är den enklaste Neural Bag-of-Words-modellen med den mest tillfredsställande prestanda. Det indikerar att syntaktisk information sannolikt kommer att distrahera komplexa modeller från kritisk semantisk information. Resultaten visar att vår multimodala representationsmetod fungerar bättre eftersom den överträffar basmodellerna i de flesta fall. Nyckeln till vår multimodala modell är att den helt handlar om semantisk information, och den lär sig av data om flera modaliteter.

2

Liu, Yahui. "Exploring Multi-Domain and Multi-Modal Representations for Unsupervised Image-to-Image Translation". Doctoral thesis, Università degli studi di Trento, 2022. http://hdl.handle.net/11572/342634.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

Abstract (sommario):

Unsupervised image-to-image translation (UNIT) is a challenging task in the image manipulation field, where input images in a visual domain are mapped into another domain with desired visual patterns (also called styles). An ideal direction in this field is to build a model that can map an input image in a domain to multiple target domains and generate diverse outputs in each target domain, which is termed as multi-domain and multi-modal unsupervised image-to-image translation (MMUIT). Recent studies have shown remarkable results in UNIT but they suffer from four main limitations: (1) State-of-the-art UNIT methods are either built from several two-domain mappings that are required to be learned independently or they generate low-diversity results, a phenomenon also known as model collapse. (2) Most of the manipulation is with the assistance of visual maps or digital labels without exploring natural languages, which could be more scalable and flexible in practice. (3) In an MMUIT system, the style latent space is usually disentangled between every two image domains. While interpolations within domains are smooth, interpolations between two different domains often result in unrealistic images with artifacts when interpolating between two randomly sampled style representations from two different domains. Improving the smoothness of the style latent space can lead to gradual interpolations between any two style latent representations even between any two domains. (4) It is expensive to train MMUIT models from scratch at high resolution. Interpreting the latent space of pre-trained unconditional GANs can achieve pretty good image translations, especially high-quality synthesized images (e.g., 1024x1024 resolution). However, few works explore building an MMUIT system with such pre-trained GANs. In this thesis, we focus on these vital issues and propose several techniques for building better MMUIT systems. First, we base on the content-style disentangled framework and propose to fit the style latent space with Gaussian Mixture Models (GMMs). It allows a well-trained network using a shared disentangled style latent space to model multi-domain translations. Meanwhile, we can randomly sample different style representations from a Gaussian component or use a reference image for style transfer. Second, we show how the GMM-modeled latent style space can be combined with a language model (e.g., a simple LSTM network) to manipulate multiple styles by using textual commands. Then, we not only propose easy-to-use constraints to improve the smoothness of the style latent space in MMUIT models, but also design a novel metric to quantitatively evaluate the smoothness of the style latent space. Finally, we build a new model to use pretrained unconditional GANs to do MMUIT tasks.

3

Song, Pingfan. "Multi-modal image processing via joint sparse representations induced by coupled dictionaries". Thesis, University College London (University of London), 2018. http://discovery.ucl.ac.uk/10061963/.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

Abstract (sommario):

Real-world image processing tasks often involve various image modalities captured by different sensors. However, given that different sensors exhibit different characteristics, such multi-modal images are typically acquired with different resolutions, different blurring kernels, or even noise levels. In view of the fact that images associated with the same scene share some attributes, such as edges, textures or other primitives, it is natural to ask whether one can improve standard image processing tasks by leveraging the availability of multimodal images. This thesis introduces a sparsity-based machine learning framework along with algorithms to address such multimodal image processing problems. In particular, the thesis introduces a new coupled dictionary learning framework that is able to capture complex relationships and disparities between different image types in a learned sparse-representation domain in lieu of the original image domain. The thesis then introduces representative applications of this framework in key multimodal image processing problems. First, the thesis considers multi-modal image super-resolution problems where one wishes to super-resolve a certain low-resolution image modality given the availability of another high-resolution image modality of the same scene. It develops both a coupled dictionary learning algorithm and a coupled super-resolution algorithm to address this task arising in [1,2]. Second, the thesis considers multi-modal image denoising problems where one wishes to denoise a certain noisy image modality given the availability of another less noisy image modality of the same scene. The thesis develops an online coupled dictionary learning algorithm and a coupled sparse denoising algorithm to address this task arising in [3,4]. Finally, the thesis considers emerging medical imaging applications where one wishes to perform multi-contrast MRI reconstruction, including guided reconstruction and joint reconstruction. We propose an iterative framework to implement coupled dictionary learning, coupled sparse denoising and k-space consistency to address this task arising in [5,6]. The proposed framework is capable of capturing complex dependencies, including both similarities and disparities among multi-modal data. This enables transferring appropriate guidance information to the target image without introducing noticeable texture-copying artifacts. Practical experiments on multi-modal images also demonstrate that the proposed framework contributes to significant performance improvement in various image processing tasks, such as multi-modal image super-resolution, denoising and multi-contrast MRI reconstruction.

4

Suthana, Nanthia Ananda. "Investigating human medical temporal representations of episodic information a multi-modal approach /". Diss., Restricted to subscribing institutions, 2009. http://proquest.umi.com/pqdweb?did=1905692921&sid=1&Fmt=2&clientId=1564&RQT=309&VName=PQD.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

5

Tran, Thi Quynh Nhi. "Robust and comprehensive joint image-text representations". Thesis, Paris, CNAM, 2017. http://www.theses.fr/2017CNAM1096/document.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

Abstract (sommario):

La présente thèse étudie la modélisation conjointe des contenus visuels et textuels extraits à partir des documents multimédias pour résoudre les problèmes intermodaux. Ces tâches exigent la capacité de ``traduire'' l'information d'une modalité vers une autre. Un espace de représentation commun, par exemple obtenu par l'Analyse Canonique des Corrélation ou son extension kernelisée est une solution généralement adoptée. Sur cet espace, images et texte peuvent être représentés par des vecteurs de même type sur lesquels la comparaison intermodale peut se faire directement.Néanmoins, un tel espace commun souffre de plusieurs déficiences qui peuvent diminuer la performance des ces tâches. Le premier défaut concerne des informations qui sont mal représentées sur cet espace pourtant très importantes dans le contexte de la recherche intermodale. Le deuxième défaut porte sur la séparation entre les modalités sur l'espace commun, ce qui conduit à une limite de qualité de traduction entre modalités. Pour faire face au premier défaut concernant les données mal représentées, nous avons proposé un modèle qui identifie tout d'abord ces informations et puis les combine avec des données relativement bien représentées sur l'espace commun. Les évaluations sur la tâche d'illustration de texte montrent que la prise en compte de ces information fortement améliore les résultats de la recherche intermodale. La contribution majeure de la thèse se concentre sur la séparation entre les modalités sur l'espace commun pour améliorer la performance des tâches intermodales. Nous proposons deux méthodes de représentation pour les documents bi-modaux ou uni-modaux qui regroupent à la fois des informations visuelles et textuelles projetées sur l'espace commun. Pour les documents uni-modaux, nous suggérons un processus de complétion basé sur un ensemble de données auxiliaires pour trouver les informations correspondantes dans la modalité absente. Ces informations complémentaires sont ensuite utilisées pour construire une représentation bi-modale finale pour un document uni-modal. Nos approches permettent d'obtenir des résultats de l'état de l'art pour la recherche intermodale ou la classification bi-modale et intermodale
This thesis investigates the joint modeling of visual and textual content of multimedia documents to address cross-modal problems. Such tasks require the ability to match information across modalities. A common representation space, obtained by eg Kernel Canonical Correlation Analysis, on which images and text can be both represented and directly compared is a generally adopted solution.Nevertheless, such a joint space still suffers from several deficiencies that may hinder the performance of cross-modal tasks. An important contribution of this thesis is therefore to identify two major limitations of such a space. The first limitation concerns information that is poorly represented on the common space yet very significant for a retrieval task. The second limitation consists in a separation between modalities on the common space, which leads to coarse cross-modal matching. To deal with the first limitation concerning poorly-represented data, we put forward a model which first identifies such information and then finds ways to combine it with data that is relatively well-represented on the joint space. Evaluations on emph{text illustration} tasks show that by appropriately identifying and taking such information into account, the results of cross-modal retrieval can be strongly improved. The major work in this thesis aims to cope with the separation between modalities on the joint space to enhance the performance of cross-modal tasks.We propose two representation methods for bi-modal or uni-modal documents that aggregate information from both the visual and textual modalities projected on the joint space. Specifically, for uni-modal documents we suggest a completion process relying on an auxiliary dataset to find the corresponding information in the absent modality and then use such information to build a final bi-modal representation for a uni-modal document. Evaluations show that our approaches achieve state-of-the-art results on several standard and challenging datasets for cross-modal retrieval or bi-modal and cross-modal classification

6

Tran, Thi Quynh Nhi. "Robust and comprehensive joint image-text representations". Electronic Thesis or Diss., Paris, CNAM, 2017. http://www.theses.fr/2017CNAM1096.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

Abstract (sommario):

La présente thèse étudie la modélisation conjointe des contenus visuels et textuels extraits à partir des documents multimédias pour résoudre les problèmes intermodaux. Ces tâches exigent la capacité de ``traduire'' l'information d'une modalité vers une autre. Un espace de représentation commun, par exemple obtenu par l'Analyse Canonique des Corrélation ou son extension kernelisée est une solution généralement adoptée. Sur cet espace, images et texte peuvent être représentés par des vecteurs de même type sur lesquels la comparaison intermodale peut se faire directement.Néanmoins, un tel espace commun souffre de plusieurs déficiences qui peuvent diminuer la performance des ces tâches. Le premier défaut concerne des informations qui sont mal représentées sur cet espace pourtant très importantes dans le contexte de la recherche intermodale. Le deuxième défaut porte sur la séparation entre les modalités sur l'espace commun, ce qui conduit à une limite de qualité de traduction entre modalités. Pour faire face au premier défaut concernant les données mal représentées, nous avons proposé un modèle qui identifie tout d'abord ces informations et puis les combine avec des données relativement bien représentées sur l'espace commun. Les évaluations sur la tâche d'illustration de texte montrent que la prise en compte de ces information fortement améliore les résultats de la recherche intermodale. La contribution majeure de la thèse se concentre sur la séparation entre les modalités sur l'espace commun pour améliorer la performance des tâches intermodales. Nous proposons deux méthodes de représentation pour les documents bi-modaux ou uni-modaux qui regroupent à la fois des informations visuelles et textuelles projetées sur l'espace commun. Pour les documents uni-modaux, nous suggérons un processus de complétion basé sur un ensemble de données auxiliaires pour trouver les informations correspondantes dans la modalité absente. Ces informations complémentaires sont ensuite utilisées pour construire une représentation bi-modale finale pour un document uni-modal. Nos approches permettent d'obtenir des résultats de l'état de l'art pour la recherche intermodale ou la classification bi-modale et intermodale
This thesis investigates the joint modeling of visual and textual content of multimedia documents to address cross-modal problems. Such tasks require the ability to match information across modalities. A common representation space, obtained by eg Kernel Canonical Correlation Analysis, on which images and text can be both represented and directly compared is a generally adopted solution.Nevertheless, such a joint space still suffers from several deficiencies that may hinder the performance of cross-modal tasks. An important contribution of this thesis is therefore to identify two major limitations of such a space. The first limitation concerns information that is poorly represented on the common space yet very significant for a retrieval task. The second limitation consists in a separation between modalities on the common space, which leads to coarse cross-modal matching. To deal with the first limitation concerning poorly-represented data, we put forward a model which first identifies such information and then finds ways to combine it with data that is relatively well-represented on the joint space. Evaluations on emph{text illustration} tasks show that by appropriately identifying and taking such information into account, the results of cross-modal retrieval can be strongly improved. The major work in this thesis aims to cope with the separation between modalities on the joint space to enhance the performance of cross-modal tasks.We propose two representation methods for bi-modal or uni-modal documents that aggregate information from both the visual and textual modalities projected on the joint space. Specifically, for uni-modal documents we suggest a completion process relying on an auxiliary dataset to find the corresponding information in the absent modality and then use such information to build a final bi-modal representation for a uni-modal document. Evaluations show that our approaches achieve state-of-the-art results on several standard and challenging datasets for cross-modal retrieval or bi-modal and cross-modal classification

7

Ben-Younes, Hedi. "Multi-modal representation learning towards visual reasoning". Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS173.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

Abstract (sommario):

La quantité d'images présentes sur internet augmente considérablement, et il est nécessaire de développer des techniques permettant le traitement automatique de ces contenus. Alors que les méthodes de reconnaissance visuelle sont de plus en plus évoluées, la communauté scientifique s'intéresse désormais à des systèmes aux capacités de raisonnement plus poussées. Dans cette thèse, nous nous intéressons au Visual Question Answering (VQA), qui consiste en la conception de systèmes capables de répondre à une question portant sur une image. Classiquement, ces architectures sont conçues comme des systèmes d'apprentissage automatique auxquels on fournit des images, des questions et leur réponse. Ce problème difficile est habituellement abordé par des techniques d'apprentissage profond. Dans la première partie de cette thèse, nous développons des stratégies de fusion multimodales permettant de modéliser des interactions entre les représentations d'image et de question. Nous explorons des techniques de fusion bilinéaire, et assurons l'expressivité et la simplicité des modèles en utilisant des techniques de factorisation tensorielle. Dans la seconde partie, on s'intéresse au raisonnement visuel qui encapsule ces fusions. Après avoir présenté les schémas classiques d'attention visuelle, nous proposons une architecture plus avancée qui considère les objets ainsi que leurs relations mutuelles. Tous les modèles sont expérimentalement évalués sur des jeux de données standards et obtiennent des résultats compétitifs avec ceux de la littérature
The quantity of images that populate the Internet is dramatically increasing. It becomes of critical importance to develop the technology for a precise and automatic understanding of visual contents. As image recognition systems are becoming more and more relevant, researchers in artificial intelligence now seek for the next generation vision systems that can perform high-level scene understanding. In this thesis, we are interested in Visual Question Answering (VQA), which consists in building models that answer any natural language question about any image. Because of its nature and complexity, VQA is often considered as a proxy for visual reasoning. Classically, VQA architectures are designed as trainable systems that are provided with images, questions about them and their answers. To tackle this problem, typical approaches involve modern Deep Learning (DL) techniques. In the first part, we focus on developping multi-modal fusion strategies to model the interactions between image and question representations. More specifically, we explore bilinear fusion models and exploit concepts from tensor analysis to provide tractable and expressive factorizations of parameters. These fusion mechanisms are studied under the widely used visual attention framework: the answer to the question is provided by focusing only on the relevant image regions. In the last part, we move away from the attention mechanism and build a more advanced scene understanding architecture where we consider objects and their spatial and semantic relations. All models are thoroughly experimentally evaluated on standard datasets and the results are competitive with the literature

8

Li, Lin. "Multi-scale spectral embedding representation registration (MSERg) for multi-modal imaging registration". Case Western Reserve University School of Graduate Studies / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=case1467902012.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

9

Gay, Joanna. "Structural representation models for multi-modal image registration in biomedical applications". Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-410820.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

Abstract (sommario):

In clinical applications it is often beneficial to use multiple imaging technologies to obtain information about different biomedical aspects of the subject under investigation, and to make best use of such sets of images they need to first be registered or aligned. Registration of multi-modal images is a challenging task and is currently the topic of much research, with new methods being published frequently. Structural representation models extract underlying features such as edges from images, distilling them into a common format that can be easily compared across different image modalities. This study compares the performance of two recent structural representation models on the task of aligning multi-modal biomedical images, specifically Second Harmonic Generation and Two Photon Excitation Fluorescence Microscopy images collected from skin samples. Performance is also evaluated on Brightfield Microscopy images. The two models evaluated here are PCANet-based Structural Representations (PSR, Zhu et al. (2018)) and Discriminative Local Derivative Patterns (dLDP, Jiang et al. (2017)). Mutual Information is used to provide a baseline for comparison. Although dLDP in particular gave promising results, worthy of further investigation, neither method outperformed the classic Mutual Information approach, as demonstrated in a series of experiments to register these particularly diverse modalities.

10

Aissa, Wafa. "Réseaux de modules neuronaux pour un raisonnement visuel compositionnel". Electronic Thesis or Diss., Paris, HESAM, 2023. http://www.theses.fr/2023HESAC033.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

Abstract (sommario):

Cette thèse de doctorat porte sur le raisonnement visuel compositionnel. Lorsqu'on présente une paire image-question à un modèle de réseau de neurones, notre objectif est que le modèle réponde à la question en suivant une chaîne de raisonnement définie par un programme. Nous évaluons la capacité de raisonnement du modèle dans le cadre de la Question Réponse Visuelle (QRV). La QRV compositionnelle décompose les questions complexes en sous-problèmes modulaires plus simples. Ces sous-problèmes incluent des compétences de raisonnement telles que la détection d'objets et d'attributs, la détection de relations, les opérations logiques, le dénombrement et les comparaisons. Chaque sous-problème est attribué à un module différent. Cette approche décourage les raccourcis, exigeant une compréhension explicite du problème. Elle favorise également la transparence et l'explicabilité.Les réseaux de modules neuronaux (NMN) sont utilisés pour permettre un raisonnement compositionnel. Il sont basés sur un cadre de générateur-exécuteur, le générateur apprend la traduction de la question vers son programme de fonctions. L'exécuteur instancie un NMN où chaque fonction est attribuée à un module spécifique. Nous développons également un catalogue de modules neuronaux et définissons leurs fonctions et leurs structures. Les entraînements et les évaluations sont effectués sur l'ensemble de données GQA [3], qui comprend des questions, des programmes fonctionnels, des images et des réponses.L'une des principales contributions implique l'intégration de représentations pré-entraînées multi-modales dans la QRV modulaire. Cette intégration sert à initialiser le processus de raisonnement. Les expériences démontrent que les représentations multimodales surpassent les unimodales. Ceci permet de capturer des relations complexes intra-modales tout en facilitant l'alignement entre les différentes modalités, améliorant ainsi la précision globale de notre NMN.De plus, nous explorons différentes techniques d'entraînement pour améliorer le processus d'apprentissage et l'efficacité du coût de calcul. En plus d'optimiser les modules au sein de la chaîne de raisonnement pour produire collectivement des réponses précises, nous introduisons une approche d'apprentissage guidé pour optimiser les modules intermédiaires de la chaîne de raisonnement. Cela garantit que ces modules effectuent leurs sous-tâches de raisonnement spécifiques sans prendre de raccourcis ou compromettre l'intégrité du processus de raisonnement. L'une des techniques proposées s'inspire de la méthode d'apprentissage guidé couramment utilisée dans les modèles séquentiels. Des analyses comparatives démontrent les avantages de notre approche pour les NMN, comme détaillé dans notre article [1].Nous introduisons également une nouvelle stratégie d'apprentissage par Curriculum (CL) adaptée aux NMN pour réorganiser les exemples d'entraînement et définir une stratégie d'apprentissage progressif. Nous commençons par apprendre des programmes plus simples et augmentons progressivement la complexité des programmes d'entraînement. Nous utilisons plusieurs critères de difficulté pour définir l'approche du CL. Nos résultats montrent qu'en sélectionnant la méthode de CL appropriée, nous pouvons réduire considérablement le coût de l'entraînement et la quantité de données d'entraînement requise, avec un impact limité sur la précision finale de la QRV. Cette contribution importante constitue le cœur de notre article [2].[1] W. Aissa, M. Ferecatu, and M. Crucianu. Curriculum learning for compositional visual reasoning. In Proceedings of VISIGRAPP 2023, Volume 5: VISAPP, 2023.[2] W. Aissa, M. Ferecatu, and M. Crucianu. Multimodal representations for teacher-guidedcompositional visual reasoning. In Proceedings of ACIVS 2023. Springer International Publishing, 2023.[3] D. A. Hudson and C. D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. 2019
The context of this PhD thesis is compositional visual reasoning. When presented with an image and a question pair, our objective is to have neural networks models answer the question by following a reasoning chain defined by a program. We assess the model's reasoning ability through a Visual Question Answering (VQA) setup.Compositional VQA breaks down complex questions into modular easier sub-problems.These sub-problems include reasoning skills such as object and attribute detection, relation detection, logical operations, counting, and comparisons. Each sub-problem is assigned to a different module. This approach discourages shortcuts, demanding an explicit understanding of the problem. It also promotes transparency and explainability.Neural module networks (NMN) are used to enable compositional reasoning. The framework is based on a generator-executor framework, the generator learns the translation of the question to its function program. The executor instantiates a neural module network where each function is assigned to a specific module. We also design a neural modules catalog and define the function and the structure of each module. The training and evaluations are conducted using the pre-processed GQA dataset cite{gqa}, which includes natural language questions, functional programs representing the reasoning chain, images, and corresponding answers.The research contributions revolve around the establishment of an NMN framework for the VQA task.One primary contribution involves the integration of vision and language pre-trained (VLP) representations into modular VQA. This integration serves as a ``warm-start" mechanism for initializing the reasoning process.The experiments demonstrate that cross-modal vision and language representations outperform uni-modal ones. This utilization enables the capture of intricate relationships within each individual modality while also facilitating alignment between different modalities, consequently enhancing overall accuracy of our NMN.Moreover, we explore various training techniques to enhance the learning process and improve cost-efficiency. In addition to optimizing the modules within the reasoning chain to collaboratively produce accurate answers, we introduce a teacher-guidance approach to optimize the intermediate modules in the reasoning chain. This ensures that these modules perform their specific reasoning sub-tasks without taking shortcuts or compromising the reasoning process's integrity. We propose and implement several teacher-guidance techniques, one of which draws inspiration from the teacher-forcing method commonly used in sequential models. Comparative analyses demonstrate the advantages of our teacher-guidance approach for NMNs, as detailed in our paper [1].We also introduce a novel Curriculum Learning (CL) strategy tailored for NMNs to reorganize the training examples and define a start-small training strategy. We begin by learning simpler programs and progressively increase the complexity of the training programs. We use several difficulty criteria to define the CL approach. Our findings demonstrate that by selecting the appropriate CL method, we can significantly reduce the training cost and required training data, with only a limited impact on the final VQA accuracy. This significant contribution forms the core of our paper [2].[1] W. Aissa, M. Ferecatu, and M. Crucianu. Curriculum learning for compositional visual reasoning. In Proceedings of VISIGRAPP 2023, Volume 5: VISAPP, 2023.[2] W. Aissa, M. Ferecatu, and M. Crucianu. Multimodal representations for teacher-guidedcompositional visual reasoning. In Advanced Concepts for Intelligent Vision Systems, 21st International Conference (ACIVS 2023). Springer International Publishing, 2023.[3] D. A. Hudson and C. D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. 2019

11

Xu, Dan. "Exploring Multi-Modal and Structured Representation Learning for Visual Image and Video Understanding". Doctoral thesis, Università degli studi di Trento, 2018. https://hdl.handle.net/11572/367610.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

Abstract (sommario):

As the explosive growth of the visual data, it is particularly important to develop intelligent visual understanding techniques for dealing with a large amount of data. Many efforts have been made in recent years to build highly effective and large-scale visual processing algorithms and systems. One of the core aspects in the research line is how to learn robust representations to better describe the data. In this thesis we study the problem of visual image and video understanding and specifically, we address the problem via designing and implementing novel multi-modal and structured representation learning approaches, both of which are fundamental research hot-spots in machine learning. Multi-modal representation learning involves relating information from multiple input sources, and the structured representation learning works on exploring rich structural information hidden in the data for robust feature learning. We investigate both the shallow representation learning frameworks such as dictionary learning and the deep representation learning frameworks such as deep neural networks, and present different modules devised in our works, consisting of cross-paced representation learning, cross-modal feature learning and transferring, multi-scale structured prediction and fusion, multi-modal prediction and distillation. These techniques are further applied in various visual understanding topics, i.e. sketch-based-image retrieval (SBIR), video pedestrian detection, monocular depth estimation and scene parsing, showing superior performance.

12

Xu, Dan. "Exploring Multi-Modal and Structured Representation Learning for Visual Image and Video Understanding". Doctoral thesis, University of Trento, 2018. http://eprints-phd.biblio.unitn.it/2918/1/disclaimer.pdf.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

Abstract (sommario):

As the explosive growth of the visual data, it is particularly important to develop intelligent visual understanding techniques for dealing with a large amount of data. Many efforts have been made in recent years to build highly effective and large-scale visual processing algorithms and systems. One of the core aspects in the research line is how to learn robust representations to better describe the data. In this thesis we study the problem of visual image and video understanding and specifically, we address the problem via designing and implementing novel multi-modal and structured representation learning approaches, both of which are fundamental research hot-spots in machine learning. Multi-modal representation learning involves relating information from multiple input sources, and the structured representation learning works on exploring rich structural information hidden in the data for robust feature learning. We investigate both the shallow representation learning frameworks such as dictionary learning and the deep representation learning frameworks such as deep neural networks, and present different modules devised in our works, consisting of cross-paced representation learning, cross-modal feature learning and transferring, multi-scale structured prediction and fusion, multi-modal prediction and distillation. These techniques are further applied in various visual understanding topics, i.e. sketch-based-image retrieval (SBIR), video pedestrian detection, monocular depth estimation and scene parsing, showing superior performance.

13

Steiling, David. "Icon, representation and virtuality in reading the graphic narrative". [Tampa, Fla] : University of South Florida, 2006. http://purl.fcla.edu/usf/dc/et/SFE0001818.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

14

Siless, Viviana. "Multi-modal registration of T1 brain image and geometric descriptors of white matter tracts". Thesis, Paris 11, 2014. http://www.theses.fr/2014PA112147/document.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

Abstract (sommario):

Le recalage des images du cerveau vise à réduire la variabilité anatomique entre les differentes sujets, et à créer un espace commun pour l'analyse de groupe. Les approches multi-modales essaient de minimiser les variations de forme du cortex et des structures internes telles que des faisceaux de fibres nerveuses. Ces approches nécessitent une identification préalable de ces structures, ce qui s'avère une tâche difficile en l'absence d'un atlas complet de référence. Nous proposons une extension de l'algorithme de recalage difféomorphe des Démons pour recaler conjointement des images et des faisceaux de fibres. Dans cette thèse, nous analysons différentes représentations des faisceaux de fibres comme une séquence de points, un nuage de points, les courants et les mesures. Différentes distances sont analysées et étudiées dans l'algorithme de recalage. Pour simplifier la représentation de la matière blanche nous utilisons et étendons les algorithmes de classification existants. En étendant le recalage d'images afin d'ajouter des descripteurs de la géométrie des fibres nerveuses, nous espérons améliorer les futures analyses concernant les matières grise et blanche. Nous avons démontré l'efficacité de notre algorithme en recalant conjointement des images anatomiques pondérées en T1 et des faisceaux de fibres. Nous avons comparé nos résultats à des approches concurrentes, l'une multimodale s'appuyant sur l'anisotropie fractionnaire et la pondération T1, l'autre sur les tenseurs de diffusion, et obtenu de meilleures performances à l'aide de notre algorithme. Enfin, nous mettons en évidence sur des études de groupe en IRMf que notre méthodologie et notre implémentation apportent un gain en sensibilité de détection des activations cérébrales
Brain image registration aims at reducing anatomical variability across subjects to create a common space for group analysis. Multi-modal approaches intend to minimize cortex shape variations along with internal structures, such as fiber bundles. These approaches require prior identification of the structures, which remains a challenging task in the absence of a complete reference atlas. We propose an extension of the Diffeomorphic Demons image registration to jointly register images and fiber bundles. In this thesis we analyze differents representations of the fiber bundles such as ordered points, clouds of points, Currents and Measures. Different distances are analyzed and implemented into the registration algorithm. To simplify white matter representation we also analyze, use and extend existing clustering algorithms. By extending the image registration to include geometric fiber bundles descriptors we hope to improve future analyses regarding both, grey and white matter. We demonstrate the efficacy of our algorithm by registering simultaneously T1 images and fiber bundles and compare results with a multi-modal T1+Fractional Anisotropy (FA) and a tensor-based registration algorithms and obtain superior performance with our approach. We provide preliminary evidence that our implementation improves the sensitivity of activation detection in fMRI group studies

15

Soto-Iglesias, David. "Development and evaluation of mapping strategies for the integration and joint analysis of multi-modal data of the heart". Doctoral thesis, Universitat Pompeu Fabra, 2016. http://hdl.handle.net/10803/395191.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

Abstract (sommario):

The development of novel technologies is allowing a complete description of the heart structure and function, including geometrical, myocardial tissue viability and electrical activation information. The joint analysis of this information helps to improve clinical interventions such as radio-frequency ablation of cardiac arrhythmias. However, the acquired data and related indices need to be integrated onto a common reference space for its analysis. This integration is not straightforward due to the different characteristics of acquisition systems. The aim of this thesis was to develop and evaluate different mapping strategies for the integration and joint analysis of multi-modal data of the heart. A new integration methodology was developed and compared with state-of-the-art techniques in several applications with synthetic and clinical data, within the framework of three different clinical scenarios: a) integration of electrical with tissue viability; b) analysis of electrical activation data; and c) validation of myocardial tissue characterization with histological data.
El desarrollo de nuevas tecnologías permite una completa descripción de la estructura y funcionalidad cardíaca incluyendo la geometría la viabilidad del tejido y la información de activación eléctrica. Un análisis conjunto de esta información permite mejorar intervenciones clínicas como la ablación por radio frecuencia en arritmias. Sin embargo, los datos adquiridos deben ser integrados en un mismo sistema de referencia para su análisis. Esta integración no es trivial debido a las diferentes características de adquisición de datos. El objetivo de esta tesis es desarrollar y evaluar diferentes estrategias para la integración y el análisis de datos multimodales del corazón. La nueva metodología de integración ha sido comparada y evaluada con otras técnicas en datos sintéticos y clínicos. Se han evaluado en tres escenarios clínicos distintos: a) integración de datos eléctricos con viabilidad de tejido; b) análisis de activación eléctrica; y c) validación de la caracterización del tejido con datos histológicos.

16

Huang, Di. "Robust face recognition based on three dimensional data". Phd thesis, Ecole Centrale de Lyon, 2011. http://tel.archives-ouvertes.fr/tel-00693158.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

Abstract (sommario):

The face is one of the best biometrics for person identification and verification related applications, because it is natural, non-intrusive, and socially weIl accepted. Unfortunately, an human faces are similar to each other and hence offer low distinctiveness as compared with other biometrics, e.g., fingerprints and irises. Furthermore, when employing facial texture images, intra-class variations due to factors as diverse as illumination and pose changes are usually greater than inter-class ones, making 2D face recognition far from reliable in the real condition. Recently, 3D face data have been extensively investigated by the research community to deal with the unsolved issues in 2D face recognition, Le., illumination and pose changes. This Ph.D thesis is dedicated to robust face recognition based on three dimensional data, including only 3D shape based face recognition, textured 3D face recognition as well as asymmetric 3D-2D face recognition. In only 3D shape-based face recognition, since 3D face data, such as facial pointclouds and facial scans, are theoretically insensitive to lighting variations and generally allow easy pose correction using an ICP-based registration step, the key problem mainly lies in how to represent 3D facial surfaces accurately and achieve matching that is robust to facial expression changes. In this thesis, we design an effective and efficient approach in only 3D shape based face recognition. For facial description, we propose a novel geometric representation based on extended Local Binary Pattern (eLBP) depth maps, and it can comprehensively describe local geometry changes of 3D facial surfaces; while a 81FT -based local matching process further improved by facial component and configuration constraints is proposed to associate keypoints between corresponding facial representations of different facial scans belonging to the same subject. Evaluated on the FRGC v2.0 and Gavab databases, the proposed approach proves its effectiveness. Furthermore, due tq the use of local matching, it does not require registration for nearly frontal facial scans and only needs a coarse alignment for the ones with severe pose variations, in contrast to most of the related tasks that are based on a time-consuming fine registration step. Considering that most of the current 3D imaging systems deliver 3D face models along with their aligned texture counterpart, a major trend in the literature is to adopt both the 3D shape and 2D texture based modalities, arguing that the joint use of both clues can generally provides more accurate and robust performance than utilizing only either of the single modality. Two important factors in this issue are facial representation on both types of data as well as result fusion. In this thesis, we propose a biological vision-based facial representation, named Oriented Gradient Maps (OGMs), which can be applied to both facial range and texture images. The OGMs simulate the response of complex neurons to gradient information within a given neighborhood and have properties of being highly distinctive and robust to affine illumination and geometric transformations. The previously proposed matching process is then adopted to calculate similarity measurements between probe and gallery faces. Because the biological vision-based facial representation produces an OGM for each quantized orientation of facial range and texture images, we finally use a score level fusion strategy that optimizes weights by a genetic algorithm in a learning pro cess. The experimental results achieved on the FRGC v2.0 and 3DTEC datasets display the effectiveness of the proposed biological vision-based facial description and the optimized weighted sum fusion. [...]

17

Po, Ming Jack. "Multi-scale Representations for Classification of Protein Crystal Images and Multi-Modal Registration of the Lung". Thesis, 2015. https://doi.org/10.7916/D87M06MZ.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

Abstract (sommario):

In recent years, multi-resolution techniques have become increasingly popular in the image processing community. New techniques have been developed with applications ranging from edge detection, texture recognition, image registration, multi-resolution features for image classification and more. The central focus of this two-part thesis is the multi-resolution analysis of images. In the first part, we used multi-resolution approaches to help with the classification of a set of protein crystal images. In the second, similar approaches were used to help register a set of 3D image volumes that would otherwise be computationally prohibitive without leveraging multi-resolution techniques. Specifically, the first part of this work proposes a classification framework that is being developed in collaboration with NorthEast Structural Genomics Consoritum (NESG) to assist in the automated screening of protein crystal images. Several groups have previously proposed automated algorithms to expedite such analysis. However, none of the classifiers described in the literature are sufficiently accurate or fast enough to be practical in a structural genomics production pipeline. The second part of this work proposes a 3D image registration algorithm to register regions of emphysema as quantified by densitometry on lung CT with MR lung volumes. The ability to register quantitatively-determined regions of emphysema with perfusion MRI will allow for further exploration of the pathophysiology of Chronic Obstructive Pulmonary Disorder (COPD). The registration method involves the registration of CT volumes at different levels of inspiration (total lung capacity to functional residual capacity [FRC]) followed by another registration between FRC-CT and FRC-MR volume pairs.

18

Weiss, Martin. "Deep reinforcement learning for multi-modal embodied navigation". Thesis, 2020. http://hdl.handle.net/1866/25106.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

Abstract (sommario):

Ce travail se concentre sur une tâche de micro-navigation en plein air où le but est de naviguer vers une adresse de rue spécifiée en utilisant plusieurs modalités (par exemple, images, texte de scène et GPS). La tâche de micro-navigation extérieure s’avère etre un défi important pour de nombreuses personnes malvoyantes, ce que nous démontrons à travers des entretiens et des études de marché, et nous limitons notre définition des problèmes à leurs besoins. Nous expérimentons d’abord avec un monde en grille partiellement observable (Grid-Street et Grid City) contenant des maisons, des numéros de rue et des régions navigables. Ensuite, nous introduisons le Environnement de Trottoir pour la Navigation Visuelle (ETNV), qui contient des images panoramiques avec des boîtes englobantes pour les numéros de maison, les portes et les panneaux de nom de rue, et des formulations pour plusieurs tâches de navigation. Dans SEVN, nous formons un modèle de politique pour fusionner des observations multimodales sous la forme d’images à résolution variable, de texte visible et de données GPS simulées afin de naviguer vers une porte d’objectif. Nous entraînons ce modèle en utilisant l’algorithme d’apprentissage par renforcement, Proximal Policy Optimization (PPO). Nous espérons que cette thèse fournira une base pour d’autres recherches sur la création d’agents pouvant aider les membres de la communauté des gens malvoyantes à naviguer le monde.
This work focuses on an Outdoor Micro-Navigation (OMN) task in which the goal is to navigate to a specified street address using multiple modalities including images, scene-text, and GPS. This task is a significant challenge to many Blind and Visually Impaired (BVI) people, which we demonstrate through interviews and market research. To investigate the feasibility of solving this task with Deep Reinforcement Learning (DRL), we first introduce two partially observable grid-worlds, Grid-Street and Grid City, containing houses, street numbers, and navigable regions. In these environments, we train an agent to find specific houses using local observations under a variety of training procedures. We parameterize our agent with a neural network and train using reinforcement learning methods. Next, we introduce the Sidewalk Environment for Visual Navigation (SEVN), which contains panoramic images with labels for house numbers, doors, and street name signs, and formulations for several navigation tasks. In SEVN, we train another neural network model using Proximal Policy Optimization (PPO) to fuse multi-modal observations in the form of variable resolution images, visible text, and simulated GPS data, and to use this representation to navigate to goal doors. Our best model used all available modalities and was able to navigate to over 100 goals with an 85% success rate. We found that models with access to only a subset of these modalities performed significantly worse, supporting the need for a multi-modal approach to the OMN task. We hope that this thesis provides a foundation for further research into the creation of agents to assist members of the BVI community to safely navigate.

19

Sylvain, Tristan. "Locality and compositionality in representation learning for complex visual tasks". Thesis, 2021. http://hdl.handle.net/1866/25594.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

Abstract (sommario):

L'utilisation d'architectures neuronales profondes associée à des innovations spécifiques telles que les méthodes adversarielles, l’entraînement préalable sur de grands ensembles de données et l'estimation de l'information mutuelle a permis, ces dernières années, de progresser rapidement dans de nombreuses tâches de vision par ordinateur complexes telles que la classification d'images de catégories préalablement inconnues (apprentissage zéro-coups), la génération de scènes ou la classification multimodale. Malgré ces progrès, il n’est pas certain que les méthodes actuelles d’apprentissage de représentations suffiront à atteindre une performance équivalente au niveau humain sur des tâches visuelles arbitraires et, de fait, cela pose des questions quant à la direction de la recherche future. Dans cette thèse, nous nous concentrerons sur deux aspects des représentations qui semblent nécessaires pour atteindre de bonnes performances en aval pour l'apprentissage des représentations : la localité et la compositionalité. La localité peut être comprise comme la capacité d'une représentation à retenir des informations locales. Ceci sera pertinent dans de nombreux cas, et bénéficiera particulièrement à la vision informatique, domaine dans lequel les images naturelles comportent intrinsèquement des informations locales, par exemple des parties pertinentes d’une image, des objets multiples présents dans une scène... D'autre part, une représentation compositionnelle peut être comprise comme une représentation qui résulte d'une combinaison de parties plus simples. Les réseaux neuronaux convolutionnels sont intrinsèquement compositionnels, et de nombreuses images complexes peuvent être considérées comme la composition de sous-composantes pertinentes : les objets et attributs individuels dans une scène, les attributs sémantiques dans l'apprentissage zéro-coups en sont deux exemples. Nous pensons que ces deux propriétés détiennent la clé pour concevoir de meilleures méthodes d'apprentissage de représentations. Dans cette thèse, nous présentons trois articles traitant de la localité et/ou de la compositionnalité, et de leur application à l'apprentissage de représentations pour des tâches visuelles complexes. Dans le premier article, nous introduisons des méthodes de mesure de la localité et de la compositionnalité pour les représentations d'images, et nous démontrons que les représentations locales et compositionnelles sont plus performantes dans l'apprentissage zéro-coups. Nous utilisons également ces deux notions comme base pour concevoir un nouvel algorithme d'apprentissage des représentations qui atteint des performances de pointe dans notre cadre expérimental, une variante de l'apprentissage "zéro-coups" plus difficile où les informations externes, par exemple un pré-entraînement sur d'autres ensembles de données d'images, ne sont pas autorisées. Dans le deuxième article, nous montrons qu'en encourageant un générateur à conserver des informations locales au niveau de l'objet, à l'aide d'un module dit de similarité de graphes de scène, nous pouvons améliorer les performances de génération de scènes. Ce modèle met également en évidence l'importance de la composition, car de nombreux composants fonctionnent individuellement sur chaque objet présent. Pour démontrer pleinement la portée de notre approche, nous effectuons une analyse détaillée et proposons un nouveau cadre pour évaluer les modèles de génération de scènes. Enfin, dans le troisième article, nous montrons qu'en encourageant une forte information mutuelle entre les représentations multimodales locales et globales des images médicales en 2D et 3D, nous pouvons améliorer la classification et la segmentation des images. Ce cadre général peut être appliqué à une grande variété de contextes et démontre les avantages non seulement de la localité, mais aussi de la compositionnalité, car les représentations multimodales sont combinées pour obtenir une représentation plus générale.
The use of deep neural architectures coupled with specific innovations such as adversarial methods, pre-training on large datasets and mutual information estimation has in recent years allowed rapid progress in many complex vision tasks such as zero-shot learning, scene generation, or multi-modal classification. Despite such progress, it is still not clear if current representation learning methods will be enough to attain human-level performance on arbitrary visual tasks, and if not, what direction should future research take. In this thesis, we will focus on two aspects of representations that seem necessary to achieve good downstream performance for representation learning: locality and compositionality. Locality can be understood as a representation's ability to retain local information. This will be relevant in many cases, and will specifically benefit computer vision where natural images inherently feature local information, i.e. relevant patches of an image, multiple objects present in a scene... On the other hand, a compositional representation can be understood as one that arises from a combination of simpler parts. Convolutional neural networks are inherently compositional, and many complex images can be seen as composition of relevant sub-components: individual objects and attributes in a scene, semantic attributes in zero-shot learning are two examples. We believe both properties hold the key to designing better representation learning methods. In this thesis, we present 3 articles dealing with locality and/or compositionality, and their application to representation learning for complex visual tasks. In the first article, we introduce ways of measuring locality and compositionality for image representations, and demonstrate that local and compositional representations perform better at zero-shot learning. We also use these two notions as the basis for designing class-matching deep info-max, a novel representation learning algorithm that achieves state-of-the-art performance on our proposed "Zero-shot from scratch" setting, a harder zero-shot setting where external information, e.g. pre-training on other image datasets is not allowed. In the second article, we show that by encouraging a generator to retain local object-level information, using a scene-graph similarity module, we can improve scene generation performance. This model also showcases the importance of compositionality as many components operate individually on each object present. To fully demonstrate the reach of our approach, we perform detailed analysis, and propose a new framework to evaluate scene generation models. Finally, in the third article, we show that encouraging high mutual information between local and global multi-modal representations of 2D and 3D medical images can lead to improvements in image classification and segmentation. This general framework can be applied to a wide variety of settings, and demonstrates the benefits of not only locality, but also of compositionality as multi-modal representations are combined to obtain a more general one.

20

"Representing and Reasoning about Dynamic Multi-Agent Domains: An Action Language Approach". Doctoral diss., 2018. http://hdl.handle.net/2286/R.I.49093.

Testo completo

Gli stili APA, Harvard, Vancouver, ISO e altri

Abstract (sommario):

abstract: Reasoning about actions forms the basis of many tasks such as prediction, planning, and diagnosis in a dynamic domain. Within the reasoning about actions community, a broad class of languages, called action languages, has been developed together with a methodology for their use in representing and reasoning about dynamic domains. With a few notable exceptions, the focus of these efforts has largely centered around single-agent systems. Agents rarely operate in a vacuum however, and almost in parallel, substantial work has been done within the dynamic epistemic logic community towards understanding how the actions of an agent may effect not just his own knowledge and/or beliefs, but those of his fellow agents as well. What is less understood by both communities is how to represent and reason about both the direct and indirect effects of both ontic and epistemic actions within a multi-agent setting. This dissertation presents ongoing research towards a framework for representing and reasoning about dynamic multi-agent domains involving both classes of actions. The contributions of this work are as follows: the formulation of a precise mathematical model of a dynamic multi-agent domain based on the notion of a transition diagram; the development of the multi-agent action languages mA+ and mAL based upon this model, as well as preliminary investigations of their properties and implementations via logic programming under the answer set semantics; precise formulations of the temporal projection, and planning problems within a multi-agent context; and an investigation of the application of the proposed approach to the representation of, and reasoning about, scenarios involving the modalities of knowledge and belief.
Dissertation/Thesis
Doctoral Dissertation Computer Science 2018