
Dissertations / Theses on the topic 'Visual representation learning'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Visual representation learning.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference for the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

Wang, Zhaoqing. "Self-supervised Visual Representation Learning." Thesis, The University of Sydney, 2022. https://hdl.handle.net/2123/29595.

Full text
Abstract:
In general, large-scale annotated data are essential for training deep neural networks to achieve better performance in visual feature learning for various computer vision applications. Unfortunately, such annotations are difficult to obtain, requiring a high cost in money and human effort. The dependence on large-scale annotated data has become a crucial bottleneck in developing advanced intelligent perception systems. Self-supervised visual representation learning, a subset of unsupervised learning, has gained popularity because of its ability to avoid the high cost of annotated data. A series of methods design various pretext tasks to learn general representations from unlabeled data and use these representations for different downstream tasks. Although previous methods have achieved great success, a label noise problem exists in these pretext tasks due to the lack of human-annotated supervision, which harms transfer performance. This thesis discusses two types of noise problem in self-supervised learning and designs corresponding methods to alleviate their negative effects and learn transferable representations. First, in pixel-level self-supervised learning, the pixel-level correspondences are easily corrupted by noise because of complicated context relationships (e.g., misleading pixels in the background). Second, two views of the same image share the foreground object and some background information; when optimizing the pretext task (e.g., contrastive learning), the model easily captures the foreground object and noisy background information simultaneously. Such background information can be harmful to transfer performance on downstream tasks, including image classification, object detection, and instance segmentation. To address the above-mentioned issues, our core idea is to leverage data regularities and prior knowledge. Experimental results demonstrate that the proposed methods effectively alleviate the negative effects of label noise in self-supervised learning and surpass a series of previous methods.
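Entry 1 builds on contrastive pretext tasks. For orientation, here is a minimal, generic sketch of the InfoNCE-style contrastive objective such methods optimize; it is standard contrastive learning, not the thesis's noise-robust variant, and the embeddings below are random stand-ins.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.2):
    """z1, z2: (N, D) embeddings of two augmented views of the same N images.
    View i of the first batch should match view i of the second batch."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature           # (N, N) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

# Stand-in embeddings from any backbone (e.g., a ResNet projection head).
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
loss = info_nce_loss(z1, z2)
```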
APA, Harvard, Vancouver, ISO, and other styles
2

Zhou, Bolei. "Interpretable representation learning for visual intelligence." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/117837.

Full text
Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 131-140).
Recent progress of deep neural networks in computer vision and machine learning has enabled transformative applications across robotics, healthcare, and security. However, despite the superior performance of the deep neural networks, it remains challenging to understand their inner workings and explain their output predictions. This thesis investigates several novel approaches for opening up the "black box" of neural networks used in visual recognition tasks and understanding their inner working mechanism. I first show that objects and other meaningful concepts emerge as a consequence of recognizing scenes. A network dissection approach is further introduced to automatically identify the internal units as the emergent concept detectors and quantify their interpretability. Then I describe an approach that can efficiently explain the output prediction for any given image. It sheds light on the decision-making process of the networks and why the predictions succeed or fail. Finally, I show some ongoing efforts toward learning efficient and interpretable deep representations for video event understanding and some future directions.
by Bolei Zhou.
Ph. D.
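The abstract above describes attributing a network's prediction to image regions. As an illustration of one such technique, here is a sketch of class activation mapping for a network that ends in global average pooling followed by a linear classifier; the exact methods in the thesis may differ, and the tensors are stand-ins.

```python
import torch

def class_activation_map(feature_maps, fc_weight, class_idx):
    """
    feature_maps: (C, H, W) activations of the last convolutional layer.
    fc_weight:    (num_classes, C) weights of the final linear classifier
                  applied after global average pooling.
    Returns an (H, W) map of spatial evidence for `class_idx`.
    """
    cam = torch.einsum('c,chw->hw', fc_weight[class_idx], feature_maps)
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-8)              # normalised to [0, 1]

# Stand-in activations and classifier weights.
cam = class_activation_map(torch.randn(512, 7, 7), torch.randn(1000, 512), class_idx=3)
```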
APA, Harvard, Vancouver, ISO, and other styles
3

Ben-Younes, Hedi. "Multi-modal representation learning towards visual reasoning." Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS173.

Full text
Abstract:
The quantity of images that populate the Internet is dramatically increasing, so it becomes critically important to develop technology for precise and automatic understanding of visual content. As image recognition systems become more and more capable, researchers in artificial intelligence now seek next-generation vision systems that can perform high-level scene understanding. In this thesis, we are interested in Visual Question Answering (VQA), which consists in building models that answer any natural language question about any image. Because of its nature and complexity, VQA is often considered a proxy for visual reasoning. Classically, VQA architectures are designed as trainable systems that are provided with images, questions about them, and their answers. To tackle this problem, typical approaches involve modern Deep Learning (DL) techniques. In the first part, we focus on developing multi-modal fusion strategies to model the interactions between image and question representations. More specifically, we explore bilinear fusion models and exploit concepts from tensor analysis to provide tractable and expressive factorizations of parameters. These fusion mechanisms are studied under the widely used visual attention framework: the answer to the question is provided by focusing only on the relevant image regions. In the last part, we move away from the attention mechanism and build a more advanced scene understanding architecture in which we consider objects and their spatial and semantic relations. All models are thoroughly evaluated on standard datasets and the results are competitive with the literature.
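As a point of reference for the bilinear fusions discussed above, here is a minimal low-rank bilinear fusion sketch; the thesis's tensor factorizations are more elaborate, and all dimensions below are illustrative.

```python
import torch
import torch.nn as nn

class LowRankBilinearFusion(nn.Module):
    """Fuses an image feature v and a question feature q through a rank-R
    factorisation of a bilinear form, instead of a full Dv x Dq x Dout tensor."""
    def __init__(self, dim_v, dim_q, rank, dim_out):
        super().__init__()
        self.proj_v = nn.Linear(dim_v, rank)
        self.proj_q = nn.Linear(dim_q, rank)
        self.proj_out = nn.Linear(rank, dim_out)

    def forward(self, v, q):
        joint = torch.tanh(self.proj_v(v)) * torch.tanh(self.proj_q(q))  # (N, rank)
        return self.proj_out(joint)                                       # answer scores

fusion = LowRankBilinearFusion(dim_v=2048, dim_q=1024, rank=512, dim_out=3000)
scores = fusion(torch.randn(4, 2048), torch.randn(4, 1024))
```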
APA, Harvard, Vancouver, ISO, and other styles
4

Sharif, Razavian Ali. "Convolutional Network Representation for Visual Recognition." Doctoral thesis, KTH, Robotik, perception och lärande, RPL, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-197919.

Full text
Abstract:
Image representation is a key component in visual recognition systems. In a visual recognition problem, the model should be able to learn and infer the relevant visual semantics in the image. It is therefore important for the model to represent the input image in a way that the semantics of interest can be inferred easily and reliably. This thesis is written as a compilation of publications and looks into the Convolutional Network (ConvNet) representation in visual recognition problems from an empirical perspective. A Convolutional Network is a special class of neural networks with a hierarchical structure in which every layer's output (except for the last layer) is the input to another layer. It has been shown that ConvNets are powerful tools for learning a generic representation of an image. In this body of work, we first show that this is indeed the case and that a ConvNet representation with a simple classifier can outperform highly tuned pipelines based on hand-crafted features. To be precise, we first train a ConvNet on a large dataset; then, for every image in another task with a small dataset, we feed the image forward through the ConvNet and take the ConvNet's activations at a certain layer as the image representation. Transferring the knowledge from the large dataset (source task) to the small dataset (target task) proves to be effective and outperforms baselines on a variety of visual recognition tasks. We also evaluate the presence of spatial visual semantics in the ConvNet representation and observe that a ConvNet retains significant spatial information despite never having been explicitly trained to preserve low-level semantics. We then investigate the factors that affect the transferability of these representations. We study various factors on a diverse set of visual recognition tasks and find a consistent correlation between the effect of those factors and the similarity of the target task to the source task. This intuition, alongside the experimental results, provides a guideline for improving the performance of visual recognition tasks using ConvNet features. Finally, we address the task of visual instance retrieval specifically as an example of how these simple intuitions can increase the performance of the target task massively.
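A minimal sketch of the transfer recipe described above: a pretrained backbone is used as a fixed feature extractor and a simple classifier is trained on top. The thesis used the ConvNets of its time; the backbone, images and labels below are stand-ins.

```python
import torch
import torchvision.models as models
from sklearn.linear_model import LogisticRegression

# Pretrained backbone used only as a fixed feature extractor.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
backbone.fc = torch.nn.Identity()          # expose the 2048-d pooled activations
backbone.eval()

@torch.no_grad()
def extract(images):                       # images: (N, 3, 224, 224), ImageNet-normalised
    return backbone(images).numpy()

# Stand-in for a small target-task dataset; a simple classifier is trained on top.
images = torch.randn(8, 3, 224, 224)
labels = [0, 1, 0, 1, 0, 1, 0, 1]
clf = LogisticRegression(max_iter=1000).fit(extract(images), labels)
```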

QC 20161209

APA, Harvard, Vancouver, ISO, and other styles
5

Yu, Mengyang. "Feature reduction and representation learning for visual applications." Thesis, Northumbria University, 2016. http://nrl.northumbria.ac.uk/30222/.

Full text
Abstract:
Computation over large-scale data spaces is involved in many active problems in computer vision and pattern recognition. In realistic applications, however, most existing algorithms are heavily restricted by the large number of features and tend to be inefficient or even infeasible. In this thesis, this problem is addressed in two ways: (1) projecting features onto a lower-dimensional subspace; (2) embedding features into a Hamming space. Firstly, a novel subspace learning algorithm called Local Feature Discriminant Projection (LFDP) is proposed for discriminant analysis of local features. LFDP efficiently seeks a subspace that improves the discriminability of local features for classification. Extensive experimental validation on three benchmark datasets demonstrates that the proposed LFDP outperforms other dimensionality reduction methods and achieves state-of-the-art performance for image classification. Secondly, for action recognition, a novel binary local representation for RGB-D video data fusion is presented. In this approach, a general local descriptor called Local Flux Feature (LFF) is obtained for both RGB and depth data by computing the local fluxes of the gradient fields of video data. The LFFs from the RGB and depth channels are then fused into a Hamming space via the Structure Preserving Projection (SPP), which preserves not only the pairwise feature structure but also a higher-level connection between samples and classes. Comprehensive experimental results show the superiority of both LFF and SPP. Thirdly, with respect to unsupervised learning, SPP is extended to the Binary Set Embedding (BSE) for cross-modal retrieval. BSE outputs meaningful hash codes for local features from the image domain and word vectors from the text domain. Extensive evaluation on two widely used image-text datasets demonstrates the superior performance of BSE compared with state-of-the-art cross-modal hashing methods. Finally, a generalized multiview spectral embedding algorithm called Kernelized Multiview Projection (KMP) is proposed to fuse multimedia data from multiple sources. Different features/views in the reproducing kernel Hilbert spaces are linearly fused together and then projected onto a low-dimensional subspace by KMP, whose performance is thoroughly evaluated on both image and video datasets against other multiview embedding methods.
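The second and third contributions embed features into a Hamming space. A toy illustration of binary embedding by projection and thresholding follows; the projection here is random purely for illustration, whereas SPP/BSE learn it.

```python
import numpy as np

def binary_embed(features, projection):
    """Maps real-valued features to binary codes by projecting and taking the sign."""
    return (features @ projection > 0).astype(np.uint8)

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 128))                 # e.g. local descriptors
proj = rng.standard_normal((128, 64))                   # random here; learned in the thesis
codes = binary_embed(feats, proj)                       # (100, 64) binary codes
hamming = int(np.count_nonzero(codes[0] != codes[1]))   # distance between two codes
```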
APA, Harvard, Vancouver, ISO, and other styles
6

Venkataramanan, Shashanka. "Metric learning for instance and category-level visual representation." Electronic Thesis or Diss., Université de Rennes (2023-....), 2024. http://www.theses.fr/2024URENS022.

Full text
Abstract:
The primary goal in computer vision is to enable machines to extract meaningful information from visual data, such as images and videos, and leverage this information to perform a wide range of tasks. To this end, substantial research has focused on developing deep learning models capable of encoding comprehensive and robust visual representations. A prominent strategy in this context involves pretraining models on large-scale datasets, such as ImageNet, to learn representations that can exhibit cross-task applicability and facilitate the successful handling of diverse downstream tasks with minimal effort. To facilitate learning on these large-scale datasets and encode good representations, complex data augmentation strategies have been used. However, these augmentations can be limited in their scope, either being hand-crafted and lacking diversity, or generating images that appear unnatural. Moreover, the focus of these augmentation techniques has primarily been on the ImageNet dataset and its downstream tasks, limiting their applicability to a broader range of computer vision problems. In this thesis, we aim to tackle these limitations by exploring different approaches to enhance the efficiency and effectiveness of representation learning. The common thread across the works presented is the use of interpolation-based techniques, such as mixup, to generate diverse and informative training examples beyond the original dataset. In the first work, we are motivated by the idea of deformation as a natural way of interpolating images rather than using a convex combination. We show that geometrically aligning the two images in the feature space allows for a more natural interpolation that retains the geometry of one image and the texture of the other, connecting it to style transfer. Drawing from these observations, we explore the combination of mixup and deep metric learning. We develop a generalized formulation that accommodates mixup in metric learning, leading to improved representations that explore areas of the embedding space beyond the training classes. Building on these insights, we revisit the original motivation of mixup and generate a larger number of interpolated examples beyond the mini-batch size by interpolating in the embedding space. This approach allows us to sample on the entire convex hull of the mini-batch, rather than just along linear segments between pairs of examples. Finally, we investigate the potential of using natural augmentations of objects from videos. We introduce a "Walking Tours" dataset of first-person egocentric videos, which captures a diverse range of objects and actions in natural scene transitions. We then propose a novel self-supervised pretraining method called DoRA, which detects and tracks objects in video frames, deriving multiple views from the tracks and using them in a self-supervised manner.
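A minimal, generic sketch of mixup applied in the embedding space, as mentioned above; the thesis's metric-learning formulation and convex-hull sampling go beyond this pairwise version, and all shapes are illustrative.

```python
import torch

def embedding_mixup(embeddings, labels_onehot, alpha=0.4):
    """Creates virtual examples by interpolating embeddings (and soft labels)
    between randomly paired items of a mini-batch."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(embeddings.size(0))
    mixed_emb = lam * embeddings + (1 - lam) * embeddings[perm]
    mixed_lbl = lam * labels_onehot + (1 - lam) * labels_onehot[perm]
    return mixed_emb, mixed_lbl

# Stand-in batch: 16 embeddings of dimension 128, 10 classes.
emb = torch.randn(16, 128)
lbl = torch.nn.functional.one_hot(torch.randint(0, 10, (16,)), 10).float()
virtual_emb, virtual_lbl = embedding_mixup(emb, lbl)
```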
APA, Harvard, Vancouver, ISO, and other styles
7

Li, Nuo Ph D. Massachusetts Institute of Technology. "Unsupervised learning of invariant object representation in primate visual cortex." Thesis, Massachusetts Institute of Technology, 2011. http://hdl.handle.net/1721.1/65288.

Full text
Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Brain and Cognitive Sciences, 2011.
Cataloged from PDF version of thesis.
Includes bibliographical references.
Visual object recognition (categorization and identification) is one of the most fundamental cognitive functions for our survival. Our visual system has the remarkable ability to convey to us visual object and category information in a manner that is largely tolerant ("invariant") to the exact position, size, and pose of the object, illumination, and clutter. The ventral visual stream in non-human primates has solved this problem. At the highest stage of the visual hierarchy, the inferior temporal cortex (IT), neurons have selectivity for objects and maintain that selectivity across variations in the images. A reasonably sized population of these tolerant neurons can support object recognition. However, we do not yet understand how IT neurons construct this neuronal tolerance. The aim of this thesis is to tackle this question and to examine the hypothesis that the ventral visual stream may leverage experience to build its neuronal tolerance. One potentially powerful idea is that time can act as an implicit teacher, in that each object's identity tends to remain temporally stable, so different retinal images of the same object are temporally contiguous. In theory, the ventral stream could take advantage of this natural tendency and learn to associate together the neuronal representations of temporally contiguous retinal images to yield tolerant object selectivity in IT cortex. In this thesis, I report neuronal support for this hypothesis in IT of non-human primates. First, targeted alteration of temporally contiguous experience with object images at different retinal positions rapidly reshaped IT neurons' position tolerance. Second, a similar temporal contiguity manipulation of experience with object images at different sizes similarly reshaped IT size tolerance. These experience-induced effects were similar in magnitude, grew gradually stronger with increasing visual experience, and were large in size. Taken together, these studies show that unsupervised, temporally contiguous experience can reshape and build at least two types of IT tolerance, and that it can do so under a wide range of spatiotemporal regimes encountered during natural visual exploration. These results suggest that the ventral visual stream uses temporally contiguous visual experience with a general unsupervised tolerance learning (UTL) mechanism to build its invariant object representation.
by Nuo Li.
Ph.D.
APA, Harvard, Vancouver, ISO, and other styles
8

Dalens, Théophile. "Learnable factored image representation for visual discovery." Thesis, Paris Sciences et Lettres (ComUE), 2019. http://www.theses.fr/2019PSLEE036.

Full text
Abstract:
L'objectif de cette thèse est de développer des outils pour analyser les collections d'images temporelles afin d'identifier et de mettre en évidence les tendances visuelles à travers le temps. Cette thèse propose une approche pour l'analyse de données visuelles non appariées annotées avec le temps en générant à quoi auraient ressemblé les images si elles avaient été d'époques différentes. Pour isoler et transférer les variations d'apparence dépendantes du temps, nous introduisons un nouveau module bilinéaire de séparation de facteurs qui peut être entraîné. Nous analysons sa relation avec les représentations factorisées classiques et les auto-encodeurs basés sur la concaténation. Nous montrons que ce nouveau module présente des avantages par rapport à un module standard de concaténation lorsqu'il est utilisé dans une architecture de réseau de neurones convolutionnel encodeur-décodeur à goulot. Nous montrons également qu'il peut être inséré dans une architecture récente de traduction d'images à adversaire, permettant la transformation d'images à différentes périodes de temps cibles en utilisant un seul réseau
This thesis proposes an approach for analyzing unpaired visual data annotated with time stamps by generating how images would have looked like if they were from different times. To isolate and transfer time dependent appearance variations, we introduce a new trainable bilinear factor separation module. We analyze its relation to classical factored representations and concatenation-based auto-encoders. We demonstrate this new module has clear advantages compared to standard concatenation when used in a bottleneck encoder-decoder convolutional neural network architecture. We also show that it can be inserted in a recent adversarial image translation architecture, enabling the image transformation to multiple different target time periods using a single network
APA, Harvard, Vancouver, ISO, and other styles
9

Jonaityte, Inga <1981>. "Visual representation and financial decision making." Doctoral thesis, Università Ca' Foscari Venezia, 2014. http://hdl.handle.net/10579/4593.

Full text
Abstract:
This thesis experimentally addresses three topics concerning the effects of visual representations on financial decision making. First, we hypothesize that the visual representation of financial information affects comprehension and decision-making processes and outcomes. To test this hypothesis, we conducted online experiments demonstrating that the choice of visual representation leads to shifts in attention, comprehension, and evaluation of the information. The second study focuses on the ability of financial advisers to provide expert judgment to aid naïve consumers facing financial decisions. We found that advertising content significantly affects both experts and novices. Our results provide a previously underexplored viewpoint on decision making by finance professionals. The third topic concerns our ability to learn from multiple cues, adapt to changes, and develop new strategies. We investigated the effects of salient cues and environmental changes on learning, and found, among other things, that "abrupt" transformations in an environment are more harmful than "smooth" ones.
APA, Harvard, Vancouver, ISO, and other styles
10

Büchler, Uta [Verfasser], and Björn [Akademischer Betreuer] Ommer. "Visual Representation Learning with Minimal Supervision / Uta Büchler ; Betreuer: Björn Ommer." Heidelberg : Universitätsbibliothek Heidelberg, 2021. http://d-nb.info/1225868505/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Sanakoyeu, Artsiom [Verfasser], and Björn [Akademischer Betreuer] Ommer. "Visual Representation Learning with Limited Supervision / Artsiom Sanakoyeu ; Betreuer: Björn Ommer." Heidelberg : Universitätsbibliothek Heidelberg, 2021. http://d-nb.info/1231632488/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

Anand, Gaurangi. "Unsupervised visual perception-based representation learning for time-series and trajectories." Thesis, Queensland University of Technology, 2021. https://eprints.qut.edu.au/212901/1/Gaurangi_Anand_Thesis.pdf.

Full text
Abstract:
Representing time series without relying on domain knowledge, and independently of the end task, is a challenging problem. The same applies to trajectory data, where sufficient labelled information is often unavailable to learn effective representations. This thesis addresses this problem and explores unsupervised ways of representing temporal data. The novel methods, based on deep learning, imitate human visual perception of pictorial depictions of such data.
APA, Harvard, Vancouver, ISO, and other styles
13

Jones, Carl. "Localisation and representation of visual memory in the domestic chick." Thesis, University of Sussex, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.324183.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Wang, Qian. "Zero-shot visual recognition via latent embedding learning." Thesis, University of Manchester, 2018. https://www.research.manchester.ac.uk/portal/en/theses/zeroshot-visual-recognition-via-latent-embedding-learning(bec510af-6a53-4114-9407-75212e1a08e1).html.

Full text
Abstract:
Traditional supervised visual recognition methods require a great number of annotated examples for each class of interest. The collection and annotation of visual data (e.g., images and videos) can be laborious, tedious and time-consuming when the number of classes involved is very large. In addition, there are situations in which the test instances come from novel classes for which training examples are unavailable at training time. These issues can be addressed by zero-shot learning (ZSL), an emerging machine learning technique enabling the recognition of novel classes. The key issue in zero-shot visual recognition is the semantic gap between visual and semantic representations. We address this issue in this thesis from three different perspectives: visual representations, semantic representations and the learning models. We first propose a novel bidirectional latent embedding framework for zero-shot visual recognition. By learning a latent space from visual representations and labelling information of the training examples, instances of different classes can be mapped into the latent space while preserving both visual and semantic relatedness, hence bridging the semantic gap. We conduct experiments on both object and human action recognition benchmarks to validate the effectiveness of the proposed ZSL framework. We then extend ZSL to the multi-label scenario for multi-label zero-shot human action recognition based on weakly annotated video data. We employ a long short-term memory (LSTM) neural network to explore the multiple actions underlying the video data. A joint latent space is learned by two component models (i.e., the visual model and the semantic model) to bridge the semantic gap. The two component embedding models are trained alternately to optimize ranking-based objectives. Extensive experiments are carried out on two multi-label human action datasets to evaluate the proposed framework. Finally, we propose alternative semantic representations for human actions, narrowing the semantic gap from the perspective of semantic representation. A simple yet effective solution based on the exploration of web data is investigated to enhance the semantic representations of human actions. The novel semantic representations are shown to benefit zero-shot human action recognition significantly compared to traditional attributes and word vectors. In summary, we propose novel frameworks for zero-shot visual recognition that narrow and bridge the semantic gap, and achieve state-of-the-art performance in different settings on multiple benchmarks.
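A small sketch of the basic zero-shot prediction rule that embedding-based ZSL methods share: project visual features into the semantic space and pick the nearest unseen-class embedding. This is a generic baseline, not the bidirectional latent embedding proposed in the thesis; all tensors are stand-ins.

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(visual_feat, W, class_embeddings):
    """
    visual_feat:      (N, Dv) image features.
    W:                (Dv, Ds) learned projection from visual to semantic space.
    class_embeddings: (C, Ds) attribute / word vectors of *unseen* classes.
    Returns the index of the nearest unseen class for each image.
    """
    projected = F.normalize(visual_feat @ W, dim=1)
    classes = F.normalize(class_embeddings, dim=1)
    return (projected @ classes.t()).argmax(dim=1)

# Stand-in features, projection and class embeddings.
preds = zero_shot_predict(torch.randn(5, 2048), torch.randn(2048, 300), torch.randn(10, 300))
```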
APA, Harvard, Vancouver, ISO, and other styles
15

Xu, Dan. "Exploring Multi-Modal and Structured Representation Learning for Visual Image and Video Understanding." Doctoral thesis, Università degli studi di Trento, 2018. https://hdl.handle.net/11572/367610.

Full text
Abstract:
With the explosive growth of visual data, it is particularly important to develop intelligent visual understanding techniques that can deal with large amounts of data. Many efforts have been made in recent years to build highly effective and large-scale visual processing algorithms and systems. One of the core aspects of this research line is how to learn robust representations that better describe the data. In this thesis we study the problem of visual image and video understanding and, specifically, we address the problem by designing and implementing novel multi-modal and structured representation learning approaches, both of which are fundamental research hot-spots in machine learning. Multi-modal representation learning involves relating information from multiple input sources, while structured representation learning explores rich structural information hidden in the data for robust feature learning. We investigate both shallow representation learning frameworks, such as dictionary learning, and deep representation learning frameworks, such as deep neural networks, and present different modules devised in our works, consisting of cross-paced representation learning, cross-modal feature learning and transfer, multi-scale structured prediction and fusion, and multi-modal prediction and distillation. These techniques are further applied to various visual understanding topics, i.e., sketch-based image retrieval (SBIR), video pedestrian detection, monocular depth estimation, and scene parsing, showing superior performance.
APA, Harvard, Vancouver, ISO, and other styles
16

Xu, Dan. "Exploring Multi-Modal and Structured Representation Learning for Visual Image and Video Understanding." Doctoral thesis, University of Trento, 2018. http://eprints-phd.biblio.unitn.it/2918/1/disclaimer.pdf.

Full text
Abstract:
With the explosive growth of visual data, it is particularly important to develop intelligent visual understanding techniques that can deal with large amounts of data. Many efforts have been made in recent years to build highly effective and large-scale visual processing algorithms and systems. One of the core aspects of this research line is how to learn robust representations that better describe the data. In this thesis we study the problem of visual image and video understanding and, specifically, we address the problem by designing and implementing novel multi-modal and structured representation learning approaches, both of which are fundamental research hot-spots in machine learning. Multi-modal representation learning involves relating information from multiple input sources, while structured representation learning explores rich structural information hidden in the data for robust feature learning. We investigate both shallow representation learning frameworks, such as dictionary learning, and deep representation learning frameworks, such as deep neural networks, and present different modules devised in our works, consisting of cross-paced representation learning, cross-modal feature learning and transfer, multi-scale structured prediction and fusion, and multi-modal prediction and distillation. These techniques are further applied to various visual understanding topics, i.e., sketch-based image retrieval (SBIR), video pedestrian detection, monocular depth estimation, and scene parsing, showing superior performance.
APA, Harvard, Vancouver, ISO, and other styles
17

Lee, Wooyoung. "Learning Statistical Features of Scene Images." Research Showcase @ CMU, 2014. http://repository.cmu.edu/dissertations/540.

Full text
Abstract:
Scene perception is a fundamental aspect of vision. Humans are capable of analyzing behaviorally relevant scene properties, such as spatial layouts or scene categories, very quickly, even from low-resolution versions of scenes. Although humans perform these tasks effortlessly, they are very challenging for machines. Developing methods that faithfully capture the properties of the representation used by the visual system will be useful for building computational models that are more consistent with perception. While it is common to use hand-engineered features that extract information from predefined dimensions, such features require careful tuning of parameters and do not generalize well to other tasks or larger datasets. This thesis is driven by the hypothesis that perceptual representations are adapted to the statistical properties of natural visual scenes. To develop statistical features for global-scale structures (low spatial frequency information that encompasses entire scenes), I propose to train hierarchical probabilistic models on whole scene images. I first investigate statistical clusters of scene images by training a mixture model under the assumption that each image can be decoded by sparse and independent coefficients. Each cluster discovered by the unsupervised classifier is consistent with high-level semantic categories (such as indoor, outdoor-natural and outdoor-manmade) as well as perceptual layout properties (mean depth, openness and perspective). To address the limitation of mixture models, namely their assumption of a discrete number of underlying clusters, I further investigate a continuous representation for the distributions of whole scenes. The model parameters optimized for natural visual scenes reveal a compact representation that encodes their global-scale structures. I develop a probabilistic similarity measure based on the model and demonstrate its consistency with perceptual similarities. Lastly, to learn representations that better encode the manifold structures in general high-dimensional image space, I develop an image normalization process to find a set of canonical images that anchor the probabilistic distributions around the real data manifolds. The canonical images are employed as the centers of conditional multivariate Gaussian distributions. This approach allows learning more detailed structures of the local manifolds, resulting in an improved representation of the high-level properties of scene images.
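A toy sketch in the spirit of the first contribution: clustering whole-scene images with a mixture model fitted on decorrelated image coefficients. The thesis uses sparse, independent coefficients rather than plain PCA, and the data below are random stand-ins.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Stand-in for low-resolution whole-scene images flattened into vectors.
scenes = np.random.rand(500, 32 * 32)

# Decorrelate/compress the images, then fit a mixture model; each component
# plays the role of an unsupervised scene cluster (indoor, open outdoor, ...).
codes = PCA(n_components=64).fit_transform(scenes)
gmm = GaussianMixture(n_components=5, covariance_type='diag').fit(codes)
clusters = gmm.predict(codes)
```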
APA, Harvard, Vancouver, ISO, and other styles
18

Azizpour, Hossein. "Visual Representations and Models: From Latent SVM to Deep Learning." Doctoral thesis, KTH, Datorseende och robotik, CVAP, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-192289.

Full text
Abstract:
Two important components of a visual recognition system are the representation and the model. Both involve the selection and learning of the features that are indicative for recognition and the discarding of those features that are uninformative. This thesis, in its general form, proposes different techniques within the frameworks of two learning systems for representation and modeling, namely latent support vector machines (latent SVMs) and deep learning. First, we propose various approaches to group the positive samples into clusters of visually similar instances. Given a fixed representation, the sampled space of the positive distribution is usually structured. The proposed clustering techniques include a novel similarity measure based on exemplar learning, an approach for using additional annotation, and an augmentation of the latent SVM that automatically finds clusters whose members can be reliably distinguished from the background class. In another effort, a strongly supervised DPM is suggested to study how these models can benefit from privileged information. The extra information comes in the form of semantic part annotations (i.e., their presence and location), which are used to constrain the DPM's latent variables during or prior to the optimization of the latent SVM. Its effectiveness is demonstrated on the task of animal detection. Finally, we generalize the formulation of discriminative latent variable models, including DPMs, to incorporate a new set of latent variables representing the structure or properties of negative samples; we therefore term them negative latent variables. We show that this generalization affects state-of-the-art techniques and helps visual recognition by explicitly searching for counter-evidence of an object's presence. Following the resurgence of deep networks, the last works of this thesis focus on deep learning in order to produce a generic representation for visual recognition. A Convolutional Network (ConvNet) is trained on a large annotated image classification dataset called ImageNet, with roughly 1.3 million images. The activations at each layer of the trained ConvNet can then be treated as the representation of an input image. We show that such a representation is surprisingly effective for various recognition tasks, making it clearly superior to all the handcrafted features previously used in visual recognition (such as HOG in our first works on DPM). We further investigate the ways in which one can improve this representation for a task in mind. We propose various factors, applied before or after the training of the representation, that can improve the efficacy of the ConvNet representation. These factors are analyzed on 16 datasets from various subfields of visual recognition.

QC 20160908

APA, Harvard, Vancouver, ISO, and other styles
19

Varol, Gül. "Learning human body and human action representations from visual data." Thesis, Paris Sciences et Lettres (ComUE), 2019. http://www.theses.fr/2019PSLEE029.

Full text
Abstract:
The focus of visual content is often people. Automatic analysis of people from visual data is therefore of great importance for numerous applications in content search, autonomous driving, surveillance, health care, and entertainment. The goal of this thesis is to learn visual representations for human understanding. Particular emphasis is given to two closely related areas of computer vision: human body analysis and human action recognition. In summary, our contributions are the following: (i) we generate photo-realistic synthetic data for people that allow training CNNs for human body analysis, (ii) we propose a multi-task architecture to recover a volumetric body shape from a single image, (iii) we study the benefits of long-term temporal convolutions for human action recognition using 3D CNNs, (iv) we incorporate similarity training on multi-view videos to design view-independent representations for action recognition.
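A toy sketch of a 3D ConvNet as referred to in contribution (iii), where convolutions span space and time so the temporal extent is set by the clip length; the layer sizes and clip shape are illustrative, not the thesis's architecture.

```python
import torch
import torch.nn as nn

class Tiny3DConvNet(nn.Module):
    """Toy 3D ConvNet: space-time convolutions over a clip of T frames."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                    # pool only spatially
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                    # global space-time pooling
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip):                            # clip: (N, 3, T, H, W)
        return self.classifier(self.features(clip).flatten(1))

# A 16-frame clip at 112x112 resolution as a stand-in input.
logits = Tiny3DConvNet()(torch.randn(2, 3, 16, 112, 112))
```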
APA, Harvard, Vancouver, ISO, and other styles
20

Krawec, Jennifer Lee. "Problem Representation and Mathematical Problem Solving of Students of Varying Math Ability." Scholarly Repository, 2010. http://scholarlyrepository.miami.edu/oa_dissertations/455.

Full text
Abstract:
The purpose of this study was to examine differences in math problem solving among students with learning disabilities (LD), low-achieving (LA) students, and average-achieving (AA) students. The primary interest was to analyze the problem representation processes students use to translate and integrate problem information as they solve math word problems. Problem representation processes were operationalized as (a) paraphrasing the problem and (b) visually representing the problem. Paraphrasing accuracy (i.e., paraphrasing relevant information, paraphrasing irrelevant linguistic information, and paraphrasing irrelevant numerical information), visual representation accuracy (i.e., visual representation of relevant information, visual representation of irrelevant linguistic information, and visual representation of irrelevant numerical information), and problem-solving accuracy were measured in eighth-grade students with LD (n = 25), LA students (n = 30), and AA students (n = 29) using a researcher-modified version of the Mathematical Processing Instrument (MPI). Results indicated that problem-solving accuracy was significantly and positively correlated to relevant information in both the paraphrasing and the visual representation phases and significantly negatively correlated to linguistic and numerical irrelevant information for the two constructs. When separated by ability, students with LD showed a different profile as compared to the LA and AA students with respect to the relationships among the problem-solving variables. Mean differences showed that students with LD differed significantly from LA students in that they paraphrased less relevant information and also visually represented less irrelevant numerical information. Paraphrasing accuracy and visual representation accuracy were each shown to account for a statistically significant amount of variance in problem-solving accuracy when entered in a hierarchical model. Finally, the relationship between visual representation of relevant information and problem-solving accuracy was shown to be dependent on ability after controlling for the problem-solving variables and ability. Implications for classroom instruction for students with and without LD are discussed.
APA, Harvard, Vancouver, ISO, and other styles
21

Zhao, Yongheng. "3D feature representations for visual perception and geometric shape understanding." Doctoral thesis, Università degli studi di Padova, 2019. http://hdl.handle.net/11577/3424787.

Full text
Abstract:
In this thesis, we first present a unified look at several well-known 3D feature representations, ranging from hand-crafted designs to learning-based ones. We then propose three kinds of feature representations from both RGB-D data and point clouds, addressing different problems and aiming at different functionality. With RGB-D data, we address existing problems of 2D feature representation in visual perception by integrating 3D information. We propose an RGB-D feature representation that fuses an object's statistical color model and depth information in a probabilistic manner. The depth information not only enhances the discriminative power of the model against clutter at a different range, but can also be used as a constraint to properly update the model and reduce model drift. The proposed representation is then evaluated within our proposed object tracking algorithm (named MS3D) on a public RGB-D object tracking dataset. It runs in real time and produces the best results compared with other state-of-the-art RGB-D trackers. Furthermore, we integrate the MS3D tracker into an RGB-D camera network in order to handle long-term and full occlusion. The accuracy and robustness of our algorithm are evaluated on our presented dataset, and the results suggest that our algorithm is able to track multiple objects accurately and continuously over the long term. For 3D point clouds, current deep learning-based feature representations often discard spatial arrangements in the data, hence falling short of respecting the parts-to-whole relationship, which is critical to explain and describe 3D shapes. Addressing this problem, we propose 3D point-capsule networks, an autoencoder designed for unsupervised learning of feature representations from sparse 3D point clouds while preserving the spatial arrangements of the input data in different feature attentions. 3D capsule networks arise as a direct consequence of our unified formulation of common 3D autoencoders. The dynamic routing scheme and the peculiar 2D latent feature representation deployed by our capsule networks bring improvements for several common point cloud-related tasks, such as object classification, object reconstruction and part segmentation, as substantiated by our extensive evaluations. Moreover, they enable new applications such as part interpolation and replacement. Finally, towards rotation equivariance of the 3D feature representation, we present a 3D capsule architecture for processing point clouds that is equivariant with respect to the SO(3) rotation group, translation, and permutation of the unordered input sets. The network operates on a sparse set of local reference frames computed from an input point cloud and establishes end-to-end equivariance through a novel 3D quaternion group capsule layer, including an equivariant dynamic routing procedure. The capsule layer enables us to disentangle geometry from pose, paving the way for more informative descriptions and a more structured latent space. In the process, we theoretically connect the process of dynamic routing between capsules to the well-known Weiszfeld algorithm, a scheme for solving iteratively re-weighted least squares (IRLS) problems with provable convergence properties, enabling robust pose estimation between capsule layers. Thanks to the sparse equivariant quaternion capsules, our architecture allows joint object classification and orientation estimation, which we validate empirically on common benchmark datasets.
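A small sketch of the Chamfer reconstruction loss commonly used to train point-cloud autoencoders like the one described above; the capsule encoder and decoder themselves are not reproduced here, and the point sets are random stand-ins.

```python
import torch

def chamfer_distance(p1, p2):
    """Symmetric Chamfer distance between two point sets (N, 3) and (M, 3),
    a standard reconstruction loss for point-cloud autoencoders."""
    d = torch.cdist(p1, p2)                    # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

original = torch.rand(1024, 3)                 # input point cloud
reconstructed = torch.rand(1024, 3)            # decoder output
loss = chamfer_distance(original, reconstructed)
```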
APA, Harvard, Vancouver, ISO, and other styles
22

Rouhafzay, Ghazal. "3D Object Representation and Recognition Based on Biologically Inspired Combined Use of Visual and Tactile Data." Thesis, Université d'Ottawa / University of Ottawa, 2021. http://hdl.handle.net/10393/42122.

Full text
Abstract:
Recent research makes use of biologically inspired computation and artificial intelligence as efficient means to solve real-world problems. Humans show a significant performance in extracting and interpreting visual information. In the cases where visual data is not available, or, for example, if it fails to provide comprehensive information due to occlusions, tactile exploration assists in the interpretation and better understanding of the environment. This cooperation between human senses can serve as an inspiration to embed a higher level of intelligence in computational models. In the context of this research, in the first step, computational models of visual attention are explored to determine salient regions on the surface of objects. Two different approaches are proposed. The first approach takes advantage of a series of contributing features in guiding human visual attention, namely color, contrast, curvature, edge, entropy, intensity, orientation, and symmetry are efficiently integrated to identify salient features on the surface of 3D objects. This model of visual attention also learns to adaptively weight each feature based on ground-truth data to ensure a better compatibility with human visual exploration capabilities. The second approach uses a deep Convolutional Neural Network (CNN) for feature extraction from images collected from 3D objects and formulates saliency as a fusion map of regions where the CNN looks at, while classifying the object based on their geometrical and semantic characteristics. The main difference between the outcomes of the two algorithms is that the first approach results in saliencies spread over the surface of the objects while the second approach highlights one or two regions with concentrated saliency. Therefore, the first approach is an appropriate simulation of visual exploration of objects, while the second approach successfully simulates the eye fixation locations on objects. In the second step, the first computational model of visual attention is used to determine scattered salient points on the surface of objects based on which simplified versions of 3D object models preserving the important visual characteristics of objects are constructed. Subsequently, the thesis focuses on the topic of tactile object recognition, leveraging the proposed model of visual attention. Beyond the sensor technologies which are instrumental in ensuring data quality, biological models can also assist in guiding the placement of sensors and support various selective data sampling strategies that allow exploring an object’s surface faster. Therefore, the possibility to guide the acquisition of tactile data based on the identified visually salient features is tested and validated in this research. Different object exploration and data processing approaches were used to identify the most promising solution. Our experiments confirm the effectiveness of computational models of visual attention as a guide for data selection for both simplifying 3D representation of objects as well as enhancing tactile object recognition. In particular, the current research demonstrates that: (1) the simplified representation of objects by preserving visually salient characteristics shows a better compatibility with human visual capabilities compared to uniformly simplified models, and (2) tactile data acquired based on salient visual features are more informative about the objects’ characteristics and can be employed in tactile object manipulation and recognition scenarios. 
In the last section, the thesis addresses the issue of transfer of learning from vision to touch. Inspired by biological studies that attest to similarities between the processing of visual and tactile stimuli in the human brain, the thesis studies the possibility of transferring learning from vision to touch using deep learning architectures and proposes a hybrid CNN that handles both visual and tactile object recognition.
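A toy sketch of the sampling step described above: picking the most salient locations of a saliency map as candidate contact points for tactile exploration. The saliency map here is random; in the thesis it comes from the learned visual attention models.

```python
import numpy as np

def sample_salient_points(saliency, k=50):
    """Picks the k most salient pixel locations from a 2D saliency map,
    e.g. as candidate contact points for tactile exploration."""
    flat = saliency.ravel()
    idx = np.argpartition(flat, -k)[-k:]            # indices of the k largest values
    rows, cols = np.unravel_index(idx, saliency.shape)
    return np.stack([rows, cols], axis=1)           # (k, 2) pixel coordinates

saliency_map = np.random.rand(240, 320)             # stand-in saliency map
contacts = sample_salient_points(saliency_map, k=50)
```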
APA, Harvard, Vancouver, ISO, and other styles
23

Plebe, Alice. "Cognitively Guided Modeling of Visual Perception in Intelligent Vehicles." Doctoral thesis, Università degli studi di Trento, 2021. http://hdl.handle.net/11572/299909.

Full text
Abstract:
This work proposes a strategy for visual perception in the context of autonomous driving. Despite the growing research aiming to implement self-driving cars, no artificial system can yet claim to have reached the driving performance of a human. Humans---when not distracted or drunk---are still the best drivers you can currently find. Hence, theories about the human mind and its neural organization could reveal precious insights on how to design a better autonomous driving agent. This dissertation focuses specifically on the perceptual aspect of driving, and it takes inspiration from four key theories on how the human brain achieves the cognitive capabilities required by the activity of driving. The first idea lies at the foundation of current cognitive science, and it argues that thinking nearly always involves some sort of mental simulation, which takes the form of imagery when dealing with visual perception. The second theory explains how the perceptual simulation takes place in neural circuits called convergence-divergence zones, which expand and compress information to extract abstract concepts from visual experience and code them into compact representations. The third theory highlights that perception---when specialized for a complex task such as driving---is refined by experience in a process called perceptual learning. The fourth theory, namely the free-energy principle of predictive brains, corroborates the role of visual imagination as a fundamental mechanism of inference. In order to implement these theoretical principles, it is necessary to identify the most appropriate computational tools currently available. Within the consolidated and successful field of deep learning, I select the artificial architectures and strategies that bear the strongest resemblance to their cognitive counterparts. Specifically, convolutional autoencoders have a strong correspondence with the architecture of convergence-divergence zones and the process of perceptual abstraction. The free-energy principle of predictive brains is related to variational Bayesian inference and the use of recurrent neural networks. In fact, this principle can be translated into a training procedure that learns abstract representations predisposed to predicting how the current road scenario will change in the future. The main contribution of this dissertation is a method to learn conceptual representations of the driving scenario from visual information. This approach enforces a semantic internal organization, in the sense that distinct parts of the representation are explicitly associated with specific concepts useful in the context of driving. Specifically, the model uses as few as 16 neurons for each of the two basic concepts considered here: vehicles and lanes. At the same time, the approach biases the internal representations towards the ability to predict the dynamics of objects in the scene. This property of temporal coherence allows the representations to be exploited to predict plausible future scenarios and to perform a simplified form of mental imagery. In addition, this work includes a proposal to tackle the problem of opaqueness affecting deep neural networks. I present a method that aims to mitigate this issue in the context of longitudinal control for automated vehicles. A further contribution of this dissertation experiments with higher-level spaces of prediction, such as occupancy grids, which could reconcile direct application to motor control with biological plausibility.
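A compact sketch of the kind of model the abstract describes: a convolutional autoencoder whose compact latent code is also trained to predict the next frame's code, giving the temporal coherence mentioned above. The architecture and sizes are illustrative, not the dissertation's model.

```python
import torch
import torch.nn as nn

class DrivingSceneAE(nn.Module):
    """Toy convolutional autoencoder with a latent-space dynamics head that
    predicts the next frame's code, loosely echoing the convergence-divergence idea."""
    def __init__(self, latent=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Flatten(), nn.Linear(32 * 16 * 16, latent),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent, 32 * 16 * 16), nn.Unflatten(1, (32, 16, 16)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )
        self.predict_next = nn.Linear(latent, latent)   # latent-space dynamics

    def forward(self, frame_t):
        z_t = self.encoder(frame_t)
        return self.decoder(z_t), self.predict_next(z_t)

# Reconstruction of the current frame and a prediction of the next latent code.
recon, z_next_pred = DrivingSceneAE()(torch.rand(1, 3, 64, 64))
```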
APA, Harvard, Vancouver, ISO, and other styles
24

Plebe, Alice. "Cognitively Guided Modeling of Visual Perception in Intelligent Vehicles." Doctoral thesis, Università degli studi di Trento, 2021. http://hdl.handle.net/11572/299909.

Full text
Abstract:
This work proposes a strategy for visual perception in the context of autonomous driving. Despite the growing research aiming to implement self-driving cars, no artificial system can yet claim to have reached the driving performance of a human. Humans---when not distracted or drunk---are still the best drivers you can currently find. Hence, theories about the human mind and its neural organization could reveal valuable insights into how to design a better autonomous driving agent. This dissertation focuses specifically on the perceptual aspect of driving, and it takes inspiration from four key theories on how the human brain achieves the cognitive capabilities required by the activity of driving. The first idea lies at the foundation of current cognitive science, and it argues that thinking nearly always involves some sort of mental simulation, which takes the form of imagery when dealing with visual perception. The second theory explains how the perceptual simulation takes place in neural circuits called convergence-divergence zones, which expand and compress information to extract abstract concepts from visual experience and code them into compact representations. The third theory highlights that perception---when specialized for a task as complex as driving---is refined by experience in a process called perceptual learning. The fourth theory, namely the free-energy principle of predictive brains, corroborates the role of visual imagination as a fundamental mechanism of inference. In order to implement these theoretical principles, it is necessary to identify the most appropriate computational tools currently available. Within the consolidated and successful field of deep learning, I select the artificial architectures and strategies that bear a sound resemblance to their cognitive counterparts. Specifically, convolutional autoencoders have a strong correspondence with the architecture of convergence-divergence zones and the process of perceptual abstraction. The free-energy principle of predictive brains is related to variational Bayesian inference and the use of recurrent neural networks. In fact, this principle can be translated into a training procedure that learns abstract representations predisposed to predicting how the current road scenario will change in the future. The main contribution of this dissertation is a method to learn conceptual representations of the driving scenario from visual information. This approach forces a semantic internal organization, in the sense that distinct parts of the representation are explicitly associated with specific concepts useful in the context of driving. Specifically, the model uses as few as 16 neurons for each of the two basic concepts considered here: vehicles and lanes. At the same time, the approach biases the internal representations towards the ability to predict the dynamics of objects in the scene. This property of temporal coherence allows the representations to be exploited to predict plausible future scenarios and to perform a simplified form of mental imagery. In addition, this work includes a proposal to tackle the problem of opaqueness affecting deep neural networks. I present a method that aims to mitigate this issue in the context of longitudinal control for automated vehicles. A further contribution of this dissertation experiments with higher-level prediction spaces, such as occupancy grids, which could reconcile direct applicability to motor control with biological plausibility.
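To make the architecture sketched in this abstract concrete, the following is a minimal, hypothetical Python/PyTorch sketch of a convolutional autoencoder whose bottleneck is split into two 16-unit concept blocks (vehicles and lanes), plus a head that predicts the next latent code for temporal coherence; layer sizes and the prediction head are illustrative assumptions, not the thesis implementation:

# Hedged sketch (not the author's code): a convolutional autoencoder whose
# bottleneck is split into two 16-unit concept blocks, with a simple
# next-step predictor to encourage temporal coherence.
import torch
import torch.nn as nn

class ConceptAutoencoder(nn.Module):
    def __init__(self, concept_dim=16):
        super().__init__()
        self.concept_dim = concept_dim
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64x64 -> 32x32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32x32 -> 16x16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 2 * concept_dim),               # [vehicles | lanes]
        )
        self.decoder = nn.Sequential(
            nn.Linear(2 * concept_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )
        self.predictor = nn.Linear(2 * concept_dim, 2 * concept_dim)  # next-step latent

    def forward(self, frame):
        z = self.encoder(frame)
        vehicles, lanes = z[:, :self.concept_dim], z[:, self.concept_dim:]
        return self.decoder(z), self.predictor(z), (vehicles, lanes)

frames = torch.rand(4, 3, 64, 64)                   # a toy batch of road images
reconstruction, predicted_next_z, concepts = ConceptAutoencoder()(frames)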
APA, Harvard, Vancouver, ISO, and other styles
25

Yaner, Patrick William. "From Shape to Function: Acquisition of Teleological Models from Design Drawings by Compositional Analogy." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2007. http://hdl.handle.net/1853/19791.

Full text
Abstract:
Thesis (Ph.D.)--Computing, Georgia Institute of Technology, 2008.
Committee Chair: Goel, Ashok; Committee Member: Eastman, Charles; Committee Member: Ferguson, Ronald; Committee Member: Glasgow, Janice; Committee Member: Nersessian, Nancy; Committee Member: Ram, Ashwin.
APA, Harvard, Vancouver, ISO, and other styles
26

Garg, Sourav. "Robust visual place recognition under simultaneous variations in viewpoint and appearance." Thesis, Queensland University of Technology, 2019. https://eprints.qut.edu.au/134410/1/Sourav%20Garg%20Thesis.pdf.

Full text
Abstract:
This thesis explores the problem of visual place recognition and localization for a mobile robot, particularly dealing with the challenges of simultaneous variations in scene appearance and camera viewpoint. The proposed methods draw inspiration from humans and make use of semantic cues to represent places. This approach enables effective place recognition from similar or opposing viewpoints, despite variations in scene appearance caused by different times of day or seasons. The research contributions presented in the thesis advance visual place recognition techniques, making them more useful for deployment in a wide range of robotic and autonomous vehicle scenarios.
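As a rough illustration of semantics-based place description (not the thesis method itself), the sketch below represents a place by a normalized histogram of semantic labels and compares places with cosine similarity; the label set and data are invented for the example:

# Hedged sketch: a viewpoint-tolerant place descriptor built from semantic labels.
import numpy as np

LABELS = ["building", "road", "vegetation", "sky", "pole", "vehicle"]

def semantic_descriptor(pixel_labels):
    """pixel_labels: 2-D array of label indices for one image."""
    hist = np.bincount(pixel_labels.ravel(), minlength=len(LABELS)).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-12)

def place_similarity(desc_a, desc_b):
    return float(np.dot(desc_a, desc_b))      # cosine similarity of unit vectors

query = semantic_descriptor(np.random.randint(0, len(LABELS), size=(120, 160)))
reference = semantic_descriptor(np.random.randint(0, len(LABELS), size=(120, 160)))
print(f"similarity: {place_similarity(query, reference):.3f}")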
APA, Harvard, Vancouver, ISO, and other styles
27

Goh, Hanlin. "Learning deep visual representations." Paris 6, 2013. http://www.theses.fr/2013PA066356.

Full text
Abstract:
Les avancées récentes en apprentissage profond et en traitement d'image présentent l'opportunité d'unifier ces deux champs de recherche complémentaires pour une meilleure résolution du problème de classification d'images dans des catégories sémantiques. L'apprentissage profond apporte au traitement d'image le pouvoir de représentation nécessaire à l'amélioration des performances des méthodes de classification d'images. Cette thèse propose de nouvelles méthodes d'apprentissage de représentations visuelles profondes pour la résolution de cette tâche. L'apprentissage profond a été abordé sous deux angles. D'abord nous nous sommes intéressés à l'apprentissage non supervisé de représentations latentes ayant certaines propriétés à partir de données en entrée. Il s'agit ici d'intégrer une connaissance a priori, à travers un terme de régularisation, dans l'apprentissage d'une machine de Boltzmann restreinte (RBM). Nous proposons plusieurs formes de régularisation qui induisent différentes propriétés telles que la parcimonie, la sélectivité et l'organisation en structure topographique. Le second aspect consiste au passage graduel de l'apprentissage non supervisé à l'apprentissage supervisé de réseaux profonds. Ce but est réalisé par l'introduction, sous forme de supervision, d'une information relative à la catégorie sémantique. Deux nouvelles méthodes sont proposées. La première est basée sur une régularisation top-down de réseaux de croyance profonds à base de RBMs. La seconde optimise un coût intégrant un critère de reconstruction et un critère de supervision pour l'entraînement d'autoencodeurs profonds. Les méthodes proposées ont été appliquées au problème de classification d'images. Nous avons adopté le modèle sac-de-mots comme modèle de base parce qu'il offre d'importantes possibilités grâce à l'utilisation de descripteurs locaux robustes et de pooling par pyramides spatiales qui prennent en compte l'information spatiale de l'image. L'apprentissage profond avec agrégation spatiale est utilisé pour apprendre un dictionnaire hiérarchique pour l'encodage de représentations visuelles de niveau intermédiaire. Cette méthode donne des résultats très compétitifs en classification de scènes et d'images. Les dictionnaires visuels appris contiennent diverses informations non-redondantes ayant une structure spatiale cohérente. L'inférence est aussi très rapide. Nous avons par la suite optimisé l'étape de pooling sur la base du codage produit par le dictionnaire hiérarchique précédemment appris, en introduisant une nouvelle paramétrisation dérivable de l'opération de pooling qui permet un apprentissage par descente de gradient utilisant l'algorithme de rétro-propagation. Ceci est la première tentative d'unification de l'apprentissage profond et du modèle de sac de mots. Bien que cette fusion puisse sembler évidente, l'union de plusieurs aspects de l'apprentissage profond de représentations visuelles demeure une tâche complexe à bien des égards et requiert encore un effort de recherche important.
Recent advancements in the areas of deep learning and visual information processing have presented an opportunity to unite both fields. These complementary fields combine to tackle the problem of classifying images into their semantic categories. Deep learning brings learning and representational capabilities to a visual processing model that is adapted for image classification. This thesis addresses these problems and proposes methods for learning deep visual representations for image classification. The problem of deep learning is tackled on two fronts. The first aspect is the problem of unsupervised learning of latent representations from input data. The main focus is the integration of prior knowledge into the learning of restricted Boltzmann machines (RBMs) through regularization. Regularizers are proposed to induce sparsity, selectivity and topographic organization in the coding to improve discrimination and invariance. The second direction introduces the notion of gradually transitioning from unsupervised layer-wise learning to supervised deep learning. This is done through the integration of bottom-up information with top-down signals. Two novel implementations supporting this notion are explored. The first method uses top-down regularization to train a deep network of RBMs. The second method combines predictive and reconstructive loss functions to optimize a stack of encoder-decoder networks. The proposed deep learning techniques are applied to tackle the image classification problem. The bag-of-words model is adopted due to its strengths in image modeling through the use of local image descriptors and spatial pooling schemes. Deep learning with spatial aggregation is used to learn a hierarchical visual dictionary for encoding the image descriptors into mid-level representations. This method achieves leading image classification performance for object and scene images. The learned dictionaries are diverse and non-redundant, and the speed of inference is high. From this, a further optimization is performed for the subsequent pooling step, by introducing a differentiable pooling parameterization and applying the error backpropagation algorithm. This thesis represents one of the first attempts to synthesize deep learning and the bag-of-words model. This union results in many challenging research problems, leaving much room for further study in this area.
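One ingredient mentioned in this abstract, regularized unsupervised learning of an RBM, can be illustrated with a small, hypothetical sketch: contrastive-divergence (CD-1) updates in NumPy with an extra penalty that pushes hidden units towards a low target activation, i.e. a sparsity regularizer. Sizes, the learning rate, the penalty weight and the target rate are assumptions, not the thesis settings:

# Hedged sketch: CD-1 training of an RBM with a sparsity penalty on hidden units.
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr, target_sparsity = 64, 32, 0.05, 0.05
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0):
    h0 = sigmoid(v0 @ W + b_h)                         # hidden probabilities
    h0_sample = (rng.random(h0.shape) < h0).astype(float)
    v1 = sigmoid(h0_sample @ W.T + b_v)                # reconstruction
    h1 = sigmoid(v1 @ W + b_h)
    grad_W = v0[:, None] * h0[None, :] - v1[:, None] * h1[None, :]
    sparsity_grad = (h0 - target_sparsity) * h0 * (1 - h0)   # push activity towards target
    return grad_W, h0 - h1, v0 - v1, sparsity_grad

for _ in range(100):                                   # toy loop on random binary data
    v = (rng.random(n_visible) < 0.3).astype(float)
    gW, gh, gv, gs = cd1_step(v)
    W += lr * (gW - 0.1 * np.outer(v, gs))             # CD gradient minus sparsity penalty
    b_h += lr * (gh - 0.1 * gs)
    b_v += lr * gv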
APA, Harvard, Vancouver, ISO, and other styles
28

Sicilia, Gómez Álvaro. "Supporting Tools for Automated Generation and Visual Editing of Relational-to-Ontology Mappings." Doctoral thesis, Universitat Ramon Llull, 2016. http://hdl.handle.net/10803/398843.

Full text
Abstract:
La integració de dades amb formats heterogenis i de diversos dominis mitjançant tecnologies de la web semàntica permet solucionar la seva disparitat estructural i semàntica. L'accés a dades basat en ontologies (OBDA, en anglès) és una solució integral que es basa en l'ús d'ontologies com esquemes mediadors i el mapatge entre les dades i les ontologies per facilitar la consulta de les fonts de dades. No obstant això, una de les principals barreres que pot dificultar més l'adopció de OBDA és la manca d'eines per donar suport a la creació de mapatges entre dades i ontologies. L'objectiu d'aquesta investigació ha estat desenvolupar noves eines que permetin als experts sense coneixements d'ontologies la creació de mapatges entre dades i ontologies. Amb aquesta finalitat, s'han dut a terme dues línies de treball: la generació automàtica de mapatges entre dades relacionals i ontologies i l'edició dels mapatges a través de la seva representació visual. Les eines actualment disponibles per automatitzar la generació de mapatges estan lluny de proporcionar una solució completa, ja que es basen en els esquemes relacionals i amb prou feines tenen en compte els continguts de la font de dades relacional i les característiques de l'ontologia. No obstant això, les dades poden contenir relacions ocultes que poden ajudar a la generació de mapatges. Per superar aquesta limitació, hem desenvolupat AutoMap4OBDA, un sistema que genera automàticament mapatges R2RML a partir de l'anàlisi dels continguts de la font relacional i tenint en compte les característiques de l'ontologia. El sistema fa servir una tècnica d'aprenentatge d'ontologies per inferir jerarquies de classes, selecciona les mètriques de similitud de cadenes en base a les etiquetes de les ontologies i analitza les estructures de grafs per generar els mapatges a partir de l'estructura de l'ontologia. La representació visual per mitjà d'interfícies intuïtives pot ajudar els usuaris sense coneixements tècnics a establir mapatges entre una font relacional i una ontologia. No obstant això, les eines existents per a l'edició visual de mapatges mostren algunes limitacions. En particular, la representació visual de mapatges no contempla les estructures de la font relacional i de l'ontologia de forma conjunta. Per superar aquest inconvenient, hem desenvolupat Map-On, un entorn visual web per a l'edició manual de mapatges. AutoMap4OBDA ha demostrat que supera les prestacions de les solucions existents per a la generació de mapatges. Map-On s'ha aplicat en projectes d'investigació per verificar la seva eficàcia en la gestió de mapatges.
La integración de datos con formatos heterogéneos y de diversos dominios mediante tecnologías de la Web Semántica permite solventar su disparidad estructural y semántica. El acceso a datos basado en ontologías (OBDA, en inglés) es una solución integral que se basa en el uso de ontologías como esquemas mediadores y mapeos entre los datos y las ontologías para facilitar la consulta de las fuentes de datos. Sin embargo, una de las principales barreras que puede dificultar más la adopción de OBDA es la falta de herramientas para apoyar la creación de mapeos entre datos y ontologías. El objetivo de esta investigación ha sido desarrollar nuevas herramientas que permitan a expertos sin conocimientos de ontologías la creación de mapeos entre datos y ontologías. Con este fin, se han llevado a cabo dos líneas de trabajo: la generación automática de mapeos entre datos relacionales y ontologías y la edición de los mapeos a través de su representación visual. Las herramientas actualmente disponibles para automatizar la generación de mapeos están lejos de proporcionar una solución completa, ya que se basan en los esquemas relacionales y apenas tienen en cuenta los contenidos de la fuente de datos relacional y las características de la ontología. Sin embargo, los datos pueden contener relaciones ocultas que pueden ayudar a la generación de mapeos. Para superar esta limitación, hemos desarrollado AutoMap4OBDA, un sistema que genera automáticamente mapeos R2RML a partir del análisis de los contenidos de la fuente relacional y teniendo en cuenta las características de la ontología. El sistema emplea una técnica de aprendizaje de ontologías para inferir jerarquías de clases, selecciona las métricas de similitud de cadenas en base a las etiquetas de las ontologías y analiza las estructuras de grafos para generar los mapeos a partir de la estructura de la ontología. La representación visual por medio de interfaces intuitivas puede ayudar a los usuarios sin conocimientos técnicos a establecer mapeos entre una fuente relacional y una ontología. Sin embargo, las herramientas existentes para la edición visual de mapeos muestran algunas limitaciones. En particular, la representación de mapeos no contempla las estructuras de la fuente relacional y de la ontología de forma conjunta. Para superar este inconveniente, hemos desarrollado Map-On, un entorno visual web para la edición manual de mapeos. AutoMap4OBDA ha demostrado que supera las prestaciones de las soluciones existentes para la generación de mapeos. Map-On se ha aplicado en proyectos de investigación para verificar su eficacia en la gestión de mapeos.
Integration of data from heterogeneous formats and domains based on Semantic Web technologies enables us to solve their structural and semantic heterogeneity. Ontology-based data access (OBDA) is a comprehensive solution which relies on the use of ontologies as mediator schemas and relational-to-ontology mappings to facilitate data source querying. However, one of the greatest obstacles in the adoption of OBDA is the lack of tools to support the creation of mappings between physically stored data and ontologies. The objective of this research has been to develop new tools that allow non-ontology experts to create relational-to-ontology mappings. For this purpose, two lines of work have been carried out: the automated generation of relational-to-ontology mappings, and visual support for mapping editing. The tools currently available to automate the generation of mappings are far from providing a complete solution, since they rely on relational schemas and barely take into account the contents of the relational data source and features of the ontology. However, the data may contain hidden relationships that can help in the process of mapping generation. To overcome this limitation, we have developed AutoMap4OBDA, a system that automatically generates R2RML mappings from the analysis of the contents of the relational source and takes into account the characteristics of the ontology. The system employs an ontology learning technique to infer class hierarchies, selects the string similarity metric based on the labels of ontologies, and analyses the graph structures to generate the mappings from the structure of the ontology. Visual representation through intuitive interfaces can help non-technical users to establish mappings between a relational source and an ontology. However, existing tools for visual editing of mappings have some limitations. In particular, the visual representation of mappings does not capture the structure of the relational source and the ontology at the same time. To overcome this problem, we have developed Map-On, a visual web environment for the manual editing of mappings. AutoMap4OBDA has been shown to outperform existing solutions in the generation of mappings. Map-On has been applied in research projects to verify its effectiveness in managing mappings.
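A hedged sketch of the mapping-generation idea described above (not AutoMap4OBDA itself): candidate table-to-class correspondences are proposed by scoring string similarity between table names and ontology class labels. The table names, class labels and acceptance threshold are invented for the example; a real system would emit R2RML from such candidates:

# Hedged sketch: propose relational-table-to-ontology-class correspondences
# by string similarity between table names and class labels.
from difflib import SequenceMatcher

tables = ["BUILDING", "ENERGY_CERTIFICATE", "DWELLING"]
ontology_classes = ["Building", "Dwelling", "EnergyPerformanceCertificate"]

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

candidate_mappings = []
for table in tables:
    best = max(ontology_classes, key=lambda cls: similarity(table, cls))
    score = similarity(table, best)
    if score > 0.5:                        # arbitrary acceptance threshold
        candidate_mappings.append({"table": table, "class": best, "score": round(score, 2)})

print(candidate_mappings)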
APA, Harvard, Vancouver, ISO, and other styles
29

Åberg, Ludvig. "Multimodal Classification of Second-Hand E-Commerce Ads." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-233324.

Full text
Abstract:
In second-hand e-commerce, categorization of new products is typically done by the seller. Automating this process makes it easier to upload ads and could lower the number of incorrectly categorized ads. Automatic ad categorization also makes it possible for a second-hand e-commerce platform to use a more detailed category system, which could make the shopping experience better for potential buyers. Product ad categorization is typically addressed as a text classification problem, as most metadata associated with products are textual. By including image information, i.e. using a multimodal approach, better performance can however be expected. The work done in this thesis evaluates different multimodal deep learning models for the task of ad categorization on data from Blocket.se. We examine late fusion models, where the modalities are combined at decision level, and early fusion models, where the modalities are combined at feature level. We also introduce our own approach, Text Based Visual Attention (TBVA), which extends the image CNN Inception v3 with an attention mechanism to incorporate textual information. For all models evaluated, the text classifier fastText is used to process text data and the Inception v3 network to process image data. Our results show that the late fusion models perform best in our setting. We conclude that these models generally learn which of the baseline models to 'trust', while early fusion and the TBVA models learn more abstract concepts. As future work, we would like to examine how the TBVA models perform on other tasks, such as ad similarity.
Produkter som läggs ut på marknadsplatser, såsom Blocket.se, kategoriseras oftast av säljaren själv. Att automatisera processen för kategorisering gör det därför både enklare och snabbare att lägga upp annonser och kan minska antalet produkter med felaktig kategori. Automatisk kategorisering gör det också möjligt för marknadsplatsen att använda ett mer detaljerat kategorisystem, vilket skulle kunna effektivisera sökandet efter produkter för potentiella köpare. Produktkategorisering adresseras ofta som ett klassificeringsproblem för text, eftersom den största delen av produktinformationen finns i skriftlig form. Genom att också inkludera produktbilder kan vi dock förvänta oss bättre resultat. I den här uppsatsen evalueras olika metoder för att använda både bild och text för annonsklassificering av data från blocket.se. I synnerhet undersöks late fusion modeller, där informationen från modaliteterna kombineras i samband med klassificeringen, samt early fusion modeller, där modaliteterna istället kombineras på en abstrakt nivå innan klassificeringen. Vi introducerar också vår egen modell Text Based Visual Attention (TBVA), en utvidgning av bildklassificeraren Inception v3 [1], som använder en attention mekanism för att inkorporera textinformation. För alla modeller som beskrivs i denna uppsats används textklassificeraren fastText [2] för att processa text och bildklassificeraren Inception v3 för att processa bild. Våra resultat visar att late fusion modeller presterar bäst med vår data. I slutsatsen konstateras att late fusion modellerna lär sig vilka fall den ska 'lita' på text eller bild informationen, där early fusion och TBVA modellerna istället lär sig mer abstrakta koncept. Som framtida arbete tror vi det skulle vara av värde att undersöka hur TBVA modellerna presterar på andra uppgifter, såsom att bedöma likheter mellan annonser.
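The two fusion strategies compared in the thesis can be illustrated with a small, hypothetical PyTorch sketch, with random vectors standing in for fastText and Inception v3 features; the dimensions and classifier heads are assumptions, not the thesis models:

# Hedged sketch contrasting early fusion (concatenate features, then classify)
# with late fusion (classify each modality, then combine predictions).
import torch
import torch.nn as nn

n_classes, text_dim, image_dim = 10, 300, 2048
text_feat = torch.rand(8, text_dim)        # stand-in for fastText sentence vectors
image_feat = torch.rand(8, image_dim)      # stand-in for Inception v3 pool features

# Early fusion: combine at feature level.
early = nn.Sequential(nn.Linear(text_dim + image_dim, 256), nn.ReLU(),
                      nn.Linear(256, n_classes))
early_logits = early(torch.cat([text_feat, image_feat], dim=1))

# Late fusion: combine at decision level.
text_head = nn.Linear(text_dim, n_classes)
image_head = nn.Linear(image_dim, n_classes)
late_probs = 0.5 * (text_head(text_feat).softmax(dim=1)
                    + image_head(image_feat).softmax(dim=1))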
APA, Harvard, Vancouver, ISO, and other styles
30

SAGHIÉ, Najla Fouad. "O ENSINO / APRENDIZAGEM DA LÍNGUA INGLESA NA PERSPECTIVA DA CULTURA VISUAL." Universidade Federal de Goiás, 2008. http://repositorio.bc.ufg.br/tede/handle/tde/2802.

Full text
Abstract:
This work discusses a methodological teaching approach based on the reading and interpretation of images from electronic and print media, with the aim of recognizing linguistic elements (foreign words) linked to the captions of advertisements. Within the educational framework, it suggests English Language teaching integrated with Visual Culture and Art, taking the culture of images into account. In this work we also investigate the mediation process, that is, the teacher's role as a guide in the engagement between images, students and their representations. This dissertation is the result of field research in which we applied an interpretative and prescriptive study of advertising images with twelfth-grade students (two groups of fifteen teenagers) from a private school in Goiânia-GO, as a way of reflecting on verbal language (text written in English) and nonverbal language (practices of looking at images), in order to contribute to the educational context, that is, to the English Language teaching/learning process. The research was supported by theories of Visual Culture, Education, Discourse Analysis and Transdisciplinarity, as argued by several theorists: Maingueneau (2004), Duncun (2003) and Barbosa (2002), among other authors who contributed significantly to the comprehension and analysis developed in this work. We hope this investigation can, in some way, help to deepen interdisciplinary teaching with Visual Culture and the critical reading of visual manifestations.
Este trabalho tem por objetivo discutir uma proposta de abordagem metodológica de ensino com base na leitura e interpretação imagética da mídia eletrônica e impressa, com o intuito de compreender os elementos lingüísticos: estrangeirismos vinculados à legenda da propaganda, em um contexto educacional, sugerindo o ensino de Língua Inglesa integrada à Cultura Visual e Arte, e contemplando a cultura imagética. Neste trabalho, pesquisei, também, os processos de mediação, ou seja, a interferência do professor como orientador do envolvimento entre imagens e alunos, e suas representações. Esse trabalho é resultado de uma pesquisa de campo, no qual apliquei um estudo interpretativo e prescritivo das imagens publicitárias, para alunos de sétimo ano do Ensino Fundamental (dois grupos de quinze alunos), de uma escola particular de Goiânia-Go, como forma de reflexão sobre as linguagens verbal (texto escrito em inglês) e não verbal (práticas do ver - imagens), no sentido de contribuir com o contexto educacional, ou melhor, com o processo ensino/aprendizagem da Língua Inglesa. Essa pesquisa foi apoiada pelas teorias da Cultura Visual, Educação, Análise do Discurso e Transdisciplinaridade, pontuadas por vários teóricos: Maingueneau (2004), Duncun (2003), Barbosa (2002) entre outros autores que contribuíram, significativamente, para a compreensão e análise deste trabalho. Desse modo, espero que esta pesquisa possa, de alguma forma, colaborar para o aprofundamento do ensino interdisciplinar com a Cultura Visual, e da leitura crítica das visualidades.
APA, Harvard, Vancouver, ISO, and other styles
31

Kochukhova, Olga. "When, Where and What : The Development of Perceived Spatio-Temporal Continuity." Doctoral thesis, Uppsala : Acta Universitatis Upsaliensis, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-7760.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Kashyap, Karan. "Learning digits via joint audio-visual representations." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/113143.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 59-60).
Our goal is to explore models for language learning in the manner that humans learn languages as children. Namely, children do not have intermediary text transcriptions in correlating visual and audio inputs from the environment; rather, they directly make connections between what they see and what they hear, sometimes even across languages! In this thesis, we present weakly-supervised models for learning representations of numerical digits between two modalities: speech and images. We experiment with architectures of convolutional neural networks taking in spoken utterances of numerical digits and images of handwritten digits as inputs. In nearly all cases we randomly initialize network weights (without pre-training) and evaluate the model's ability to return a matching image for a spoken input or to identify the number of overlapping digits between an utterance and an image. We also provide some visuals as evidence that our models are truly learning correspondences between the two modalities.
by Karan Kashyap.
M. Eng.
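A minimal, hypothetical sketch of the kind of weakly-supervised audio-visual matching described above: two small encoders map a spoken-digit spectrogram and a digit image into a shared embedding space, and a dot product scores the match. All sizes are illustrative; the thesis uses deeper convolutional networks:

# Hedged sketch: joint audio-visual embeddings scored by a dot product.
import torch
import torch.nn as nn

audio_encoder = nn.Sequential(nn.Flatten(), nn.Linear(40 * 100, 128), nn.ReLU(),
                              nn.Linear(128, 64))
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(),
                              nn.Linear(128, 64))

spectrogram = torch.rand(16, 1, 40, 100)    # batch of spoken-digit spectrograms
image = torch.rand(16, 1, 28, 28)           # batch of handwritten-digit images

a = nn.functional.normalize(audio_encoder(spectrogram), dim=1)
v = nn.functional.normalize(image_encoder(image), dim=1)
match_score = (a * v).sum(dim=1)            # high when utterance and image agree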
APA, Harvard, Vancouver, ISO, and other styles
33

Liu, Li. "Learning discriminative feature representations for visual categorization." Thesis, University of Sheffield, 2015. http://etheses.whiterose.ac.uk/8239/.

Full text
Abstract:
Learning discriminative feature representations has attracted a great deal of attention due to its potential value and wide usage in a variety of areas, such as image/video recognition and retrieval, human activities analysis, intelligent surveillance and human-computer interaction. In this thesis we first introduce a new boosted key-frame selection scheme for action recognition. Specifically, we propose to select a subset of key poses for the representation of each action via AdaBoost, and a new classifier, namely WLNBNN, is then developed for final classification. The experimental results of the proposed method are 0.6% - 13.2% better than previous work. After that, a domain-adaptive learning approach based on multiobjective genetic programming (MOGP) has been developed for image classification. In this method, a set of primitive 2-D operators is randomly combined to construct feature descriptors through MOGP evolution, which are then evaluated by two objective fitness criteria, i.e., the classification error and the tree complexity. The (near-)optimal feature descriptor can then be obtained. The proposed approach can achieve 0.9% ∼ 25.9% better performance compared with state-of-the-art methods. Moreover, effective dimensionality reduction algorithms have also been widely used for obtaining better representations. In this thesis, we have proposed a novel linear unsupervised algorithm, termed Discriminative Partition Sparsity Analysis (DPSA), which explicitly considers the different probabilistic distributions that exist over the data points while simultaneously preserving the natural locality relationship among the data. All of the above methods have been systematically evaluated on several public datasets, showing their accurate and robust performance (0.44% - 6.69% better than previous methods) for action and image categorization. Targeting efficient image classification, we also introduce a novel unsupervised framework termed evolutionary compact embedding (ECE) which can automatically learn task-specific binary hash codes. It is regarded as an optimization algorithm which combines genetic programming (GP) with a boosting trick. The experimental results show that ECE significantly outperforms other methods by 1.58% - 2.19% on classification tasks. In addition, a supervised framework, bilinear local feature hashing (BLFH), has also been proposed to learn highly discriminative binary codes on the local descriptors for large-scale image similarity search. We address it as a nonconvex optimization problem to seek orthogonal projection matrices for hashing, which can successfully preserve the pairwise similarity between different local features and simultaneously take image-to-class (I2C) distances into consideration. BLFH produces outstanding results (0.017% - 0.149% better) compared to the state-of-the-art hashing techniques.
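The general binary-hashing idea behind frameworks such as ECE and BLFH (though not those algorithms themselves) can be sketched as follows, with a random projection standing in for a learned one and all dimensions illustrative:

# Hedged sketch: sign of projected descriptors gives compact binary codes whose
# Hamming distance approximates similarity for fast retrieval.
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((1000, 128))         # image descriptors
projection = rng.standard_normal((128, 32))         # stand-in for a learned projection

codes = (features @ projection > 0).astype(np.uint8)    # 32-bit binary codes

def hamming(a, b):
    return int(np.count_nonzero(a != b))

query = codes[0]
nearest = min(range(1, len(codes)), key=lambda i: hamming(query, codes[i]))
print("nearest neighbour of item 0:", nearest)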
APA, Harvard, Vancouver, ISO, and other styles
34

Rahim, Medhat H., and Radcliffe Siddo. "The use of visualization for learning and teaching mathematics." Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2012. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-80852.

Full text
Abstract:
In this article, based on Dissection-Motion-Operations, DMO (decomposing a figure into several pieces and composing the resulting pieces into a new figure of equal area), a set of visual representations (models) of mathematical concepts will be introduced. The visual models can be produced through manipulatives and GSP/Cabri computer software. They are based on van Hiele's levels of thought development (van Hiele, 1989); in particular, Level 2 (Informal Deductive Reasoning) and Level 3 (Deductive Reasoning). The basic theme for these models has been visual learning and understanding through manipulatives and computer representations of mathematical concepts vs. rote learning and memorization. The three geometric transformations or motions (translation, rotation, reflection) and their possible combinations were used; they are illustrated in several texts. As well, a set of three commonly used dissections or decompositions (Eves, 1972) of objects was utilized.
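A small sketch of the three motions named above, applied to the vertices of a triangle; the shoelace check at the end confirms that such motions preserve area, which is what dissection-motion operations rely on (the coordinates are arbitrary):

# Hedged sketch: translation, rotation and reflection preserve area.
import numpy as np

triangle = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]])

def translate(pts, dx, dy):
    return pts + np.array([dx, dy])

def rotate(pts, theta):
    c, s = np.cos(theta), np.sin(theta)
    return pts @ np.array([[c, -s], [s, c]]).T

def reflect_x(pts):
    return pts * np.array([1.0, -1.0])       # reflection across the x-axis

moved = reflect_x(rotate(translate(triangle, 1.0, 2.0), np.pi / 2))

def area(pts):                                # shoelace formula
    x, y = pts[:, 0], pts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

assert abs(area(triangle) - area(moved)) < 1e-9   # motions preserve area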
APA, Harvard, Vancouver, ISO, and other styles
35

Doersch, Carl. "Supervision Beyond Manual Annotations for Learning Visual Representations." Research Showcase @ CMU, 2016. http://repository.cmu.edu/dissertations/787.

Full text
Abstract:
For both humans and machines, understanding the visual world requires relating new percepts with past experience. We argue that a good visual representation for an image should encode what makes it similar to other images, enabling the recall of associated experiences. Current machine implementations of visual representations can capture some aspects of similarity, but fall far short of human ability overall. Even if one explicitly labels objects in millions of images to tell the computer what should be considered similar—a very expensive procedure—the labels still do not capture everything that might be relevant. This thesis shows that one can often train a representation which captures similarity beyond what is labeled in a given dataset. That means we can begin with a dataset that has uninteresting labels, or no labels at all, and still build a useful representation. To do this, we propose using pretext tasks: tasks that are not useful in and of themselves, but serve as an excuse to learn a more general-purpose representation. The labels for a pretext task can be inexpensive or even free. Furthermore, since this approach assumes training labels differ from the desired outputs, it can handle output spaces where the correct answer is ambiguous, and therefore impossible to annotate by hand. The thesis explores two broad classes of supervision. The first is weak image-level supervision, which is exploited to train mid-level discriminative patch classifiers. For example, given a dataset of street-level imagery labeled only with GPS coordinates, patch classifiers are trained to differentiate one specific geographical region (e.g. the city of Paris) from others. The resulting classifiers each automatically collect and associate a set of patches which all depict the same distinctive architectural element. In this way, we can learn to detect elements like balconies, signs, and lamps without annotations. The second type of supervision requires no information about images other than the pixels themselves. Instead, the algorithm is trained to predict the context around image patches. The context serves as a sort of weak label: to predict well, the algorithm must associate similar-looking patches which also have similar contexts. After training, the feature representation learned using this within-image context indeed captures visual similarity across images, which ultimately makes it useful for real tasks like object detection and geometry estimation.
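The context-prediction pretext task mentioned in this abstract can be sketched in a few lines: sample a central patch and one of its eight neighbours from an unlabeled image, and use the neighbour's relative position as a free training label. The patch size, offsets and image are illustrative assumptions:

# Hedged sketch: generating a (patch pair, relative-position label) training example
# for the context-prediction pretext task.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))            # any unlabeled image
patch = 32
offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

cy, cx = 128, 128                            # top-left corner of the central patch
label = rng.integers(len(offsets))           # which neighbour we sample = the target
dy, dx = offsets[label]
ny, nx = cy + dy * patch, cx + dx * patch

center_patch = image[cy:cy + patch, cx:cx + patch]
neighbour_patch = image[ny:ny + patch, nx:nx + patch]
# A network would now be trained to predict `label` from the pair of patches.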
APA, Harvard, Vancouver, ISO, and other styles
36

Jankowska, Gierus Bogumila. "Learning with visual representations through cognitive load theory." Thesis, McGill University, 2011. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=104827.

Full text
Abstract:
This study examined two different strategies of learning with diagrams: drawing diagrams while learning or learning from pre-constructed diagrams. One hundred ninety-six junior high school students were randomly placed in a condition either to draw while learning about how airplanes fly or to study from pre-constructed diagrams. Before the learning, students' prior knowledge and elaboration strategies were measured. During learning in either condition, students reported their mental effort. Afterwards, students' learning was tested on both a similar task and a transfer task. Cook's (2006) theoretical framework, which combines prior knowledge and cognitive load theory on visual representations in science education, was used to analyze the results. Results showed that students' mental effort significantly increased in the drawing condition, yet results on the posttest were mixed. Students did not do better, and sometimes did worse, on the posttest measures when they learned by drawing diagrams versus using pre-constructed diagrams to learn. The exception was that students with low initial prior knowledge did do better. Elaboration strategies did not have an effect on students' achievement or mental effort in either condition.
Cette étude a examiné deux stratégies différentes d'apprendre à l'aide des diagrammes: le dessin de diagrammes tout en apprenant ou en apprenant sur la base des diagrammes préconstruits. Cent quatre-vingt-seize étudiants de lycée ont été aléatoirement placés dans une condition où soit ils dessinaient tout en se renseignant sur la façon dont les avions volent ou étudiaient à partir des diagrammes préconstruits. Avant l'étude, les stratégies de connaissance et d'élaboration des étudiants ont été vérifiées. Pendant l'étude sous l'une ou l'autre des conditions, les étudiants signalaient leur effort mental. Suite à cela, l'étude des étudiants est examinée sur une tâche semblable et une tâche de transfert. Cadre théorique de Cook (2006), qui combine la théorie de la connaissance antérieure et de charge cognitive sur les représentations visuelles dans l'éducation de la science, ont été employés pour analyser les résultats. Les résultats ont prouvé que l'effort mental des étudiants a augmenté sensiblement sous condition de dessin, pourtant les résultats sur le post-test étaient mitigés. En règle générale, les étudiants ont fait plus ou moins mauvais sur les mesures de post-test quand ils ont appris en traçant des diagrammes au contraire de l'utilisation des diagrammes préconstruits pour apprendre. Cependant, les étudiants ayant une faible connaissance de base ont mieux exécuté le post-test en traçant leurs propres diagrammes. Les stratégies d'élaborations n'ont pas exercé d' effet sur l'accomplissement ou l'effort mental des étudiants pour chacune des conditions.
APA, Harvard, Vancouver, ISO, and other styles
37

Parekh, Sanjeel. "Learning representations for robust audio-visual scene analysis." Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLT015/document.

Full text
Abstract:
L'objectif de cette thèse est de concevoir des algorithmes qui permettent la détection robuste d’objets et d’événements dans des vidéos en s’appuyant sur une analyse conjointe de données audio et visuelle. Ceci est inspiré par la capacité remarquable des humains à intégrer les caractéristiques auditives et visuelles pour améliorer leur compréhension de scénarios bruités. À cette fin, nous nous appuyons sur deux types d'associations naturelles entre les modalités d'enregistrements audiovisuels (réalisés à l'aide d'un seul microphone et d'une seule caméra), à savoir la corrélation mouvement/audio et la co-occurrence apparence/audio. Dans le premier cas, nous utilisons la séparation de sources audio comme application principale et proposons deux nouvelles méthodes dans le cadre classique de la factorisation par matrices non négatives (NMF). L'idée centrale est d'utiliser la corrélation temporelle entre l'audio et le mouvement pour les objets / actions où le mouvement produisant le son est visible. La première méthode proposée met l'accent sur le couplage flexible entre les représentations audio et de mouvement capturant les variations temporelles, tandis que la seconde repose sur la régression intermodale. Nous avons séparé plusieurs mélanges complexes d'instruments à cordes en leurs sources constituantes en utilisant ces approches.Pour identifier et extraire de nombreux objets couramment rencontrés, nous exploitons la co-occurrence apparence/audio dans de grands ensembles de données. Ce mécanisme d'association complémentaire est particulièrement utile pour les objets où les corrélations basées sur le mouvement ne sont ni visibles ni disponibles. Le problème est traité dans un contexte faiblement supervisé dans lequel nous proposons un framework d’apprentissage de représentation pour la classification robuste des événements audiovisuels, la localisation des objets visuels, la détection des événements audio et la séparation de sources.Nous avons testé de manière approfondie les idées proposées sur des ensembles de données publics. Ces expériences permettent de faire un lien avec des phénomènes intuitifs et multimodaux que les humains utilisent dans leur processus de compréhension de scènes audiovisuelles
The goal of this thesis is to design algorithms that enable robust detection of objects and events in videos through joint audio-visual analysis. This is motivated by humans' remarkable ability to meaningfully integrate auditory and visual characteristics for perception in noisy scenarios. To this end, we identify two kinds of natural associations between the modalities in recordings made using a single microphone and camera, namely motion-audio correlation and appearance-audio co-occurrence. For the former, we use audio source separation as the primary application and propose two novel methods within the popular non-negative matrix factorization framework. The central idea is to utilize the temporal correlation between audio and motion for objects/actions where the sound-producing motion is visible. The first proposed method focuses on soft coupling between audio and motion representations capturing temporal variations, while the second is based on cross-modal regression. We segregate several challenging audio mixtures of string instruments into their constituent sources using these approaches. To identify and extract many commonly encountered objects, we leverage appearance–audio co-occurrence in large datasets. This complementary association mechanism is particularly useful for objects where motion-based correlations are not visible or available. The problem is dealt with in a weakly-supervised setting wherein we design a representation learning framework for robust AV event classification, visual object localization, audio event detection and source separation. We extensively test the proposed ideas on publicly available datasets. The experiments demonstrate several intuitive multimodal phenomena that humans utilize on a regular basis for robust scene understanding.
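At the core of the separation methods described above is non-negative matrix factorization; a minimal NumPy sketch with standard multiplicative updates is shown below, with the audio-motion coupling terms of the thesis omitted and all sizes illustrative:

# Hedged sketch: NMF of a magnitude spectrogram V into W @ H with
# Euclidean-distance multiplicative updates.
import numpy as np

rng = np.random.default_rng(0)
V = rng.random((257, 200)) + 1e-6            # magnitude spectrogram (freq x time)
k = 10                                       # number of spectral components
W = rng.random((V.shape[0], k)) + 1e-6
H = rng.random((k, V.shape[1])) + 1e-6

for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-12)

print("reconstruction error:", np.linalg.norm(V - W @ H))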
APA, Harvard, Vancouver, ISO, and other styles
38

Parekh, Sanjeel. "Learning representations for robust audio-visual scene analysis." Electronic Thesis or Diss., Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLT015.

Full text
Abstract:
L'objectif de cette thèse est de concevoir des algorithmes qui permettent la détection robuste d’objets et d’événements dans des vidéos en s’appuyant sur une analyse conjointe de données audio et visuelle. Ceci est inspiré par la capacité remarquable des humains à intégrer les caractéristiques auditives et visuelles pour améliorer leur compréhension de scénarios bruités. À cette fin, nous nous appuyons sur deux types d'associations naturelles entre les modalités d'enregistrements audiovisuels (réalisés à l'aide d'un seul microphone et d'une seule caméra), à savoir la corrélation mouvement/audio et la co-occurrence apparence/audio. Dans le premier cas, nous utilisons la séparation de sources audio comme application principale et proposons deux nouvelles méthodes dans le cadre classique de la factorisation par matrices non négatives (NMF). L'idée centrale est d'utiliser la corrélation temporelle entre l'audio et le mouvement pour les objets / actions où le mouvement produisant le son est visible. La première méthode proposée met l'accent sur le couplage flexible entre les représentations audio et de mouvement capturant les variations temporelles, tandis que la seconde repose sur la régression intermodale. Nous avons séparé plusieurs mélanges complexes d'instruments à cordes en leurs sources constituantes en utilisant ces approches.Pour identifier et extraire de nombreux objets couramment rencontrés, nous exploitons la co-occurrence apparence/audio dans de grands ensembles de données. Ce mécanisme d'association complémentaire est particulièrement utile pour les objets où les corrélations basées sur le mouvement ne sont ni visibles ni disponibles. Le problème est traité dans un contexte faiblement supervisé dans lequel nous proposons un framework d’apprentissage de représentation pour la classification robuste des événements audiovisuels, la localisation des objets visuels, la détection des événements audio et la séparation de sources.Nous avons testé de manière approfondie les idées proposées sur des ensembles de données publics. Ces expériences permettent de faire un lien avec des phénomènes intuitifs et multimodaux que les humains utilisent dans leur processus de compréhension de scènes audiovisuelles
The goal of this thesis is to design algorithms that enable robust detection of objects and events in videos through joint audio-visual analysis. This is motivated by humans' remarkable ability to meaningfully integrate auditory and visual characteristics for perception in noisy scenarios. To this end, we identify two kinds of natural associations between the modalities in recordings made using a single microphone and camera, namely motion-audio correlation and appearance-audio co-occurrence. For the former, we use audio source separation as the primary application and propose two novel methods within the popular non-negative matrix factorization framework. The central idea is to utilize the temporal correlation between audio and motion for objects/actions where the sound-producing motion is visible. The first proposed method focuses on soft coupling between audio and motion representations capturing temporal variations, while the second is based on cross-modal regression. We segregate several challenging audio mixtures of string instruments into their constituent sources using these approaches. To identify and extract many commonly encountered objects, we leverage appearance–audio co-occurrence in large datasets. This complementary association mechanism is particularly useful for objects where motion-based correlations are not visible or available. The problem is dealt with in a weakly-supervised setting wherein we design a representation learning framework for robust AV event classification, visual object localization, audio event detection and source separation. We extensively test the proposed ideas on publicly available datasets. The experiments demonstrate several intuitive multimodal phenomena that humans utilize on a regular basis for robust scene understanding.
APA, Harvard, Vancouver, ISO, and other styles
39

Silberer, Carina Helga. "Learning visually grounded meaning representations." Thesis, University of Edinburgh, 2015. http://hdl.handle.net/1842/14236.

Full text
Abstract:
Humans possess a rich semantic knowledge of words and concepts which captures the perceivable physical properties of their real-world referents and their relations. Encoding this knowledge or some of its aspects is the goal of computational models of semantic representation and has been the subject of considerable research in cognitive science, natural language processing, and related areas. Existing models have placed emphasis on different aspects of meaning, depending ultimately on the task at hand. Typically, such models have been used in tasks addressing the simulation of behavioural phenomena, e.g., lexical priming or categorisation, as well as in natural language applications, such as information retrieval, document classification, or semantic role labelling. A major strand of research popular across disciplines focuses on models which induce semantic representations from text corpora. These models are based on the hypothesis that the meaning of words is established by their distributional relation to other words (Harris, 1954). Despite their widespread use, distributional models of word meaning have been criticised as ‘disembodied’ in that they are not grounded in perception and action (Perfetti, 1998; Barsalou, 1999; Glenberg and Kaschak, 2002). This lack of grounding contrasts with many experimental studies suggesting that meaning is acquired not only from exposure to the linguistic environment but also from our interaction with the physical world (Landau et al., 1998; Bornstein et al., 2004). This criticism has led to the emergence of new models aiming at inducing perceptually grounded semantic representations. Essentially, existing approaches learn meaning representations from multiple views corresponding to different modalities, i.e. linguistic and perceptual input. To approximate the perceptual modality, previous work has relied largely on semantic attributes collected from humans (e.g., is round, is sour), or on automatically extracted image features. Semantic attributes have a long-standing tradition in cognitive science and are thought to represent salient psychological aspects of word meaning including multisensory information. However, their elicitation from human subjects limits the scope of computational models to a small number of concepts for which attributes are available. In this thesis, we present an approach which draws inspiration from the successful application of attribute classifiers in image classification, and represent images and the concepts depicted by them by automatically predicted visual attributes. To this end, we create a dataset comprising nearly 700K images and a taxonomy of 636 visual attributes and use it to train attribute classifiers. We show that their predictions can act as a substitute for human-produced attributes without any critical information loss. In line with the attribute-based approximation of the visual modality, we represent the linguistic modality by textual attributes which we obtain with an off-the-shelf distributional model. Having first established this core contribution of a novel modelling framework for grounded meaning representations based on semantic attributes, we show that these can be integrated into existing approaches to perceptually grounded representations. We then introduce a model which is formulated as a stacked autoencoder (a variant of multilayer neural networks), which learns higher-level meaning representations by mapping words and images, represented by attributes, into a common embedding space. 
In contrast to most previous approaches to multimodal learning using different variants of deep networks and data sources, our model is defined at a finer level of granularity—it computes representations for individual words and is unique in its use of attributes as a means of representing the textual and visual modalities. We evaluate the effectiveness of the representations learnt by our model by assessing its ability to account for human behaviour on three semantic tasks, namely word similarity, concept categorisation, and typicality of category members. With respect to the word similarity task, we focus on the model’s ability to capture similarity in both the meaning and appearance of the words’ referents. Since existing benchmark datasets on word similarity do not distinguish between these two dimensions and often contain abstract words, we create a new dataset in a large-scale experiment where participants are asked to give two ratings per word pair expressing their semantic and visual similarity, respectively. Experimental results show that our model learns meaningful representations which are more accurate than models based on individual modalities or different modality integration mechanisms. The presented model is furthermore able to predict textual attributes for new concepts given their visual attribute predictions only, which we demonstrate by comparing model output with human generated attributes. Finally, we show the model’s effectiveness in an image-based task on visual category learning, in which images are used as a stand-in for real-world objects.
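A minimal, hypothetical PyTorch sketch of the bimodal autoencoder idea described above: textual and visual attribute vectors are each encoded, fused into a common embedding, and both inputs are reconstructed from it. The dimensions are illustrative (636 matches the visual attribute taxonomy mentioned in the abstract, the rest are assumptions), and the thesis stacks and fine-tunes such layers rather than using this single-layer form:

# Hedged sketch: a bimodal autoencoder mapping textual and visual attributes
# into a common embedding space and reconstructing both modalities.
import torch
import torch.nn as nn

text_dim, vis_dim, hid, joint = 100, 636, 64, 32
enc_text, enc_vis = nn.Linear(text_dim, hid), nn.Linear(vis_dim, hid)
fuse = nn.Linear(2 * hid, joint)
defuse = nn.Linear(joint, 2 * hid)
dec_text, dec_vis = nn.Linear(hid, text_dim), nn.Linear(hid, vis_dim)

text_attrs, vis_attrs = torch.rand(5, text_dim), torch.rand(5, vis_dim)
h = torch.cat([torch.sigmoid(enc_text(text_attrs)),
               torch.sigmoid(enc_vis(vis_attrs))], dim=1)
embedding = torch.sigmoid(fuse(h))                      # shared multimodal code
h_t, h_v = torch.sigmoid(defuse(embedding)).split(hid, dim=1)
loss = nn.functional.mse_loss(dec_text(h_t), text_attrs) \
     + nn.functional.mse_loss(dec_vis(h_v), vis_attrs)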
APA, Harvard, Vancouver, ISO, and other styles
40

Robert, Thomas. "Improving Latent Representations of ConvNets for Visual Understanding." Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS343.

Full text
Abstract:
Depuis le début de la décennie, les réseaux de neurones convolutifs profonds pour le traitement d'images ont démontré leur capacité à produire d'excellent résultats. Pour cela, ces modèles transforment une image en une succession de représentations latentes. Dans cette thèse, nous travaillerons à l'amélioration de la qualité de ces représentations latentes. Dans un premier temps, nous travaillons à la régularisation de ces représentations pour les rendre plus robustes aux variations intra-classe et améliorer les performances de classification via une pénalité basée sur des métriques liées à la théorie de l'information. Dans un second temps, nous proposons de structurer l'information en deux sous-espaces latents complémentaires, résolvant un conflit entre l'invariance des représentations et la reconstruction. La structuration en deux espaces permet ainsi de relâcher la contrainte posée par les architectures classiques, permettant ainsi d'obtenir de meilleurs résultats en classification semi-supervisé. Enfin, nous nous intéressons au disentangling, c'est-à-dire la séparation de facteurs sémantiques indépendants. Nous poursuivons nos travaux de structuration des espaces latent et utilisons des coûts adverses pour assurer une séparation efficace de l'information. Cela permet d'améliorer la qualité des représentations ainsi que l'édition sémantique d'images
For a decade now, convolutional deep neural networks have demonstrated their ability to produce excellent results for computer vision. For this, these models transform the input image into a series of latent representations. In this thesis, we work on improving the "quality" of the latent representations of ConvNets for different tasks. First, we work on regularizing those representations to increase their robustness toward intra-class variations and thus improve their performance for classification. To do so, we develop a loss based on information theory metrics to decrease the entropy conditioned on the class. Then, we propose to structure the information in two complementary latent spaces, solving a conflict between the invariance of the representations and the reconstruction task. This structure relaxes the constraint posed by classical architectures, leading to better results in the context of semi-supervised learning. Finally, we address the problem of disentangling, i.e. explicitly separating and representing independent factors of variation of the dataset. We pursue our work on structuring the latent spaces and use adversarial costs to ensure an effective separation of the information. This improves the quality of the representations and enables semantic image editing.
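The two-subspace idea described above can be sketched as follows: one half of the latent code is trained to be class-discriminative while the full code is used for reconstruction, relaxing the conflict between invariance and reconstruction. The sizes and loss weights are assumptions, not the thesis architecture:

# Hedged sketch: latent code split into an invariant half (classified) and a
# residual half, with reconstruction using the full code.
import torch
import torch.nn as nn

x = torch.rand(8, 784)                       # flattened input images
y = torch.randint(0, 10, (8,))               # class labels

encoder = nn.Linear(784, 64)                 # latent = [invariant 32 | residual 32]
classifier = nn.Linear(32, 10)               # sees only the invariant half
decoder = nn.Linear(64, 784)                 # sees the full code

z = torch.relu(encoder(x))
z_inv, z_res = z[:, :32], z[:, 32:]
loss = nn.functional.cross_entropy(classifier(z_inv), y) \
     + 0.5 * nn.functional.mse_loss(decoder(z), x)
loss.backward()                              # one optimization step would follow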
APA, Harvard, Vancouver, ISO, and other styles
41

Fernández, López Adriana. "Learning of meaningful visual representations for continuous lip-reading." Doctoral thesis, Universitat Pompeu Fabra, 2021. http://hdl.handle.net/10803/671206.

Full text
Abstract:
In recent decades, there has been an increased interest in decoding speech exclusively using visual cues, i.e. mimicking the human capability to perform lip-reading, leading to Automatic Lip-Reading (ALR) systems. However, it is well known that access to speech through the visual channel is subject to many limitations when compared to the audio channel; it has been argued that humans can actually read around 30% of the information from the lips, and the rest is filled in from context. Thus, one of the main challenges in ALR resides in the visual ambiguities that arise at the word level, highlighting that not all sounds that we hear can be easily distinguished by observing the lips. In the literature, early ALR systems addressed simple recognition tasks such as alphabet or digit recognition but progressively shifted to more complex and realistic settings, leading to several recent systems that target continuous lip-reading. To a large extent, these advances have been possible thanks to the construction of powerful systems based on deep learning architectures that have quickly started to replace traditional systems. Although the recognition rates for continuous lip-reading may appear modest in comparison to those achieved by audio-based systems, the field has undeniably taken a step forward. Interestingly, an analogous effect can be observed when humans try to decode speech: given sufficiently clean signals, most people can effortlessly decode the audio channel but would struggle to perform lip-reading, since the ambiguity of the visual cues makes it necessary to use further context to decode the message. In this thesis, we explore the appropriate modeling of visual representations with the aim of improving continuous lip-reading. To this end, we present different data-driven mechanisms to handle the main challenges in lip-reading related to the ambiguities or the speaker dependency of visual cues. Our results highlight the benefits of a proper encoding of the visual channel, for which the most useful features are those that encode corresponding lip positions in a similar way, independently of the speaker. This fact opens the door to i) lip-reading in many different languages without requiring large-scale datasets, and ii) increasing the contribution of the visual channel in audio-visual speech systems. On the other hand, our experiments point to the modeling of temporal context as the key to advancing the field, where there is a need for ALR models that are trained on datasets comprising large speech variability at several context levels. In this thesis, we show that both proper modeling of visual representations and the ability to retain context at several levels are necessary conditions to build successful lip-reading systems.
En les darreres dècades, hi ha hagut un interès creixent en la descodificació de la parla utilitzant exclusivament senyals visuals, és a dir, imitant la capacitat humana de llegir els llavis, donant lloc a sistemes de lectura automàtica de llavis (ALR). No obstant això, se sap que l’accés a la parla a través del canal visual està subjecte a moltes limitacions en comparació amb el senyal acústic, és a dir, s’ha argumentat que els humans poden llegir al voltant del 30% de la informació dels llavis, i la resta es completa fent servir el context. Així, un dels principals reptes de l’ALR resideix en les ambigüitats visuals que sorgeixen a escala de paraula, destacant que no tots els sons que escoltem es poden distingir fàcilment observant els llavis. A la literatura, els primers sistemes ALR van abordar tasques de reconeixement senzilles, com ara el reconeixement de l’alfabet o els dígits, però progressivament van passar a entorns més complexos i realistes que han conduït a diversos sistemes recents dirigits a la lectura contínua dels llavis. En gran manera, aquests avenços han estat possibles gràcies a la construcció de sistemes potents basats en arquitectures d’aprenentatge profund que han començat a substituir ràpidament els sistemes tradicionals. Tot i que les taxes de reconeixement de la lectura contínua dels llavis poden semblar modestes en comparació amb les assolides pels sistemes basats en àudio, és evident que el camp ha fet un pas endavant. Curiosament, es pot observar un efecte anàleg quan els humans intenten descodificar la parla: donats senyals sense soroll, la majoria de la gent pot descodificar el canal d’àudio sense esforç, però tindria dificultats per llegir els llavis, ja que l’ambigüitat dels senyals visuals fa necessari l’ús de context addicional per descodificar el missatge. En aquesta tesi explorem el modelatge adequat de representacions visuals amb l’objectiu de millorar la lectura contínua dels llavis. Amb aquest objectiu, presentem diferents mecanismes basats en dades per fer front als principals reptes de la lectura de llavis relacionats amb les ambigüitats o la dependència dels parlants dels senyals visuals. Els nostres resultats destaquen els avantatges d’una correcta codificació del canal visual, per a la qual les característiques més útils són aquelles que codifiquen les posicions corresponents dels llavis d’una manera similar, independentment de l’orador. Aquest fet obre la porta a i) la lectura de llavis en molts idiomes diferents sense necessitat de conjunts de dades a gran escala, i ii) a l’augment de la contribució del canal visual en sistemes de parla audiovisuals. D’altra banda, els nostres experiments identifiquen una tendència a centrar-se en la modelització del context temporal com la clau per avançar en el camp, on hi ha la necessitat de models d’ALR que s’entrenin en conjunts de dades que incloguin una gran variabilitat de la parla a diversos nivells de context. En aquesta tesi, demostrem que tant el modelatge adequat de les representacions visuals com la capacitat de retenir el context a diversos nivells són condicions necessàries per construir sistemes de lectura de llavis amb èxit.
APA, Harvard, Vancouver, ISO, and other styles
42

Evans, Benjamin D. "Learning transformation-invariant visual representations in spiking neural networks." Thesis, University of Oxford, 2012. https://ora.ox.ac.uk/objects/uuid:15bdf771-de28-400e-a1a7-82228c7f01e4.

Full text
Abstract:
This thesis aims to understand the learning mechanisms which underpin the process of visual object recognition in the primate ventral visual system. The computational crux of this problem lies in the ability to retain the specificity needed to recognize particular objects or faces, while exhibiting generality across natural variations and distortions in the view (DiCarlo et al., 2012). In particular, the work presented is focussed on gaining insight into the processes through which transformation-invariant visual representations may develop in the primate ventral visual system. The primary motivation for this work is the belief that some of the fundamental mechanisms employed in the primate visual system may only be captured through modelling the individual action potentials of neurons, and that existing rate-coded models therefore constitute an inadequate level of description for fully understanding the learning processes of visual object recognition. To this end, spiking neural network models are formulated and applied to the problem of learning transformation-invariant visual representations, using a spike-time-dependent learning rule to adjust the synaptic efficacies between the neurons. The ways in which the existing rate-coded CT (Stringer et al., 2006) and Trace (Földiák, 1991) learning mechanisms may operate in a simple spiking neural network model are explored, and these findings are then applied to a more accurate model using realistic 3-D stimuli. Three mechanisms are then examined through which a spiking neural network may solve the problem of learning separate transformation-invariant representations in scenes composed of multiple stimuli, by temporally segmenting competing input representations. The spike-time-dependent plasticity in the feed-forward connections is then shown to be able to exploit these input-layer dynamics to form individual stimulus representations in the output layer. Finally, the work is evaluated and future directions of investigation are proposed.
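For orientation, the trace rule referenced above (Földiák, 1991) can be sketched in a few lines. This is a generic rate-coded illustration under our own notation (learning rate `eta`, trace decay `delta`), not the spiking implementation developed in the thesis:

```python
import numpy as np

def trace_rule_update(w, x, y_trace_prev, y, eta=0.01, delta=0.8):
    """One step of the trace learning rule (Foldiak, 1991), rate-coded sketch.

    The postsynaptic trace blends current activity with past activity, so
    weights strengthen onto inputs that co-occur with a temporally smoothed
    output, encouraging invariance across successive transformed views.
    """
    y_trace = delta * y_trace_prev + (1.0 - delta) * y   # leaky temporal trace
    w = w + eta * np.outer(y_trace, x)                    # Hebbian update with trace
    return w, y_trace

# Toy usage: 5 output cells, 10 input cells, a short sequence of transformed views.
rng = np.random.default_rng(0)
w = rng.normal(scale=0.01, size=(5, 10))
y_trace = np.zeros(5)
for _ in range(3):
    x = rng.random(10)        # input firing rates for one view
    y = w @ x                 # linear output activity (sketch)
    w, y_trace = trace_rule_update(w, x, y_trace, y)
```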
APA, Harvard, Vancouver, ISO, and other styles
43

Feng, Zeyu. "Learning Deep Representations from Unlabelled Data for Visual Recognition." Thesis, The University of Sydney, 2021. https://hdl.handle.net/2123/26876.

Full text
Abstract:
Self-supervised learning (SSL) aims to extract transferable semantic features from abundant unlabelled images; these features benefit various downstream visual tasks by reducing their sample complexity when human-annotated labels are scarce. SSL is promising because it also boosts performance in diverse tasks when combined with existing techniques. It is therefore important to study how SSL leads to better transferability and to design novel SSL methods. To this end, this thesis proposes several methods to improve SSL and its use in downstream tasks. We begin by investigating the effect of the unlabelled training data and introduce an information-theoretical constraint for SSL from multiple related domains. In contrast to training on a conventional single dataset, exploiting multiple domains decreases the built-in bias of each individual domain and allows knowledge transfer across domains; the learned representation is thus less biased and more transferable. Next, we describe a feature decoupling (FD) framework that incorporates invariance into transformation prediction, one main category of SSL methods, motivated by the observation that such methods often lead to co-variant features unfavourable for transfer. Our model learns a split representation that contains both transformation-related and transformation-unrelated parts. FD achieves state-of-the-art results on SSL benchmarks. We also present a multi-task method, with a theoretical analysis, for contrastive learning, the other main category of SSL, leveraging the semantic information in synthetic images to facilitate the learning of class-related semantics. Finally, we explore self-supervision in open-set unsupervised classification with knowledge of a source domain. We propose to enforce consistency under transformations of the target data and to discover pseudo-labels from confident predictions. Experimental results outperform state-of-the-art open-set domain adaptation methods.
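As a point of reference for the contrastive-learning family mentioned above, a generic InfoNCE-style loss over two augmented views can be sketched as below. This is a minimal, hedged sketch of the standard formulation, not the thesis's FD or multi-task method; the temperature value and embedding sizes are illustrative:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """Generic contrastive (InfoNCE) loss over two augmented views.

    z1, z2: (N, D) embeddings of two views of the same N images.
    Matching rows are positives; all other rows act as negatives.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                    # (N, N) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)  # positive index = own row
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings standing in for two augmented views.
z_a, z_b = torch.randn(8, 128), torch.randn(8, 128)
loss = info_nce_loss(z_a, z_b)
```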
APA, Harvard, Vancouver, ISO, and other styles
44

Al, chanti Dawood. "Analyse Automatique des Macro et Micro Expressions Faciales : Détection et Reconnaissance par Machine Learning." Thesis, Université Grenoble Alpes (ComUE), 2019. http://www.theses.fr/2019GREAT058.

Full text
Abstract:
L’analyse automatique des expressions faciales représente à l’heure actuelle une problématique importante associée à de multiples applications telles que la reconnaissance de visages ou encore les interactions homme machine. Dans cette thèse, nous nous attaquons au problème de la reconnaissance d’expressions faciales à partir d’une image ou d’une séquence d’images. Nous abordons le problème sous trois angles.Tout d’abord, nous étudions les macro-expressions faciales et nous proposons de comparer l’efficacité de trois descripteurs différents. Cela conduit au développement d’un algorithme de reconnaissance d’expressions basé sur des descripteurs bas niveau encodés dans un modèle de type sac de mots, puis d’un algorithme basé sur des descripteurs de moyen niveau associés à une représentation éparse et enfin d’un algorithme d’apprentissage profond tenant compte de descripteurs haut niveau. Notre objectif lors de la comparaison de ces trois algorithmes est de trouver la représentation des informations de visages la plus discriminante pour reconnaitre des expressions faciales en étant donc capable de s’affranchir des sources de variabilités que sont les facteurs de variabilité intrinsèques tels que l’apparence du visage ou la manière de réaliser une expression donnée et les facteurs de variabilité extrinsèques tels que les variations d’illumination, de pose, d’échelle, de résolution, de bruit ou d’occultations. Nous examinons aussi l’apport de descripteurs spatio-temporels capables de prendre en compte des informations dynamiques utiles pour séparer les classes ambigües.La grosse limitation des méthodes de classification supervisée est qu’elles sont très coûteuses en termes de labélisation de données. Afin de s’affranchir en partie de cette limitation, nous avons étudié dans un second temps, comment utiliser des méthodes de transfert d’apprentissage de manière à essayer d’étendre les modèles appris sur un ensemble donné de classes d’émotions à des expressions inconnues du processus d’apprentissage. Ainsi nous nous sommes intéressés à l’adaptation de domaine et à l’apprentissage avec peu ou pas de données labélisées. La méthode proposée nous permet de traiter des données non labélisées provenant de distributions différentes de celles du domaine source de l’apprentissage ou encore des données qui ne concernent pas les mêmes labels mais qui partagent le même contexte. Le transfert de connaissance s’appuie sur un apprentissage euclidien et des réseaux de neurones convolutifs de manière à définir une fonction de mise en correspondance entre les informations visuelles provenant des expressions faciales et un espace sémantique issu d’un modèle de langage naturel.Dans un troisième temps, nous nous sommes intéressés à la reconnaissance des micro-expressions faciales. Nous proposons un algorithme destiné à localiser ces micro-expressions dans une séquence d’images depuis l’image initiale (onset image) jusqu’à l’image finale (offset image) et à déterminer les régions des images qui sont affectées par les micro-déformations associées aux micro-expressions. 
Le problème est abordé sous un angle de détection d’anomalies ce qui se justifie par le fait que les déformations engendrées par les micro-expressions sont a priori un phénomène plus rare que celles produites par toutes les autres causes de déformation du visage telles que les macro-expressions, les clignements des yeux, les mouvements de la tête… Ainsi nous proposons un réseau de neurones auto-encodeur récurrent destiné à capturer les changements spatiaux et temporels associés à toutes les déformations du visage autres que celles dues aux micro-expressions. Ensuite, nous apprenons un modèle statistique basé sur un mélange de gaussiennes afin d’estimer la densité de probabilité de ces déformations autres que celles dues aux micro-expressions.Tous nos algorithmes sont testés et évalués sur des bases d’expressions faciales actées et/ou spontanées
Facial expression analysis is an important problem in many biometric tasks, such as face recognition, face animation, affective computing and human-computer interfaces. In this thesis, we aim at analyzing facial expressions using images and video sequences. We divide the problem into three main parts. First, we study macro facial expressions for emotion recognition and propose three different levels of feature representation: low-level features through a Bag of Visual Words model, mid-level features through sparse representation, and hierarchical features through a deep-learning-based method. The objective is to find the most effective and efficient representation that contains distinctive information about expressions and that overcomes various challenges coming from: 1) intrinsic factors such as appearance and expressiveness variability, and 2) extrinsic factors such as illumination, pose, scale and imaging parameters, e.g., resolution, focus and noise. Then, we incorporate the time dimension to extract spatio-temporal features, with the objective of describing subtle feature deformations that discriminate ambiguous classes. Second, we direct our research toward transfer learning, where we aim at adapting facial expression category models to new domains and tasks. Thus, we study domain adaptation and zero-shot learning to develop a method that solves the two tasks jointly. Our method is suitable for unlabelled target datasets coming from data distributions different from the source domain, and for unlabelled target datasets with different label distributions but sharing the same context as the source domain. To permit knowledge transfer between domains and tasks, we use Euclidean learning and convolutional neural networks to design a mapping function that maps the visual information coming from facial expressions into a semantic space derived from a natural language model that encodes the visual attribute descriptions or uses the label information. The consistency between the two subspaces is maximized by aligning them using the visual feature distribution. Third, we study micro facial expression detection. We propose an algorithm to spot micro-expression segments, including the onset and offset frames, and to spatially pinpoint in each frame the regions involved in the micro facial muscle movements. The problem is formulated as anomaly detection, since micro-expressions occur infrequently and thus generate few data compared to natural facial behaviours. First, we propose a deep recurrent convolutional auto-encoder to capture spatial and motion feature changes of natural facial behaviours. Then, a statistical model based on a Gaussian Mixture Model is learned to estimate the probability density function of normal facial behaviours and to provide a discriminating score for spotting micro-expressions. Finally, an adaptive thresholding technique for identifying micro-expressions among natural facial behaviour is proposed. Our algorithms are tested on acted and spontaneous facial expression benchmarks.
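As a rough illustration of the anomaly-detection stage described above, one could fit a Gaussian mixture to features of normal facial behaviour and score new frames by log-likelihood. The sketch below uses scikit-learn and synthetic stand-in features, and the fixed percentile threshold is our own simplification, not the thesis's adaptive thresholding scheme:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Stand-in features: rows could be auto-encoder reconstruction errors or
# motion descriptors extracted from frames of normal facial behaviour.
normal_feats = rng.normal(loc=0.0, scale=1.0, size=(500, 16))

# Fit a density model of "normal" behaviour.
gmm = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
gmm.fit(normal_feats)

# Score new frames: unusually low log-likelihood flags a candidate micro-expression.
test_feats = np.vstack([rng.normal(0, 1, (20, 16)),   # normal frames
                        rng.normal(4, 1, (5, 16))])   # anomalous frames
scores = gmm.score_samples(test_feats)
threshold = np.percentile(gmm.score_samples(normal_feats), 1)  # simple fixed threshold
is_anomalous = scores < threshold
```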
APA, Harvard, Vancouver, ISO, and other styles
45

Clapés, i. Sintes Albert. "Learning to recognize human actions: from hand-crafted to deep-learning based visual representations." Doctoral thesis, Universitat de Barcelona, 2019. http://hdl.handle.net/10803/666794.

Full text
Abstract:
Action recognition is a very challenging and important problem in computer vision. Researchers working in this field aspire to provide computers with the ability to visually perceive human actions, that is, to observe, interpret, and understand human-related events that occur in the physical environment merely from visual data. The applications of this technology are numerous: human-machine interaction, e-health, monitoring/surveillance, and content-based video retrieval, among others. Hand-crafted methods dominated the field until the appearance of the first successful deep-learning-based action recognition works. Although earlier deep-based methods underperformed with respect to hand-crafted approaches, they slowly but steadily improved to become state of the art, eventually achieving better results than hand-crafted ones. Still, hand-crafted approaches can be advantageous in certain scenarios, especially when not enough data is available to train very large deep models, or simply when combined with deep-based methods to further boost performance. This shows how hand-crafted features can provide extra knowledge about human actions that deep networks are not able to easily learn. This Thesis coincides in time with this change of paradigm and, hence, reflects it in two distinct parts. In the first part, we focus on improving current successful hand-crafted approaches for action recognition, and we do so from three different perspectives, using the dense trajectories framework as a backbone. First, we explore the use of multi-modal and multi-view input data to enrich the trajectory descriptors. Second, we focus on the classification part of action recognition pipelines and propose an ensemble learning approach, where each classifier learns from a different set of local spatiotemporal features and their outputs are then combined following a strategy based on Dempster-Shafer theory. Third, we propose a novel hand-crafted feature extraction method that constructs a mid-level feature description to better model long-term spatiotemporal dynamics within action videos. Moving to the second part of the Thesis, we start with a comprehensive study of current deep-learning-based action recognition methods. We review both fundamental and cutting-edge methodologies reported during the last few years and introduce a taxonomy of deep-learning methods dedicated to action recognition. In particular, we analyze and discuss how these handle the temporal dimension of the data. Last but not least, we propose a residual recurrent network for action recognition that naturally integrates all our previous findings in a powerful and promising framework.
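To make the ensemble step more concrete, Dempster's rule of combination for two classifiers' mass assignments over action classes might look roughly like the sketch below. The frame of classes, the mass values and the classifier names are made up for illustration; the thesis's actual frame of discernment, discounting and decision rule are not reproduced here:

```python
def dempster_combine(m1, m2):
    """Combine two mass functions (dicts: frozenset of labels -> mass) with Dempster's rule."""
    combined, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb        # mass falling on the empty set
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

# Hypothetical masses from two classifiers over a tiny frame of action classes;
# some mass is kept on the whole frame to express each classifier's ignorance.
theta = frozenset({"walk", "run", "jump"})
m_trajectory = {frozenset({"run"}): 0.6, frozenset({"walk"}): 0.1, theta: 0.3}
m_appearance = {frozenset({"run"}): 0.5, frozenset({"jump"}): 0.2, theta: 0.3}

fused = dempster_combine(m_trajectory, m_appearance)
decision = max(fused, key=fused.get)       # most supported focal element
```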
El reconeixement d’accions és un repte de gran rellevància pel que fa a la visió per computador. Els investigadors que treballen en el camp aspiren a proveir als ordinadors l’habilitat de percebre visualment les accions humanes – és a dir, d’observar, interpretar i comprendre a partir de dades visuals els events que involucren humans i que transcorren en l’entorn físic. Les aplicacions d’aquesta tecnologia són nombroses: interacció home-màquina, e-salut, monitoració/vigilància, indexació de videocontingut, etc. Els mètodes de disseny manual han dominat el camp fins l’aparició dels primers treballs exitosos d’aprenentatge profund, els quals han acabat esdevenint estat de l’art. No obstant, els mètodes de disseny manual resulten útils en certs escenaris, com ara quan no es tenen prou dades per a l’entrenament dels mètodes profunds, així com també aportant coneixement addicional que aquests últims no són capaços d’aprendre fàcilment. És per això que sovint els trobem ambdós combinats, aconseguint una millora general del reconeixement. Aquesta Tesi ha concorregut en el temps amb aquest canvi de paradigma i, per tant, ho reflecteix en dues parts ben distingides. En la primera part, estudiem les possibles millores sobre els mètodes existents de característiques manualment dissenyades per al reconeixement d’accions, i ho fem des de diversos punts de vista. Fent ús de les trajectòries denses com a fonament del nostre treball: primer, explorem l’ús de dades d’entrada de múltiples modalitats i des de múltiples vistes per enriquir els descriptors de les trajectòries. Segon, ens centrem en la part de la classificació del reconeixement d’accions, proposant un assemblat de classificadors d’accions que actuen sobre diversos conjunts de característiques i fusionant-ne les sortides amb una estratégia basada en la Teoria de Dempster-Shaffer. I tercer, proposem un nou mètode de disseny manual d’extracció de característiques que construeix una descripció intermèdia dels videos per tal d’aconseguir un millor modelat de les dinàmiques espai-temporals de llarg termini presents en els vídeos d’accions. Pel que fa a la segona part de la Tesi, comencem amb un estudi exhaustiu els mètodes actuals d’aprenentatge profund pel reconeixement d’accions. En revisem les metodologies més fonamentals i les més avançades darrerament aparegudes i establim una taxonomia que en resumeix els aspectes més importants. Més concretament, analitzem com cadascun dels mètodes tracta la dimensió temporal de les dades de vídeo. Per últim però no menys important, proposem una nova xarxa de neurones recurrent amb connexions residuals que integra de manera implícita les nostres contribucions prèvies en un nou marc d’acoblament potent i que mostra resultats prometedors.
APA, Harvard, Vancouver, ISO, and other styles
46

Hom, John S. "Making the Invisible Visible: Interrogating social spaces through photovoice." The Ohio State University, 2010. http://rave.ohiolink.edu/etdc/view?acc_num=osu1284482097.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

McNeill, Dean K. "Adaptive visual representations for autonomous mobile robots using competitive learning algorithms." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1999. http://www.collectionscanada.ca/obj/s4/f2/dsk2/ftp02/NQ35045.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Li, Muhua 1973. "Learning invariant neuronal representations for objects across visual-related self-actions." Thesis, McGill University, 2005. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=85565.

Full text
Abstract:
This work is aimed at understanding and modelling the mechanisms of perceptual stability in the human visual system, which maintain stable perception despite large changes in the visual sensory input resulting from visually related self-motions. Invariant neuronal representations play an important role in allowing memory systems to associate and recognize objects.
In contrast to the bulk of previous research on the learning of invariance, which focuses on purely bottom-up visual information, we incorporate visually related self-action signals, such as commands for eye, head or body movements, to actively collect the changing visual information and to gate the learning process. This helps neural networks learn certain degrees of invariance in an efficient way. We describe a method that can produce a network with invariance to changes in visual input caused by eye movements and covert attention shifts. Training of the network is controlled by signals associated with eye movements and covert attention shifting. A temporal perceptual stability constraint is used to drive the output of the network towards remaining constant across temporal sequences of saccadic motions and covert attention shifts. We use a four-layer neural network model to perform position-invariant extraction of local features and temporal integration of invariant representations of local features. The model is further extended to handle viewpoint invariance over eye, head, and/or body movements. We also study cases of multiple features, instead of single features, in the retinal images, which require a self-organized system to learn over a set of feature classes. A modified saliency-map mechanism with a spatial constraint is employed to ensure that attention stays as much as possible on the same targeted object in a multiple-object scene during the first few shifts.
We present results on both simulated data and real images, to demonstrate that our network can acquire invariant neuronal representations, such as position and attention shift invariance. We also demonstrate that our method performs well in realistic situations in which the temporal sequence of input data is not smooth, situations in which earlier approaches have difficulty.
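As a toy illustration of the kind of mechanism described in the abstract above, the sketch below gates a temporal-stability (trace-like) weight update with a self-action signal, so that learning is triggered around self-generated view changes such as saccades. The variable names and the gating scheme are our own assumptions, not the thesis's exact model:

```python
import numpy as np

def gated_invariance_update(w, x, y_trace, saccade_signal, eta=0.05, decay=0.9):
    """Update weights only when a self-action (e.g. a saccade command) is issued.

    The temporal trace keeps the output similar across the view change, so
    inputs before and after the self-generated motion become bound to the
    same output representation.
    """
    y = w @ x                                    # current output activity
    y_trace = decay * y_trace + (1 - decay) * y  # temporal stability trace
    if saccade_signal:                           # gate learning by the self-action signal
        w = w + eta * np.outer(y_trace, x)
    return w, y_trace

# Toy sequence: the same object seen at two retinal positions, linked by a saccade.
rng = np.random.default_rng(1)
w = rng.normal(scale=0.01, size=(4, 8))
y_trace = np.zeros(4)
view_before, view_after = rng.random(8), rng.random(8)
w, y_trace = gated_invariance_update(w, view_before, y_trace, saccade_signal=False)
w, y_trace = gated_invariance_update(w, view_after, y_trace, saccade_signal=True)
```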
APA, Harvard, Vancouver, ISO, and other styles
49

Sena, Claudia Pinto Pereira. "Colaboração e mediação no processo de construção e representação do conhecimento por pessoas com deficiência visual, a partir da utilização da aprendizagem baseada em problemas." Faculdade de Educação, 2014. http://repositorio.ufba.br/ri/handle/ri/18154.

Full text
Abstract:
O homem, em todo seu trajeto histórico cultural, vem produzindo e utilizando tecnologias, desde as mais rudimentares, como uma lança de madeira, até as mais complexas, não só como garantia de sobrevivência, como também em um processo contínuo de construção e difusão de conhecimentos. A sociedade, cada vez mais, vem exigindo do homem autonomia, criatividade e adaptação, valorando as informações e, em especial, os conhecimentos construídos. A educação, enquanto espaço formador de cidadãos, precisa estar atenta a estas constantes modificações sociais, econômicas e políticas e oferecer oportunidades que privilegiem o aprendizado significativo. Entendendo que o ambiente, a cultura e outros sujeitos influenciam a construção do conhecimento (mediação), propõe-se, com este trabalho, investigar o PBL (Problem Based Learning), enquanto estratégia educacional de aprendizagem colaborativa, em um grupo de pessoas com deficiência visual, através da experiência vivenciada em um Centro de Apoio ao Deficiente Visual da cidade de Feira de Santana-Ba. A mediação, compreendida neste trabalho como a interação entre os sujeitos e a utilização de instrumentos e signos, perpassa o processo de ensino aprendizagem, privilegiando o diálogo entre os pares e a intervenção. Em se tratando das pessoas com deficiência visual, signos não visuais devem ser utilizados, privilegiando o desenvolvimento de outras aptidões, como a percepção tátil, auditiva, dentre outras. A relevância de experimentar o PBL em um grupo de pessoas com deficiência visual se revela na possibilidade de oportunizar a estas pessoas um ambiente de interação que favorece a aquisição de conceitos e representação de conhecimentos, um espaço de diálogo e de inclusão social e de observar as potencialidades e fragilidades do método neste contexto. Frente às transformações do mundo contemporâneo, a educação tem utilizado as tecnologias de informação e comunicação (TIC) com a intenção de participar do processo de inclusão sociodigital. Diante do exposto, pretende-se também observar de que maneira as TIC podem ser utilizadas, ampliando as habilidades da pessoa com deficiência visual e colaborando na construção e difusão dos conhecimentos partilhados.
Throughout its historical and cultural path, humankind has been producing and using technologies, from the most rudimentary, like a wooden spear, to the most complex, not only as a guarantee of survival but also as part of a continuous process of constructing and diffusing knowledge. Society increasingly demands autonomy, creativity and adaptation, valuing information and, in particular, the knowledge that is built. Education, as a space that forms citizens, needs to be aware of these constant social, economic and political changes and to provide opportunities that emphasize meaningful learning. Understanding that the environment, culture and other subjects influence the construction of knowledge (mediation), this work proposes to investigate PBL (Problem-Based Learning) as an educational strategy for collaborative learning in a group of visually impaired people, through the lived experience at a Support Center for the Visually Impaired in Feira de Santana, Bahia. Mediation, understood in this work as the interaction between subjects and the use of tools and signs, permeates the teaching and learning process, focusing on dialogue between peers and on intervention. In the case of people with visual impairment, non-visual signs must be used, favoring the development of other skills such as tactile perception and hearing, among others. The relevance of experiencing PBL in a group of visually impaired people lies in the possibility of offering them an environment of interaction that favors the acquisition of concepts and the representation of knowledge, a space of dialogue and social inclusion, and of observing the strengths and weaknesses of the method in this context. Faced with the transformations of the contemporary world, education has used information and communication technologies (ICT) with the intention of participating in the process of socio-digital inclusion. Therefore, we also intend to observe how ICT can be used to expand the skills of people with visual impairments, supporting the tutorial sessions and collaborating in the construction and diffusion of shared knowledge.
APA, Harvard, Vancouver, ISO, and other styles
50

Eigenstetter, Angela [Verfasser], and Björn [Akademischer Betreuer] Ommer. "Learning Mid-Level Representations for Visual Recognition / Angela Eigenstetter ; Betreuer: Björn Ommer." Heidelberg : Universitätsbibliothek Heidelberg, 2015. http://d-nb.info/1180499883/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles

To the bibliography