Dissertations / Theses on the topic 'Visual learning'

Consult the top 50 dissertations / theses for your research on the topic 'Visual learning.'

1

Zhu, Fan. "Visual feature learning." Thesis, University of Sheffield, 2015. http://etheses.whiterose.ac.uk/8218/.

Abstract:
Categorization is a fundamental problem in many computer vision applications, e.g., image classification, pedestrian detection and face recognition. The robustness of a categorization system relies heavily on the quality of the features by which data are represented. Prior work on feature extraction can be grouped into levels, which, in bottom-up order, are low-level features (e.g., pixels and gradients) and middle/high-level features (e.g., the BoW model and sparse coding). Low-level features can be extracted directly from images or videos, while middle/high-level features are built upon low-level features and are designed to enhance the capability of categorization systems based on different considerations (e.g., guaranteeing domain invariance and improving discriminative power). This thesis focuses on the study of visual feature learning. The remaining challenges in designing visual features lie in intra-class variation, occlusions, illumination and viewpoint changes, and insufficient prior knowledge. To address these challenges, I present several visual feature learning methods covering the following sub-topics: (i) I start by introducing a segmentation-based object recognition system. (ii) When training data are insufficient, I seek data from other resources, which include images or videos in a different domain, actions captured from a different viewpoint and information in a different media form. To transfer such resources appropriately into the target categorization system, four transfer learning-based feature learning methods are presented, addressing cross-view, cross-domain and cross-modality scenarios. (iii) Finally, I present a random-forest-based feature fusion method for multi-view action recognition.
2

Goh, Hanlin. "Learning deep visual representations." Paris 6, 2013. http://www.theses.fr/2013PA066356.

Abstract:
Recent advancements in the areas of deep learning and visual information processing have presented an opportunity to unite both fields. These complementary fields combine to tackle the problem of classifying images into their semantic categories. Deep learning brings learning and representational capabilities to a visual processing model that is adapted for image classification. This thesis addresses problems that lead to the proposal of learning deep visual representations for image classification. The problem of deep learning is tackled on two fronts. The first aspect is the problem of unsupervised learning of latent representations from input data. The main focus is the integration of prior knowledge into the learning of restricted Boltzmann machines (RBM) through regularization. Regularizers are proposed to induce sparsity, selectivity and topographic organization in the coding to improve discrimination and invariance. The second direction introduces the notion of gradually transitioning from unsupervised layer-wise learning to supervised deep learning. This is done through the integration of bottom-up information with top-down signals. Two novel implementations supporting this notion are explored. The first method uses top-down regularization to train a deep network of RBMs. The second method combines predictive and reconstructive loss functions to optimize a stack of encoder-decoder networks. The proposed deep learning techniques are applied to tackle the image classification problem. The bag-of-words model is adopted due to its strengths in image modeling through the use of local image descriptors and spatial pooling schemes. Deep learning with spatial aggregation is used to learn a hierarchical visual dictionary for encoding the image descriptors into mid-level representations. This method achieves leading image classification performance for object and scene images. The learned dictionaries are diverse and non-redundant. The speed of inference is also high. From this, a further optimization is performed for the subsequent pooling step. This is done by introducing a differentiable pooling parameterization and applying the error backpropagation algorithm. This thesis represents one of the first attempts to synthesize deep learning and the bag-of-words model. This union results in many challenging research problems, leaving much room for further study in this area.
3

Walker, Catherine Livesay. "Visual learning through Hypermedia." CSUSB ScholarWorks, 1996. https://scholarworks.lib.csusb.edu/etd-project/1148.

4

Owens, Andrew (Andrew Hale). "Learning visual models from paired audio-visual examples." Thesis, Massachusetts Institute of Technology, 2016. http://hdl.handle.net/1721.1/107352.

Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2016.
Cataloged from PDF version of thesis.
Includes bibliographical references (pages 93-104).
From the clink of a mug placed onto a saucer to the bustle of a busy café, our days are filled with visual experiences that are accompanied by distinctive sounds. In this thesis, we show that these sounds can provide a rich training signal for learning visual models. First, we propose the task of predicting the sound that an object makes when struck as a way of studying physical interactions within a visual scene. We demonstrate this idea by training an algorithm to produce plausible soundtracks for videos in which people hit and scratch objects with a drumstick. Then, with human studies and automated evaluations on recognition tasks, we verify that the sounds produced by the algorithm convey information about actions and material properties. Second, we show that ambient audio - e.g., crashing waves, people speaking in a crowd - can also be used to learn visual models. We train a convolutional neural network to predict a statistical summary of the sounds that occur within a scene, and we demonstrate that the visual representation learned by the model conveys information about objects and scenes.
by Andrew Owens.
Ph. D.
5

Peyre, Julia. "Learning to detect visual relations." Thesis, Paris Sciences et Lettres (ComUE), 2019. http://www.theses.fr/2019PSLEE016.

Abstract:
In this thesis, we study the problem of detecting visual relations of the form (subject, predicate, object) in images, which are intermediate-level semantic units between objects and complex scenes. Our work addresses two main challenges in visual relation detection: (1) the difficulty of obtaining box-level annotations to train fully-supervised models, (2) the variability of appearance of visual relations. We first propose a weakly-supervised approach which, given pre-trained object detectors, enables us to learn relation detectors using image-level labels only, maintaining a performance close to fully-supervised models. Second, we propose a model that combines different granularities of embeddings (for subject, object, predicate and triplet) to better model appearance variation and introduce an analogical reasoning module to generalize to unseen triplets. Experimental results demonstrate the improvement of our hybrid model over a purely compositional model and validate the benefits of our transfer by analogy to retrieve unseen triplets.
6

Wang, Zhaoqing. "Self-supervised Visual Representation Learning." Thesis, The University of Sydney, 2022. https://hdl.handle.net/2123/29595.

Abstract:
In general, large-scale annotated data are essential to training deep neural networks in order to achieve better performance in visual feature learning for various computer vision applications. Unfortunately, such annotations are hard to obtain, costing considerable money and human effort. The dependence on large-scale annotated data has become a crucial bottleneck in developing advanced intelligent perception systems. Self-supervised visual representation learning, a subset of unsupervised learning, has gained popularity because of its ability to avoid the high cost of annotated data. A series of methods have designed various pretext tasks to learn general representations from unlabeled data and use these representations for different downstream tasks. Although previous methods achieved great success, a label noise problem exists in these pretext tasks due to the lack of human-annotation supervision, which harms transfer performance. This thesis discusses two types of noise problem in self-supervised learning and designs corresponding methods to alleviate the negative effects and learn transferable representations. Firstly, in pixel-level self-supervised learning, the pixel-level correspondences are easily made noisy by complicated context relationships (e.g., misleading pixels in the background). Secondly, two views of the same image share the foreground object and some background information. When optimizing the pretext task (e.g., contrastive learning), the model easily captures both the foreground object and noisy background information. Such background information can be harmful to the transfer performance on downstream tasks, including image classification, object detection, and instance segmentation. To address the above-mentioned issues, our core idea is to leverage data regularities and prior knowledge. 
Experimental results demonstrate that the proposed methods effectively alleviate the negative effects of label noise in self-supervised learning and surpass a series of previous methods.
7

Tang-Wright, Kimmy. "Visual topography and perceptual learning in the primate visual system." Thesis, University of Oxford, 2016. https://ora.ox.ac.uk/objects/uuid:388b9658-dceb-443a-a19b-c960af162819.

Abstract:
The primate visual system is organised and wired in a topological manner. From the eye well into extrastriate visual cortex, a preserved spatial representation of the visual world is maintained across many levels of processing. Diffusion-weighted imaging (DWI), together with probabilistic tractography, is a non-invasive technique for mapping connectivity within the brain. In this thesis I probed the sensitivity and accuracy of DWI and probabilistic tractography by quantifying its capacity to detect topological connectivity in the post mortem macaque brain, between the lateral geniculate nucleus (LGN) and primary visual cortex (V1). The results were validated against electrophysiological and histological data from previous studies. Using the methodology developed in this thesis, it was possible to segment the LGN reliably into distinct subregions based on its structural connectivity to different parts of the visual field represented in V1. Quantitative differences in connectivity from magno- and parvocellular subcomponents of the LGN to different parts of V1 could be replicated with this method in post mortem brains. The topological corticocortical connectivity between extrastriate visual area V5/MT and V1 could also be mapped in the post mortem macaque. In vivo DWI scans previously obtained from the same brains have lower resolution and signal-to-noise because of the shorter scan times. Nevertheless, in many cases, these yielded topological maps similar to the post mortem maps. These results indicate that the preserved topology of connection between LGN to V1, and V5/MT to V1, can be revealed using non-invasive measures of diffusion-weighted imaging and tractography in vivo. In a preliminary investigation using Human Connectome data obtained in vivo, I was not able to segment the retinotopic map in LGN based on connections to V1. 
This may be because information about the topological connectivity is not carried in the much lower resolution human diffusion data, or because of other methodological limitations. I also investigated the mechanisms of perceptual learning by developing a novel task-irrelevant perceptual learning paradigm designed to adapt neuronal elements early on in visual processing in a certain region of the visual field. There is evidence, although not clear-cut, to suggest that the paradigm elicits task-irrelevant perceptual learning, but that these effects only emerge when practice-related effects are accounted for. When orientation and location specific effects on perceptual performance are examined, the largest improvement occurs at the trained location, however, there is also significant improvement at one other 'untrained' location, and there is also a significant improvement in performance for a control group that did not receive any training at any location. The work highlights inherent difficulties in investigating perceptual learning, which relate to the fact that learning likely takes place at both lower and higher levels of processing, however, the paradigm provides a good starting point for comprehensively investigating the complex mechanisms underlying perceptual learning.
8

Shi, Xiaojin. "Visual learning from small training datasets." Diss., Digital Dissertations Database. Restricted to UC campuses, 2005. http://uclibs.org/PID/11984.

9

Liu, Jingen. "Learning Semantic Features for Visual Recognition." Doctoral diss., University of Central Florida, 2009. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/3358.

Abstract:
Visual recognition (e.g., object, scene and action recognition) is an active area of research in computer vision due to its increasing number of real-world applications such as video (image) indexing and search, intelligent surveillance, human-machine interaction, robot navigation, etc. Effective modeling of the objects, scenes and actions is critical for visual recognition. Recently, the bag of visual words (BoVW) representation, in which the image patches or video cuboids are quantized into visual words (i.e., mid-level features) based on their appearance similarity using clustering, has been widely and successfully explored. The advantages of this representation are: no explicit detection or tracking of objects or object parts is required; the representation is somewhat tolerant to within-class deformations; and it is efficient for matching. However, the performance of the BoVW is sensitive to the size of the visual vocabulary. Therefore, computationally expensive cross-validation is needed to find the appropriate quantization granularity. This limitation is partially due to the fact that the visual words are not semantically meaningful, which limits the effectiveness and compactness of the representation. To overcome these shortcomings, in this thesis we present a principled approach to learning a semantic vocabulary (i.e., high-level features) from a large number of visual words (mid-level features). In this context, the thesis makes two major contributions. First, we have developed an algorithm to discover a compact yet discriminative semantic vocabulary. This vocabulary is obtained by grouping the visual words based on their distribution in videos (images) into visual-word clusters. The mutual information (MI) between the clusters and the videos (images) depicts the discriminative power of the semantic vocabulary, while the MI between visual words and visual-word clusters measures the compactness of the vocabulary. 
We apply the information bottleneck (IB) algorithm to find the optimal number of visual-word clusters by finding a good tradeoff between compactness and discriminative power. We tested our proposed approach on the state-of-the-art KTH dataset, and obtained an average accuracy of 94.2%. However, this approach performs one-sided clustering, because only visual words are clustered regardless of which video they appear in. In order to leverage the co-occurrence of visual words and images, we have developed a co-clustering algorithm to simultaneously group the visual words and images. We tested our approach on the publicly available fifteen scene dataset and obtained about a 4% increase in average accuracy compared to the one-sided clustering approaches. Second, instead of grouping the mid-level features, we first embed the features into a low-dimensional semantic space by manifold learning, and then perform the clustering. We apply Diffusion Maps (DM) to capture the local geometric structure of the mid-level feature space. The DM embedding is able to preserve the explicitly defined diffusion distance, which reflects the semantic similarity between any two features. Furthermore, DM provides multi-scale analysis capability by adjusting the time steps in the Markov transition matrix. The experiments on the KTH dataset show that DM can perform much better (about 3% to 6% improvement in average accuracy) than other manifold learning approaches and the IB method. The above methods use only a single type of feature. In order to combine multiple heterogeneous features for visual recognition, we further propose the Fiedler Embedding to capture the complicated semantic relationships between all entities (i.e., videos, images, heterogeneous features). The discovered relationships are then employed to further increase the recognition rate. We tested our approach on the Weizmann dataset, and achieved about 17% to 21% improvement in average accuracy.
Ph.D.
School of Electrical Engineering and Computer Science
Engineering and Computer Science
Computer Science PhD
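As an illustration of the mutual-information measure mentioned in the abstract above, the following is a minimal sketch that estimates MI from a table of co-occurrence counts (for example, visual-word clusters by videos). It is not the thesis's information bottleneck algorithm; the function name `mutual_information` and the toy count tables are hypothetical.

```python
import math


def mutual_information(counts):
    """Estimate mutual information (in bits) from a joint count table.

    counts[i][j] is how often row item i (e.g. a visual-word cluster)
    co-occurs with column item j (e.g. a video). Hypothetical helper,
    not taken from the thesis.
    """
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, n in enumerate(row):
            if n == 0:
                continue  # a zero cell contributes nothing to the sum
            p_xy = n / total
            # p(x, y) / (p(x) * p(y)) simplifies to n * total / (row * col)
            mi += p_xy * math.log2(n * total / (row_sums[i] * col_sums[j]))
    return mi


# Clusters that perfectly separate two videos carry one bit of information;
# clusters spread evenly over both videos carry none.
separating = mutual_information([[5, 0], [0, 5]])
uninformative = mutual_information([[5, 5], [5, 5]])
```

A higher value for the cluster-by-video table corresponds to what the abstract calls greater discriminative power.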
10

Beale, Dan. "Autonomous visual learning for robotic systems." Thesis, University of Bath, 2012. https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.558886.

Abstract:
This thesis investigates the problem of visual learning using a robotic platform. Given a set of objects, the robot's task is to autonomously manipulate, observe, and learn. This allows the robot to recognise objects in a novel scene and pose, or separate them into distinct visual categories. The main focus of the work is on autonomously acquiring object models using robotic manipulation. Autonomous learning is important for robotic systems. In the context of vision, it allows a robot to adapt to new and uncertain environments, updating its internal model of the world. It also reduces the amount of human supervision needed for building visual models. This leads to machines which can operate in environments with rich and complicated visual information, such as the home or industrial workspace, and in environments which are potentially hazardous for humans. The hypothesis claims that inducing robot motion on objects aids the learning process. It is shown that extra information from the robot sensors provides enough information to localise an object and distinguish it from the background. It is also shown that decisive planning allows the object to be separated and observed from a variety of different poses, giving a good foundation to build a robust classification model. Contributions include a new segmentation algorithm, a new classification model for object learning, and a method for allowing a robot to supervise its own learning in cluttered and dynamic environments.
11

Lakshmi, Ratan Aparna. "Learning visual concepts for image classification." Thesis, Massachusetts Institute of Technology, 1999. http://hdl.handle.net/1721.1/80092.

Abstract:
Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999.
Includes bibliographical references (leaves 166-174).
by Aparna Lakshmi Ratan.
Ph.D.
12

Moghaddam, Baback 1963. "Probabilistic visual learning for object detection." Thesis, Massachusetts Institute of Technology, 1997. http://hdl.handle.net/1721.1/10242.

Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1997.
Includes bibliographical references (leaves 78-82).
by Baback Moghaddam.
Ph.D.
13

Wilson, Andrew David. "Learning visual behavior for gesture analysis." Thesis, Massachusetts Institute of Technology, 1995. http://hdl.handle.net/1721.1/62924.

14

Zhou, Bolei. "Interpretable representation learning for visual intelligence." Thesis, Massachusetts Institute of Technology, 2018. http://hdl.handle.net/1721.1/117837.

Abstract:
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 131-140).
Recent progress of deep neural networks in computer vision and machine learning has enabled transformative applications across robotics, healthcare, and security. However, despite the superior performance of the deep neural networks, it remains challenging to understand their inner workings and explain their output predictions. This thesis investigates several novel approaches for opening up the "black box" of neural networks used in visual recognition tasks and understanding their inner working mechanism. I first show that objects and other meaningful concepts emerge as a consequence of recognizing scenes. A network dissection approach is further introduced to automatically identify the internal units as the emergent concept detectors and quantify their interpretability. Then I describe an approach that can efficiently explain the output prediction for any given image. It sheds light on the decision-making process of the networks and why the predictions succeed or fail. Finally, I show some ongoing efforts toward learning efficient and interpretable deep representations for video event understanding and some future directions.
by Bolei Zhou.
Ph. D.
15

Pillai, Sudeep. "Learning articulated motions from visual demonstration." Thesis, Massachusetts Institute of Technology, 2014. http://hdl.handle.net/1721.1/89861.

Abstract:
Thesis: S.M. in Computer Science and Engineering, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2014.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 94-98).
Robots operating autonomously in household environments must be capable of interacting with articulated objects on a daily basis. They should be able to infer each object's underlying kinematic linkages purely by observing its motion during manipulation. This work proposes a framework that enables robots to learn the articulation in objects from user-provided demonstrations, using RGB-D sensors. We introduce algorithms that combine concepts in sparse feature tracking, motion segmentation, object pose estimation, and articulation learning, to develop our proposed framework. Additionally, our methods can predict the motion of previously seen articulated objects in future encounters. We present experiments that demonstrate the ability of our method, given RGB-D data, to identify, analyze and predict the articulation of a number of everyday objects within a human-occupied environment.
by Sudeep Pillai.
S.M. in Computer Science and Engineering
16

Williams, Oliver Michael Christian. "Bayesian learning for efficient visual inference." Thesis, University of Cambridge, 2006. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.613979.

17

North, Ben. "Learning dynamical models for visual tracking." Thesis, University of Oxford, 1998. http://ora.ox.ac.uk/objects/uuid:6ed12552-4c30-4d80-88ef-7245be2d8fb8.

Abstract:
Using some form of dynamical model in a visual tracking system is a well-known method for increasing robustness and indeed performance in general. Often, quite simple models are used and can be effective, but prior knowledge of the likely motion of the tracking target can often be exploited by using a specially-tailored model. Specifying such a model by hand, while possible, is a time-consuming and error-prone process. Much more desirable is for an automated system to learn a model from training data. A dynamical model learnt in this manner can also be a source of useful information in its own right, and a set of dynamical models can provide discriminatory power for use in classification problems. Methods exist to perform such learning, but are limited in that they assume the availability of 'ground truth' data. In a visual tracking system, this is rarely the case. A learning system must work from visual data alone, and this thesis develops methods for learning dynamical models while explicitly taking account of the nature of the training data: they are noisy measurements. The algorithms are developed within two tracking frameworks. The Kalman filter is a simple and fast approach, applicable where the visual clutter is limited. The recently-developed Condensation algorithm is capable of tracking in more demanding situations, and can also employ a wider range of dynamical models than the Kalman filter, for instance multi-mode models. The success of the learning algorithms is demonstrated experimentally. When using a Kalman filter, the dynamical models learnt using the algorithms presented here produce better tracking when compared with those learnt using current methods. Learning directly from training data gathered using Condensation is an entirely new technique, and experiments show that many aspects of a multi-mode system can be successfully identified using very little prior information. 
Significant computational effort is required by the implementation of the methods, and there is scope for improvement in this regard. Other possibilities for future work include investigation of the strong links this work has with learning problems in other areas. Most notable is the study of the 'graphical models' commonly used in expert systems, where the ideas presented here promise to give insight and perhaps lead to new techniques.
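To make the role of a dynamical model in a Kalman-filter tracker concrete, here is a minimal scalar sketch under stated assumptions: the state follows x[t+1] = a * x[t] + process noise, and each measurement is the state plus noise. The function name and the parameters `a`, `q`, `r`, `x0`, `p0` are hypothetical; the sketch shows only the standard predict/update cycle, not the thesis's method for learning the model from noisy measurements.

```python
def kalman_filter_1d(measurements, a=1.0, q=0.01, r=1.0, x0=0.0, p0=1.0):
    """Run a scalar Kalman filter and return the filtered state estimates.

    a  -- dynamical model coefficient (assumed known here, not learnt)
    q  -- process noise variance
    r  -- measurement noise variance
    x0 -- initial state estimate
    p0 -- initial estimate variance
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: propagate the estimate through the dynamical model.
        x = a * x
        p = a * p * a + q
        # Update: blend prediction and measurement using the Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates


# A target sitting at position 5.0, tracked from an initial guess of 0.0.
track = kalman_filter_1d([5.0] * 50)
```

In this setting, the coefficient `a` and the noise variances are the quantities a learning procedure would estimate from data rather than set by hand.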
18

Florence, Peter R. (Peter Raymond). "Dense visual learning for robot manipulation." Thesis, Massachusetts Institute of Technology, 2020. https://hdl.handle.net/1721.1/128398.

Abstract:
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2020
Cataloged from student-submitted PDF of thesis.
Includes bibliographical references (pages 115-127).
We would like to have highly useful robots which can richly perceive their world, semantically distinguish its fine details, and physically interact with it sufficiently for useful robotic manipulation. This is hard to achieve with previous methods: prior work has not equipped robots with the scalable ability to understand the dense visual state of their varied environments. The limitations have been both in the state representations used and in how to acquire them without significant human labeling effort. In this thesis we present work that leverages self-supervision, particularly via a mix of geometrical computer vision, deep visual learning, and robotic systems, to scalably produce dense visual inferences of the world state. These methods either enable robots to teach themselves dense visual models without human supervision, or they act as a large multiplying factor on the value of information provided by humans. Specifically, we develop a pipeline for providing ground truth labels of visual data in cluttered and multi-object scenes, we introduce the novel application of dense visual object descriptors to robotic manipulation and provide a fully robot-supervised pipeline to acquire them, and we leverage this dense visual understanding to efficiently learn new manipulation skills through imitation. With real robot hardware we demonstrate contact-rich tasks manipulating household objects, including generalizing across a class of objects, manipulating deformable objects, and manipulating a textureless symmetrical object, all with closed-loop, real-time vision-based manipulation policies.
by Peter R. Florence.
19

Chen, Zhenghao. "Deep Learning for Visual Data Compression." Thesis, The University of Sydney, 2022. https://hdl.handle.net/2123/29729.

Abstract:
With the tremendous success of neural networks, a few learning-based image codecs have been proposed and outperform traditional image codecs. However, learning-based compression for other categories of visual data has remained much less explored. This thesis investigates the effectiveness of deep learning for visual data compression and proposes three end-to-end learning-based compression methods for compressing standard videos, 3D volumetric images, and stereo videos, respectively. First, we improve existing learning-based video codecs with a newly proposed adaptive coding method called Resolution-adaptive Motion Coding (RaMC), which effectively compresses the introduced motion information to reduce the bit-rate cost. Then, we investigate the effectiveness of deep learning for lossless 3D volumetric image compression and propose the first end-to-end optimized learning framework for losslessly compressing 3D volumetric images. We introduce an Intra-slice and Inter-slice Conditional Entropy Coding (ICEC) module to fuse multi-scale intra-slice and inter-slice features as the context information for better entropy coding. Besides the aforementioned single-view visual data, we further employ neural networks to compress multi-view visual data and propose the first end-to-end Learning-based Stereo Video Compression (LSVC) framework. It compresses both the left and right views of a stereo video using a deep motion and disparity compensation strategy with fully differentiable modules and can be optimized in an end-to-end manner. We conduct extensive experiments on multiple publicly available datasets to demonstrate the effectiveness of our proposed RaMC, ICEC, and LSVC methods. The results indicate that these three methods achieve state-of-the-art compression performance in the corresponding visual data compression tasks and outperform traditional visual data compression frameworks.
20

Dey, Priya. "Visual speech in technology-enhanced learning." Thesis, University of Sheffield, 2012. http://etheses.whiterose.ac.uk/3329/.

Abstract:
This thesis investigates the use of synthetic talking heads, with lip, tongue and face movements synchronized with synthesized or natural speech, in technology-enhanced learning. This work applies talking heads in a speech tutoring application for teaching English as a second language. Previous studies have shown that speech perception is aided by visual information, but more research is needed to determine the effectiveness of visualization of articulators in pronunciation training. This thesis explores whether or not visual speech technology can give an improvement in learning pronunciation. This thesis investigates techniques for audiovisual speech synthesis, using both viseme-based and data-driven approaches to implement multiple talking heads. Intelligibility studies found the audiovisual heads to be more intelligible than audio alone, and the data-driven head was found to be more intelligible than the viseme-driven implementation. The talking heads are applied in a pronunciation-training application, which is evaluated by second-language learners to investigate the benefit of visual speech in technology-enhanced learning. User trials explored the efficacy of the software in demonstrating the /b/–/p/ contrast in English. The results indicate that learners showed an improvement in listening and pronunciation after using the software, while the benefit of visualization compared to auditory training alone varied between individuals. User evaluations found that the talking heads were perceived to be helpful in learning pronunciation, and the positive feedback on the tutoring system suggests that the use of talking heads in technology-enhanced learning could be useful in addition to traditional methods.
21

Nguyen, Duc Minh Chau. "Affordance learning for visual-semantic perception." Thesis, Edith Cowan University, Research Online, Perth, Western Australia, 2021. https://ro.ecu.edu.au/theses/2443.

Abstract:
Affordance Learning is linked to the study of interactions between robots and objects, including how robots perceive objects through scene understanding. This area has been popular in Psychology, which has recently come to influence Computer Vision. In this way, Computer Vision has borrowed the concept of affordance from Psychology in order to develop Visual-Semantic recognition systems and, in particular, to develop the capabilities of robots to interact with objects. However, existing systems of Affordance Learning are still limited to detecting and segmenting object affordances, which is called Affordance Segmentation. Further, these systems are not designed to develop specific abilities to reason about affordances. For example, a Visual-Semantic system for captioning a scene can extract information from an image, such as “a person holds a chocolate bar and eats it”, but does not highlight the affordances “hold” and “eat”. Indeed, these affordances and others commonly appear within all aspects of life, since affordances usually connect to actions (from a linguistic view, affordances are generally known as verbs in sentences). Due to the above-mentioned limitations, this thesis aims to develop systems of Affordance Learning for Visual-Semantic Perception. These systems can be built using Deep Learning, which has been empirically shown to be efficient for performing Computer Vision tasks. The thesis has two goals: (1) to study the key factors that contribute to the performance of Affordance Segmentation and (2) to reason about affordances (Affordance Reasoning) based on parts of objects for Visual-Semantic Perception. In terms of the first goal, the thesis mainly investigates the feature extraction module, as this is one of the earliest steps in learning to segment affordances. The thesis finds that the quality of feature extraction from images plays a vital role in improving the performance of Affordance Segmentation.
With regard to the second goal, the thesis infers affordances from object parts to reason about part-affordance relationships. Based on this approach, the thesis devises an Object Affordance Reasoning Network that can learn to construct relationships between affordances and object parts. As a result, reasoning about affordances becomes achievable through the generation of scene graphs of affordances and object parts. Empirical results, obtained from extensive experiments, show the potential of the system developed in this thesis towards Affordance Reasoning from Scene Graph Generation.
22

SANGUINETI, VALENTINA. "Audio-Visual Learning for Scene Understanding." Doctoral thesis, Università degli studi di Genova, 2022. http://hdl.handle.net/11567/1068960.

Abstract:
Multimodal deep learning aims to combine the complementary information of different modalities. Among all modalities, audio and video are the predominant ones that humans use to explore the world. In this thesis, we focus on audio-visual deep learning so that our networks mimic how humans perceive the world. Our research includes images, audio signals, and acoustic images. The latter provide spatial audio information and are obtained from a planar array of microphones by combining their raw audio signals with the beamforming algorithm. They better mimic the human auditory system, which cannot be replicated using just one microphone, since a single microphone cannot provide spatial sound cues. However, as microphone arrays are not widespread, we also study how to handle the missing spatialized audio modality at test time. As a solution, we propose to distill the content of acoustic images into audio features during training in order to handle their absence at test time. This is done for supervised audio classification using the generalized distillation framework, which we also extend to self-supervised learning. Next, we devise a method for reconstructing acoustic images given a single microphone and an RGB frame. Therefore, when only a standard video is available, we are able to synthesize spatial audio, which is useful for many audio-visual tasks, including sound localization. Lastly, as another example of restoring one modality from the available ones, we inpaint degraded images using audio features, so that the reconstructed missing region is not only visually plausible but also semantically consistent with the related sound. This also includes cross-modal generation in the limit case of a completely missing or hidden visual modality: our method handles it naturally, being able to generate images from sound.
In summary, we show how audio can help visual learning and vice versa, by transferring knowledge between the two modalities at training time, in order to distill, reconstruct, or restore the missing modality at test time.
23

Santolin, Chiara. "Learning Regularities from the Visual World." Doctoral thesis, Università degli studi di Padova, 2016. http://hdl.handle.net/11577/3424417.

Abstract:
Patterns of visual objects, streams of sounds, and spatiotemporal events are just a few examples of the structures present in a variety of sensory inputs. Amid such variety, numerous regularities can be found. In order to handle sensory processing, individuals of each species have to be able to rapidly track these regularities. Statistical learning is one of the principal mechanisms that make it possible to track patterns in the flow of sensory information, by detecting coherent relations between elements (e.g., A predicts B). Once relevant structures are detected, learners are sometimes required to generalize to novel situations. This process can be challenging since it requires abstracting away from the surface information and extracting structures from previously unseen stimuli. Over the past two decades, researchers have shown that statistical learning and generalization operate across domains, modalities, and species, supporting the generality assumption. These mechanisms, in fact, play a crucial role in organizing the sensory world and developing a representation of the environment. But when and how do organisms begin to track and generalize patterns from the environment? In the existing literature, very little is known about the roots of these mechanisms. The experiments described in this thesis were all designed to explore whether statistical learning and generalization of visual patterns are fully available at birth, using the newborn domestic chick (Gallus gallus) as an animal model. This species represents an excellent developmental model for the study of the ontogeny of several cognitive traits because it can be tested soon after hatching and allows complete manipulation of pre- and post-natal experience. In Chapter 2, four statistical learning experiments are described.
Through learning-by-exposure, visually naive chicks were familiarized with a computer-presented stream of objects defined by a statistical structure; in particular, transitional (conditional) probabilities linked together sequence elements (e.g., the cross predicts the circle 100% of the time). After exposure, the familiar structured sequence was compared with a random presentation (Experiment 1) or a novel, structured combination (Experiment 2) of the familiar shapes. Chicks successfully differentiated test sequences in both experiments. One relevant aspect of these findings is that the learning process is unsupervised. Despite the lack of reinforcement, mere exposure to the statistically defined input was sufficient to obtain a significant learning effect. Two additional experiments were designed to explore the complexity of the patterns that can be learned by this species. In particular, the aim of Experiments 3 and 4 was to investigate chicks’ ability to discriminate subtle differences in the distributional properties of the stimuli. New sequences were created: the familiar one was formed by pairs of shapes that always appear in the same order, whereas the unfamiliar stimulus was formed by shapes spanning the boundaries between familiar pairs (part-pairs). Unfamiliar part-pairs were created by joining the last element of a familiar pair and the first element of another (subsequent) familiar pair. The key difference between pairs and part-pairs lay in their probabilistic structure: being formed by the union of two familiar elements, part-pairs were experienced during familiarization but with a lower probability. In order to distinguish test sequences, chicks needed to detect a very small difference in the conditional probabilities characterizing the two stimuli. Unfortunately, the animals were unable to differentiate test sequences when formed by 8 (Experiment 3) or 6 (Experiment 4) elements.
My final goal would have been to discover whether chicks are actually able to pick up transitional probabilities or whether they simply track frequencies of co-occurrence. In Experiments 1 and 2, since the frequency of appearance of each shape was balanced across stimuli, it was impossible to tell whether chicks detected transitional probabilities (e.g., X predicts Y) or frequencies of co-occurrence (e.g., X and Y co-occur, but no predictive relation characterizes them) among elements. However, since the animals did not succeed in the first task, being unable to discriminate pairs vs. part-pairs, the data are inconclusive on this issue. Possible explanations and theoretical implications of these results are provided in the final chapter of this thesis. In Chapter 3, the two studies described were aimed at testing newborn chicks’ capacity to generalize patterns presented as strings of visual tokens. For instance, the pattern AAB can be defined as “two identical items (AA) followed by another one, different from the former (B)”. Patterns were presented as triplets of simultaneously visible shapes, arranged according to AAB, ABA (Experiment 5), ABB and BAA (Experiment 6). Using a training procedure, chicks were able to recognize the trained regularity when compared to another (neutral) regularity (for instance, AAB displayed as cross-cross-circle vs. ABA displayed as cross-circle-cross). Chicks were also capable of generalizing these patterns to novel exemplars composed of previously unseen elements (AAB vs. ABA implemented by hourglass-hourglass-arrow vs. hourglass-arrow-hourglass). A subsequent study (Experiment 6) was aimed at verifying whether the presence/absence of contiguous reduplicated elements (in AAB but not in ABA) may have facilitated learning and generalization in the previous task. All regularities comprised an adjacent repetition that gave the triplets asymmetrical structures (AAB vs. ABB and AAB vs. BAA).
Chicks discriminated pattern-following and pattern-violating novel test triplets instantiating all regularities employed in the study, suggesting that the presence/absence of an adjacent repetition was not a relevant cue to succeed in the task. Overall, the present research provides new data on statistical learning and generalization of visual regularities in a newborn animal model, revealing that these mechanisms fully operate at the very beginning of life. With regard to statistical learning, day-old chicks performed better than neonates but similarly to human infants. With regard to generalization, chicks’ performance is consistent with that shown by neonates in the linguistic domain. These findings suggest that newborn chicks may be predisposed to track visual regularities in their postnatal environment. Despite the very limited previous experience, after mere exposure to a structured input or a 3-day training session, significant learning and generalization effects were obtained, pointing to the presence of early predispositions serving the development of these cognitive abilities.
The sensory world is made up of a set of regularities. Sequences of syllables and musical notes, objects arranged in the visual environment, and sequences of events are just some of the types of patterns that characterize sensory input. The ability to detect these regularities is fundamental for acquiring certain properties of natural language (for example, syntax), learning sequences of actions (for example, sign language), discriminating complex environmental events, and planning behavior. Indeed, detecting regularities across a multiplicity of events makes it possible to anticipate and plan future actions, which are crucial aspects of adaptation to the environment. This learning mechanism, referred to in the literature as statistical learning, consists of detecting probability distributions in sensory input, that is, dependency relations among its components (for example, X predicts Y). As illustrated in the introductory chapter of this research, although it is one of the mechanisms responsible for learning human natural language, statistical learning does not appear to have evolved specifically to serve this function. It is a general cognitive process that operates across sensory domains (acoustic, visual, tactile), modalities (temporal or spatial-static), and species (human and non-human). Pattern detection therefore plays a fundamental role in processing sensory information, which is necessary for a correct representation of the environment. Once the regularities and structures present in the environment have been learned, living organisms must be able to generalize those structures to stimuli that are perceptually new but that represent the same regularities.
The crucial aspect of generalization is therefore the ability to recognize a familiar regularity even when it is implemented by new stimuli. Generalization also plays a fundamental role in learning the syntax of human natural language. Nevertheless, it is a domain-general mechanism and not a species-specific one. What was not clear from the literature was the ontogeny of both mechanisms, especially in the visual domain. In other words, it was unclear whether the abilities for statistical learning and generalization of visual structures are fully developed at birth. The main goal of the experiments conducted in this thesis was therefore to investigate the origins of visual statistical learning and generalization, using the domestic chick (Gallus gallus) as an animal model. Belonging to a precocial species, the newborn chick is almost completely autonomous for a range of behavioral functions, which makes it an ideal candidate for studying the ontogeny of various perceptual and cognitive abilities. The possibility of observing it immediately after birth, together with complete manipulation of the pre- and post-natal environment (through hatching and rearing under controlled conditions), makes the chick an excellent experimental model for studying the learning of regularities. The first series of experiments addressed statistical learning (Chapter 2). Using an experimental paradigm based on learning by exposure (filial imprinting), visually naive newborn chicks were exposed to a video sequence of arbitrary visual elements (geometric shapes). This stimulus is defined by a “statistical” structure based on transitional (conditional) probabilities that determine the order in which each element appears (for example, the square predicts the cross with a probability of 100%).
At the end of the exposure phase, the chicks recognized this sequence, discriminating it from unfamiliar sequences that consisted either of a random presentation of the same elements (that is, no element predicted the appearance of any other element; Experiment 1) or of a recombination of the same familiar elements according to new statistical patterns (for example, the square predicts the T with a probability of 100%, but the chicks had never experienced this statistical relation; Experiment 2). In both experiments the chicks discriminated the familiar sequence from the unfamiliar one, showing that they could recognize the statistical structure to which they had been exposed during the imprinting phase. One of the most fascinating aspects of this result is that the learning process is unsupervised: no reinforcement was given to the chicks during the exposure phase. Two further experiments (Experiments 3 and 4) were then conducted to test whether the chicks could learn regularities more complex than those tested previously. In particular, the chicks’ task was to differentiate a familiar sequence, structured similarly to the one just described, from an unfamiliar sequence composed of part-pairs, that is, pairs of shapes formed by joining the last shape of one familiar pair and the first shape of another familiar pair. Being formed by the union of elements belonging to familiar pairs, the part-pairs were experienced by the chicks during the familiarization phase, but with a lower probability than the pairs. The difficulty of the task therefore lies in detecting a subtle difference in the probability distributions of the two stimuli.
Unfortunately, the chicks were unable to discriminate the two sequences either when they were composed of 8 elements (Experiment 3) or of 6 elements (Experiment 4). The final goal of these two experiments would have been to discover the type of regularity learned by the chicks. Indeed, in Experiments 1 and 2 the chicks could have discriminated familiar and unfamiliar sequences on the basis of the co-occurrence frequencies of the shapes making up the familiar pairs (for example, the co-occurrence of X and Y) rather than on the basis of conditional probabilities (for example, X predicts Y). However, since the chicks did not pass the test presented in Experiments 3 and 4, the question of which type of statistical cue this species learns remains open. Possible explanations and theoretical implications of this non-significant result are discussed in the concluding chapter. The second group of experiments in this research investigates the generalization of visual regularities (Chapter 3). The regularities investigated are represented as strings of spatially arranged geometric shapes whose elements are visible simultaneously. For example, the regularity defined as AAB is described as a triplet in which the first two elements are identical to each other (AA), followed by another element that differs from them (B). The patterns used were AAB, ABA (Experiment 5), ABB, and BAA (Experiment 6), and the experimental procedure involved training with food reinforcement. Once they had learned to distinguish the reinforced pattern (for example, AAB implemented as cross-cross-circle) from the non-reinforced one (for example, ABA implemented as cross-circle-cross), the chicks had to recognize these structures when represented by new elements (for example, hourglass-hourglass-arrow vs. hourglass-arrow-hourglass). The animals proved capable of generalizing all the regularities to new exemplars.
The most important aspect of these results is what was shown in Experiment 6, whose goal was to investigate the possible learning strategies used by the animals in the previous study. Indeed, considering the AAB vs. ABA comparison, the chicks could have recognized (and generalized) the familiar pattern on the basis of the presence of a consecutive repetition of the same element (present in AAB but not in ABA, where the same element A is repeated and positioned at the two ends of the triplet). In Experiment 6, regularities characterized by repetitions were therefore compared: AAB vs. ABB and AAB vs. BAA. The chicks were still able to distinguish the new regularities and to generalize to new exemplars, suggesting that this ability is not limited to a particular type of configuration. Overall, the results obtained in this research constitute the first evidence of statistical learning and generalization of visual regularities in an animal model observed immediately after birth. With regard to statistical learning, the chicks show abilities comparable to those observed in other animal species and in human infants, but apparently superior to those observed in human newborns. Hypotheses and theoretical implications of these differences are presented in the concluding chapter. With regard to generalization, the chicks’ performance is in line with that shown by human newborns in the linguistic domain. In light of these results, it is plausible that the chick is biologically predisposed to detect the regularities that characterize its visual environment from the first moments of life.
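The distinction the abstract draws between transitional probabilities and co-occurrence frequencies can be made concrete with a small calculation. The sketch below is only an illustration: the shape names, the example stream, and the function name `pair_statistics` are made up and do not come from the thesis's stimuli. It estimates, for each adjacent pair (X, Y), the transitional probability P(Y | X) and the raw frequency of the pair in the stream.

```python
from collections import Counter


def pair_statistics(stream):
    """Return (transitional_probability, pair_frequency) for adjacent pairs.

    transitional_probability[(x, y)] = count(x followed by y) / count(x followed by anything)
    pair_frequency[(x, y)]           = count(x followed by y) / total number of adjacent pairs
    """
    first_counts = Counter(stream[:-1])             # elements that have a successor
    pair_counts = Counter(zip(stream, stream[1:]))  # adjacent (x, y) pairs
    total_pairs = len(stream) - 1
    tp = {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}
    freq = {pair: n / total_pairs for pair, n in pair_counts.items()}
    return tp, freq


# Hypothetical familiarization stream built from two fixed pairs,
# (cross, circle) and (square, triangle), concatenated in varying order.
stream = ["cross", "circle", "square", "triangle",
          "cross", "circle", "cross", "circle",
          "square", "triangle", "square", "triangle",
          "cross", "circle"]
tp, freq = pair_statistics(stream)
```

In this toy stream the within-pair transitions (cross to circle, square to triangle) have a transitional probability of 1.0, while transitions that span a pair boundary, such as circle to square, occur with lower probability. These boundary-spanning transitions correspond to what the abstract calls part-pairs.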
24

Durand, Thibaut. "Weakly supervised learning for visual recognition." Thesis, Paris 6, 2017. http://www.theses.fr/2017PA066142/document.

Abstract:
This thesis addresses the problem of image classification, where the goal is to predict whether a semantic category is present in an image from its visual content. To analyze images of complex scenes, it is important to learn localized representations. To limit the annotation cost during training, we focused on weakly supervised learning models. In this thesis, we propose models that simultaneously classify and localize objects, using only global labels during training. Weakly supervised learning reduces the annotation cost, but in return learning is more difficult. The main problem is how to aggregate local information (e.g., regions) into global information (e.g., the image). The main contribution of this thesis is the design of new pooling (aggregation) functions for weakly supervised learning. In particular, we propose a “max+min” pooling function, which unifies many pooling functions. We describe how to use this pooling in the Latent Structured SVM framework as well as in convolutional neural networks. To solve the optimization problems, we present several solvers, some of which can optimize a ranking metric such as Average Precision. Experimentally, we show the benefit of our models compared with state-of-the-art methods on ten standard image classification datasets, including ImageNet.
This thesis studies the problem of image classification, where the goal is to predict whether a semantic category is present in an image, based on its visual content. To analyze complex scenes, it is important to learn localized representations. To limit the cost of annotation during training, we have focused on weakly supervised learning approaches. In this thesis, we propose several models that simultaneously classify and localize objects, using only global labels during training. Weak supervision significantly reduces the cost of full annotation, but it makes learning more challenging. The key issue is how to aggregate local scores (e.g., regions) into a global score (e.g., image). The main contribution of this thesis is the design of new pooling functions for weakly supervised learning. In particular, we propose a “max + min” pooling function, which unifies many pooling functions. We describe how to use this pooling in the Latent Structured SVM framework as well as in convolutional networks. To solve the optimization problems, we present several solvers, some of which can optimize a ranking metric such as Average Precision. We experimentally show the benefit of our models with respect to state-of-the-art methods on ten standard image classification datasets, including the large-scale dataset ImageNet.
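One way to picture how region scores might be aggregated into an image score is sketched below. This is an assumption-laden illustration, not the thesis's definition: the function name `max_min_pool`, the parameters `k_max`, `k_min`, and `alpha`, and the exact combination rule are hypothetical choices. The sketch adds the mean of the highest-scoring regions to a weighted mean of the lowest-scoring regions.

```python
def max_min_pool(region_scores, k_max=1, k_min=1, alpha=1.0):
    """Aggregate per-region scores into one image-level score.

    Returns mean(top k_max scores) + alpha * mean(lowest k_min scores).
    The parameterization is an illustrative assumption.
    """
    n = len(region_scores)
    if not (1 <= k_max <= n and 1 <= k_min <= n):
        raise ValueError("k_max and k_min must be between 1 and the number of regions")
    ordered = sorted(region_scores, reverse=True)
    top = ordered[:k_max]      # strongest evidence for the class
    bottom = ordered[-k_min:]  # strongest evidence against the class
    return sum(top) / len(top) + alpha * sum(bottom) / len(bottom)


scores = [0.2, 2.5, -1.0, 0.7]                             # hypothetical region scores
combined = max_min_pool(scores)                            # max + min
max_only = max_min_pool(scores, alpha=0.0)                 # reduces to max pooling
average = max_min_pool(scores, k_max=len(scores), alpha=0.0)  # reduces to average pooling
```

With `alpha=0.0` this particular sketch reduces to max pooling (`k_max=1`) or average pooling (`k_max` equal to the number of regions), which is one simple sense in which a single parameterized function can cover several pooling rules.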
25

Dancette, Corentin. "Shortcut Learning in Visual Question Answering." Electronic Thesis or Diss., Sorbonne université, 2023. http://www.theses.fr/2023SORUS073.

Abstract:
This thesis focuses on the task of VQA, that is, visual question answering systems. We study the learning of biases in this task. Models tend to learn superficial correlations that lead them to correct answers in most cases but that can fail when they encounter unusual input data. We propose two methods to reduce shortcut learning in VQA. The first, RUBi, encourages the model to learn from the most difficult and least biased examples through a specific loss. We then propose SCN, a model for the visual counting task, with an architecture designed to be robust to distribution shifts. We then study multimodal shortcuts in VQA. We show that they are not only based on correlations between the question and the answer but can also involve information from the image. We design an evaluation benchmark to measure the robustness of models to multimodal shortcuts. Learning these shortcuts is particularly problematic when models are tested under distribution shift. This is why it is important to be able to assess the reliability of VQA models. We propose a method that allows them to abstain from answering when their confidence is too low. This method consists of training an external model, called a “selector”, to predict the confidence of the VQA model. We show that our method can improve the reliability of existing VQA models.
This thesis is focused on the task of VQA: it consists in answering textual questions about images. We investigate Shortcut Learning in this task: the literature reports the tendency of models to learn superficial correlations leading them to correct answers in most cases, but which can fail when encountering unusual input data. We first propose two methods to reduce shortcut learning on VQA. The first, which we call RUBi, consists of an additional loss to encourage the model to learn from the most difficult and less biased examples -- those which cannot be answered solely from the question. We then propose SCN, a model for the more specific task of visual counting, which incorporates architectural priors designed to make it more robust to distribution shifts. We then study the existence of multimodal shortcuts in the VQA dataset. We show that shortcuts are not only based on correlations between the question and the answer but can also involve image information. We design an evaluation benchmark to measure the robustness of models to multimodal shortcuts. We show that existing models are vulnerable to multimodal shortcut learning. The learning of those shortcuts is particularly harmful when models are evaluated in an out-of-distribution context. Therefore, it is important to evaluate the reliability of VQA models, i.e. We propose a method to improve their ability to abstain from answering when their confidence is too low. It consists of training an external ``selector'' model to predict the confidence of the VQA model. This selector is trained using a cross-validation-like scheme in order to avoid overfitting on the training set
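The abstention idea described in this abstract can be illustrated with a minimal sketch. It assumes that a selector has already produced a confidence in [0, 1] for each answer; the function names, the threshold value, and the coverage/risk summary are illustrative assumptions, not the thesis's implementation.

```python
from typing import List, Optional, Tuple


def answer_or_abstain(answer: str, selector_confidence: float,
                      threshold: float = 0.5) -> Optional[str]:
    """Return the answer, or None (abstain) when the confidence is too low.

    `selector_confidence` stands in for the output of an external selector
    model; `threshold` is a hypothetical tuning parameter.
    """
    if not 0.0 <= selector_confidence <= 1.0:
        raise ValueError("selector_confidence must lie in [0, 1]")
    return answer if selector_confidence >= threshold else None


def coverage_and_risk(predictions: List[Tuple[float, bool]],
                      threshold: float = 0.5) -> Tuple[float, float]:
    """Summarize a set of (confidence, is_correct) pairs.

    Coverage is the fraction of questions answered; risk is the error
    rate among the answered questions only.
    """
    answered = [ok for conf, ok in predictions if conf >= threshold]
    coverage = len(answered) / len(predictions) if predictions else 0.0
    risk = (1.0 - sum(answered) / len(answered)) if answered else 0.0
    return coverage, risk
```

Raising the threshold typically lowers coverage and, if the selector is well calibrated, lowers risk as well; that trade-off is what a selector of this kind is meant to control.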
26

Chen, Yifu. "Deep learning for visual semantic segmentation." Electronic Thesis or Diss., Sorbonne université, 2020. http://www.theses.fr/2020SORUS200.

Abstract:
In this thesis, we are interested in visual semantic segmentation, one of the high-level tasks that pave the way towards complete scene understanding. Specifically, it requires semantic understanding at the pixel level. With the success of deep learning in recent years, semantic segmentation problems are now tackled using deep architectures. In the first part, we focus on constructing a more appropriate loss function for semantic segmentation. More precisely, we define a novel loss function that employs a semantic edge detection network. This loss constrains pixel-level predictions to be consistent with the ground-truth semantic edge information, and thus leads to better-delineated segmentation results. In the second part, we address another important issue, namely training segmentation models without large amounts of fully annotated data. We propose a novel attribution method that identifies the regions of an image that classification networks consider most significant. We then integrate our attribution method into a weakly supervised segmentation framework. The semantic segmentation models can thus be trained with only image-level labels, which are easy to collect in large quantities. All models proposed in this thesis are thoroughly evaluated experimentally on multiple datasets, and the results are competitive with the literature.
27

Durand, Thibaut. "Weakly supervised learning for visual recognition." Electronic Thesis or Diss., Paris 6, 2017. http://www.theses.fr/2017PA066142.

Abstract:
This thesis studies the problem of image classification, where the goal is to predict whether a semantic category is present in an image, based on its visual content. To analyze complex scenes, it is important to learn localized representations. To limit the cost of annotation during training, we focus on weakly supervised learning approaches. We propose several models that simultaneously classify and localize objects, using only global labels during training. Weak supervision significantly reduces the cost of full annotation, but it makes learning more challenging. The key issue is how to aggregate local scores (e.g. for regions) into a global score (e.g. for the image). The main contribution of this thesis is the design of new pooling functions for weakly supervised learning. In particular, we propose a “max + min” pooling function, which unifies many pooling functions. We describe how to use this pooling in the Latent Structured SVM framework as well as in convolutional networks. To solve the optimization problems, we present several solvers, some of which make it possible to optimize a ranking metric such as Average Precision. We show experimentally the benefit of our models over state-of-the-art methods on ten standard image classification datasets, including the large-scale dataset ImageNet.
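To make the aggregation step concrete, here is a minimal sketch of one plausible reading of a "max + min" pooling function: the image score is the mean of the highest region scores plus a weighted mean of the lowest ones. The parameters `alpha`, `k_max` and `k_min` are assumptions introduced for illustration; the abstract does not give the exact form used in the thesis.

```python
def max_min_pool(region_scores, alpha=1.0, k_max=1, k_min=1):
    """Aggregate local region scores into one global image score.

    Computes mean(top k_max scores) + alpha * mean(bottom k_min scores).
    All three parameters are hypothetical knobs for this sketch.
    """
    n = len(region_scores)
    if n == 0:
        raise ValueError("need at least one region score")
    if not (1 <= k_max <= n and 1 <= k_min <= n):
        raise ValueError("k_max and k_min must lie in [1, number of regions]")
    ordered = sorted(region_scores, reverse=True)
    top = ordered[:k_max]          # strongest evidence for the class
    bottom = ordered[n - k_min:]   # strongest evidence against the class
    return sum(top) / k_max + alpha * sum(bottom) / k_min
```

In this simplified form, `alpha=0, k_max=1` reduces to max pooling and `alpha=0, k_max=len(region_scores)` reduces to average pooling, which gives a sense of how a single parameterized function can cover several standard pooling choices.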
28

De, Pasquale Roberto. "Visual discrimination learning and LTP-like changes in primary visual cortex." Doctoral thesis, Scuola Normale Superiore, 2009. http://hdl.handle.net/11384/85939.

29

Doyon, Julien. "Right temporal-lobe contribution to global visual processing and visual-cue learning." Thesis, McGill University, 1988. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=75696.

Abstract:
This thesis explores the visual functions of the right anterior temporal cortex of the human brain. In Part 1, 92 patients with unilateral temporal- or frontal-lobe excisions and 35 normal control subjects were tested under two experimental conditions (global, local) of a reaction-time task, employing hierarchically structured letters or designs as stimuli. In both versions, the right temporal-lobe group was less affected than other groups by interference from the global aspect of the stimulus. These findings support the hypothesis that the right temporal lobe contributes to global visual processing. In Part 2, the ability to learn a cue-system for discriminating between two targets against a background of visually similar items was examined in 107 patients with unilateral temporal- or frontal-lobe excisions and 37 control subjects, using three versions of a visual-cue learning task. With letters and nonsense syllables, all groups took longer to complete the task when the background information was changed after three learning trials. With abstract designs, only patients with right temporal-lobe lesions failed to show this interference effect after three learning trials, but did so after six. Hence, it is argued that the right temporal lobe plays a role in visual pattern-discrimination learning.
30

Gepperth, Alexander Rainer Tassilo. "Neural learning methods for visual object detection." [S.l.] : [s.n.], 2006. http://deposit.ddb.de/cgi-bin/dokserv?idn=981053998.

31

Qin, Lei. "Online machine learning methods for visual tracking." Thesis, Troyes, 2014. http://www.theses.fr/2014TROY0017/document.

Abstract:
We study the challenging problem of tracking an arbitrary object in video sequences with no prior knowledge other than a template annotated in the first frame. To tackle this problem, we build a robust real-time tracking system that can follow the target even under difficult conditions, such as when the target disappears from the scene and reappears (re-identification). The system consists of the following components. First, for image region representation, we propose some improvements to the region covariance descriptor: characteristics of the specific object are taken into consideration before constructing the descriptor, which can also adapt to variations in the target's appearance. Second, for building the object appearance model, we propose to combine the merits of generative and discriminative models by organizing them in a detection cascade. Specifically, generative models are deployed in the early layers to eliminate most easy candidates, whereas discriminative models are deployed in the later layers to distinguish the object from a few similar "distracters". Partial Least Squares Discriminant Analysis (PLS-DA) is employed for building the discriminative object appearance models. Third, for updating the generative models, we propose a weakly supervised model updating method based on cluster analysis using the mean-shift gradient density estimation procedure. Fourth, a novel online PLS-DA learning algorithm is developed for incrementally updating the discriminative models. The final tracking system, which integrates all these building blocks, exhibits good robustness to most challenges in visual tracking. Comparative results on challenging video sequences show that the proposed tracking system performs favorably with respect to a number of state-of-the-art methods.
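The detection cascade described above can be sketched as a sequence of filters, with cheap stages first and more selective stages later. This is a generic illustration under stated assumptions: the stage scoring functions and thresholds are placeholders, and the actual generative and PLS-DA models of the thesis are not reproduced here.

```python
def run_cascade(candidates, stages):
    """Filter candidates through an ordered list of (score_fn, threshold) stages.

    A candidate survives a stage when score_fn(candidate) >= threshold.
    Early stages are meant to discard easy negatives cheaply, so that
    later, costlier stages only see a few hard candidates.
    """
    survivors = list(candidates)
    for score_fn, threshold in stages:
        survivors = [c for c in survivors if score_fn(c) >= threshold]
        if not survivors:
            break  # nothing left to score
    return survivors


# Toy usage: each candidate carries two precomputed, hypothetical scores.
candidates = [
    {"id": 0, "generative": 0.1, "discriminative": 0.9},
    {"id": 1, "generative": 0.8, "discriminative": 0.2},
    {"id": 2, "generative": 0.9, "discriminative": 0.7},
]
stages = [
    (lambda c: c["generative"], 0.5),      # early layer: reject easy candidates
    (lambda c: c["discriminative"], 0.5),  # later layer: reject look-alikes
]
kept = run_cascade(candidates, stages)
```

In the toy data, candidate 0 is rejected by the first stage and candidate 1 by the second, so only candidate 2 remains.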
32

Pralle, Mandi Jo. "Visual design in the online learning environment." [Ames, Iowa : Iowa State University], 2007.

33

Hussain, Sibt Ul. "Machine Learning Methods for Visual Object Detection." Phd thesis, Université de Grenoble, 2011. http://tel.archives-ouvertes.fr/tel-00680048.

Abstract:
The goal of this thesis is to develop better practical methods for detecting common object classes in real world images. We present a family of object detectors that combine Histogram of Oriented Gradient (HOG), Local Binary Pattern (LBP) and Local Ternary Pattern (LTP) features with efficient Latent SVM classifiers and effective dimensionality reduction and sparsification schemes to give state-of-the-art performance on several important datasets including PASCAL VOC2006 and VOC2007, INRIA Person and ETHZ. The three main contributions are as follows. Firstly, we pioneer the use of Local Ternary Pattern features for object detection, showing that LTP gives better overall performance than HOG and LBP, because it captures both rich local texture and object shape information while being resistant to variations in lighting conditions. It thus works well both for classes that are recognized mainly by their structure and ones that are recognized mainly by their textures. We also show that HOG, LBP and LTP complement one another, so that an extended feature set that incorporates all three of them gives further improvements in performance. Secondly, in order to tackle the speed and memory usage problems associated with high-dimensional modern feature sets, we propose two effective dimensionality reduction techniques. The first, feature projection using Partial Least Squares, allows detectors to be trained more rapidly with negligible loss of accuracy and no loss of run time speed for linear detectors. The second, feature selection using SVM weight truncation, allows active feature sets to be reduced in size by almost an order of magnitude with little or no loss, and often a small gain, in detector accuracy. Despite its simplicity, this feature selection scheme outperforms all of the other sparsity enforcing methods that we have tested. 
Lastly, we describe work in progress on Local Quantized Patterns (LQP), a generalized form of local pattern features that uses lookup table based vector quantization to provide local pattern style pixel neighbourhood codings that have the speed of LBP/LTP and some of the flexibility and power of traditional visual word representations. Our experiments show that LQP outperforms all of the other feature sets tested including HOG, LBP and LTP.
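For readers unfamiliar with Local Ternary Patterns, the following sketch computes the LTP code of a single 3x3 pixel neighbourhood under the commonly used definition: each neighbour is coded +1, 0 or -1 depending on whether it lies above, within, or below a tolerance band around the centre value, and the ternary code is split into an "upper" and a "lower" binary pattern. The neighbour ordering and the tolerance `t` are assumptions of this sketch, and the thesis's exact variant may differ.

```python
def local_ternary_pattern(patch, t):
    """Return (upper_code, lower_code) for a 3x3 patch of grey values.

    A neighbour p sets a bit in `upper` when p >= centre + t, a bit in
    `lower` when p <= centre - t, and no bit when it falls inside the
    tolerance band. Both codes are integers in [0, 255].
    """
    centre = patch[1][1]
    # Neighbours visited clockwise, starting at the top-left corner.
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    upper = 0
    lower = 0
    for bit, (row, col) in enumerate(offsets):
        p = patch[row][col]
        if p >= centre + t:
            upper |= 1 << bit
        elif p <= centre - t:
            lower |= 1 << bit
    return upper, lower
```

The tolerance band is what makes the code less sensitive to small intensity fluctuations than a plain binary pattern, since near-equal neighbours contribute to neither code.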
34

Cabral, Ricardo da Silveira. "Unifying Low-Rank Models for Visual Learning." Research Showcase @ CMU, 2015. http://repository.cmu.edu/dissertations/506.

Abstract:
Many problems in signal processing, machine learning and computer vision can be solved by learning low-rank models from data. In computer vision, problems such as rigid structure from motion have been formulated as an optimization over subspaces with fixed rank. These hard-rank constraints have traditionally been imposed by a factorization that parameterizes subspaces as a product of two matrices of fixed rank. Whilst factorization approaches lead to efficient and kernelizable optimization algorithms, they have been shown to be NP-hard in the presence of missing data. Inspired by recent work in compressed sensing, hard-rank constraints have been replaced by soft-rank constraints, such as the nuclear norm regularizer. Unlike hard-rank approaches, soft-rank models are convex even in the presence of missing data. But how can convex optimization solve an NP-hard problem? This thesis addresses this question by analyzing the relationship between hard- and soft-rank constraints in the unsupervised factorization with missing data problem. Moreover, we extend soft-rank models to weakly supervised and fully supervised learning problems in computer vision. There are four main contributions of our work: (1) The analysis of a new unified low-rank model for matrix factorization with missing data. Our model subsumes soft- and hard-rank approaches and merges advantages from previous formulations, such as efficient algorithms and kernelization. It also provides justifications for the choice of algorithms and identifies regions that guarantee convergence to global minima. (2) A deterministic "rank continuation" strategy for the NP-hard unsupervised factorization with missing data problem, which is highly competitive with the state of the art and often achieves globally optimal solutions. In preliminary work, we show that this optimization strategy is applicable to other NP-hard problems which are typically relaxed to convex semidefinite programs (e.g., MAX-CUT, quadratic assignment problem). (3) A new soft-rank fully supervised robust regression model. This convex model is able to deal with noise, outliers and missing data in the input variables. (4) A new soft-rank model for weakly supervised image classification and localization. Unlike existing multiple-instance approaches for this problem, our model is convex.
35

Xu, Yang. "Cortical spatiotemporal plasticity in visual category learning." Research Showcase @ CMU, 2013. http://repository.cmu.edu/dissertations/272.

Abstract:
Central to human intelligence, visual categorization is a skill that is both remarkably fast and accurate. Although there have been numerous studies in primates regarding how information flows in inferior temporal (ITC) and prefrontal (PFC) cortices during online discrimination of visual categories, there has been little comparable research on the human cortex. To bridge this gap, this thesis explores how visual categories emerge in prefrontal cortex and the ventral stream, which is the human homologue of ITC. In particular, cortical spatiotemporal plasticity in visual category learning was investigated using behavioral experiments, magnetoencephalographic (MEG) imaging, and statistical machine learning methods. From a theoretical perspective, scientists drawing on work with non-human primates have posited that PFC plays a primary role in the encoding of visual categories. Much of the extant research in the cognitive neuroscience literature, however, emphasizes the role of the ventral stream. Despite their apparent incompatibility, no study has evaluated these theories in the human cortex by examining the roles of the ventral stream and PFC in online discrimination and acquisition of visual categories. To address this question, I conducted two learning experiments using visually similar categories as stimuli and recorded cortical responses using MEG, a neuroimaging technique that offers millisecond temporal resolution. Across both experiments, categorical information was found to be available during the period of cortical activity. Moreover, late in the learning process, this information is supplied increasingly in the ventral stream but less so in prefrontal cortex. These findings extend previous theories by suggesting that the ventral stream is crucial to long-term encoding of visual categories when categorical perception is proficient, but that PFC jointly encodes visual categories early on during learning.
From a methodological perspective, MEG is limited as a technique because it can lead to false discoveries in a large number of spatiotemporal regions of interest (ROIs) and, typically, can only coarsely reconstruct the spatial locations of cortical responses. To address the first problem, I developed an excursion algorithm that identified ROIs contiguous in time and space. I then used a permutation test to measure the global statistical significance of the ROIs. To address the second problem, I developed a method that incorporates domain-specific and experimental knowledge in the modeling process. Utilizing faces as a model category, I used a predefined “face” network to constrain the estimation of cortical activities by applying differential shrinkages to regions within and outside this network. I proposed and implemented a trial-partitioning approach which uses trials in the midst of learning for model estimation. Importantly, this renders localizing trials more precise in both the initial and final phases of learning. In summary, this thesis makes two significant contributions. First, it methodologically improves the way we can characterize the spatiotemporal properties of the human cortex using MEG. Second, it provides a combined theory of visual category learning by incorporating the large time scales that encompass the course of the learning.
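The permutation test mentioned in this abstract can be illustrated with a generic two-sample version. This sketch is not the thesis's ROI-level procedure: it only shows the underlying idea of comparing an observed statistic against its distribution under random relabelling. The function name, the number of permutations and the fixed seed are assumptions made for the example.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the absolute difference of means.

    Labels are shuffled `n_permutations` times; the p-value is the share
    of shuffles whose statistic is at least as large as the observed one,
    with the usual +1 correction so that it is never exactly zero.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible

    def mean(values):
        return sum(values) / len(values)

    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        statistic = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if statistic >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)
```

Because the null distribution is built from the data themselves, the test makes no distributional assumption, which is one reason such tests are popular for neuroimaging statistics.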
36

Ramachandran, Suchitra. "Visual Statistical Learning in Monkey Inferotemporal Cortex." Research Showcase @ CMU, 2014. http://repository.cmu.edu/dissertations/463.

Abstract:
Despite living in noisy sensory environments, humans and non-human primates have the ability to learn regularities and patterns in the environment solely on the basis of passive exposure. This ability to learn what is statistically likely and predictable in the environment is called statistical learning. Visual statistical learning of image sequences has been demonstrated at the level of single neurons in the rhesus macaque (monkey) inferotemporal cortex (IT). Upon subjecting monkeys to extensive exposure to pairs of images presented sequentially such that the display of one image always predicted the subsequent display of another image, IT neurons showed suppressed responses to images that occurred in a predicted context, but not when the same images occurred in an unpredicted context (Meyer & Olson, 2011). Upon investigating this effect, called prediction suppression, more thoroughly, we discovered that this effect depends on the conditional probability between the images presented sequentially. Further, the effect generalizes across time and space, it is domain specific, and it can be induced by training monkeys on longer sequences. These effects are long-lasting and robust: they persist at least for 20 months after initial training with no exposure to the stimuli in the interim. We have preliminary evidence for the existence of neurophysiological markers of statistical learning in areas upstream of IT in the ventral visual stream, suggesting that learning statistical regularities may be a fundamental function of sensory cortex.
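The conditional probability between sequentially presented images, on which the effect is said to depend, can be estimated from a stimulus sequence as follows. This is a small illustrative sketch; the function name and the toy sequence are assumptions and do not come from the thesis.

```python
from collections import Counter


def transition_probabilities(sequence):
    """Estimate P(next item | current item) from consecutive pairs."""
    pair_counts = Counter(zip(sequence, sequence[1:]))
    leading_counts = Counter(sequence[:-1])  # how often each item is followed
    probabilities = {}
    for (current, following), count in pair_counts.items():
        probabilities.setdefault(current, {})[following] = (
            count / leading_counts[current]
        )
    return probabilities


# Toy sequence: image "A" is usually, but not always, followed by "B".
probs = transition_probabilities(["A", "B", "A", "B", "A", "C"])
```

Here `probs["A"]["B"]` is 2/3 and `probs["B"]["A"]` is 1.0, so "A" after "B" is fully predictable while "B" after "A" is only likely.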
37

Frier, Helen Jane. "Compass orientation during visual learning by honeybees." Thesis, University of Sussex, 1996. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.321446.

38

Kodirov, Elyor. "Cross-class transfer learning for visual data." Thesis, Queen Mary, University of London, 2017. http://qmro.qmul.ac.uk/xmlui/handle/123456789/31852.

Abstract:
Automatic analysis of visual data is a key objective of computer vision research, and performing visual recognition of objects from images is one of the most important steps towards understanding and gaining insights into visual data. Most existing approaches to visual recognition in the literature are based on a supervised learning paradigm. Unfortunately, they require a large amount of labelled training data, which severely limits their scalability. Recognition, on the other hand, is instantaneous and effortless for humans. They can recognise a new object without seeing any visual samples, just by knowing its description and leveraging similarities between that description and previously learned concepts. Motivated by humans' recognition ability, this thesis proposes novel approaches to the cross-class transfer learning (cross-class recognition) problem, whose goal is to learn a model from seen classes (those with labelled training samples) that can generalise to unseen classes (those with labelled testing samples) without any training data, i.e., seen and unseen classes are disjoint. Specifically, the thesis studies and develops new methods for addressing three variants of cross-class transfer learning.
Chapter 3: The first variant is transductive cross-class transfer learning, meaning that the labelled training set and the unlabelled test set are both available for model learning. Considering the training set as the source domain and the test set as the target domain, typical cross-class transfer learning assumes that the source and target domains share a common semantic space, into which a visual feature vector extracted from an image can be embedded using an embedding function. Existing approaches learn this function from the source domain and apply it without adaptation to the target one. They are therefore prone to the domain shift problem: because the embedding function is only concerned with predicting the semantic representations of the seen training classes during learning, it may underperform when applied to the test data. In this thesis, a novel cross-class transfer learning (CCTL) method is proposed based on unsupervised domain adaptation. Specifically, a novel regularised dictionary learning framework is formulated in which the target class labels are used to regularise the learned target domain embeddings, thus effectively overcoming the projection domain shift problem.
Chapter 4: The second variant is inductive cross-class transfer learning, that is, only the training set is assumed to be available during model learning, resulting in a harder challenge than the previous one. Nevertheless, this setting reflects a real-world setting in which test data become available only after model learning. The main problem remains the same as in the previous variant: domain shift occurs when the model learned only from the training set is applied to the test set without adaptation. In this thesis, a semantic autoencoder (SAE) is proposed, building on an encoder-decoder paradigm. Specifically, a semantic space is first defined so that knowledge transfer is possible from the seen classes to the unseen classes. Then, an encoder aims to embed/project a visual feature vector into the semantic space. The decoder, however, imposes a generative task: the projection must be able to reconstruct the original visual features. The generative task forces the encoder to preserve richer information, so the encoder learned from seen classes is able to generalise better to new unseen classes.
Chapter 5: The third variant is unsupervised cross-class transfer learning. In this variant, no supervision is available for model learning, i.e., only unlabelled training data are available, leading to the hardest setting of the three. The goal, however, is the same: learning knowledge from the training data that can be transferred to test data whose labels are completely different from those of the training data. The thesis proposes a novel approach which requires no labelled training data yet is able to capture discriminative information. The proposed model is based on a new graph-regularised dictionary learning algorithm. By introducing an l1-norm graph regularisation term, instead of the conventional squared l2-norm, the model is robust against the outliers and noise typical of visual data. Importantly, the graph and the representation are learned jointly, which further alleviates the effects of data outliers. As an application, person re-identification is considered for this variant in this thesis.
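The recognition step that these variants share, embedding a visual feature into a semantic space and matching it against descriptions of unseen classes, can be sketched as a nearest-prototype lookup. The projection matrix `W`, the prototype vectors and the class names below are hypothetical placeholders; in the thesis the embedding is learned, whereas here it is simply given.

```python
import math


def project(W, x):
    """Embed feature vector x into the semantic space: returns W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_unseen(W, x, prototypes):
    """Return the unseen class whose semantic prototype is closest to W @ x."""
    embedded = project(W, x)
    return max(prototypes, key=lambda name: cosine(embedded, prototypes[name]))


# Toy usage: 3-D visual features, 2-D semantic space, two unseen classes.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
prototypes = {"zebra": [1.0, 0.0], "whale": [0.0, 1.0]}
label = classify_unseen(W, [0.9, 0.1, 5.0], prototypes)
```

No training image of either class is needed at this stage: the class prototypes come from side information such as attributes or text descriptions, which is what allows the transfer to classes without labelled training samples.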
39

Crowley, Elliott Joseph. "Visual recognition in art using machine learning." Thesis, University of Oxford, 2016. https://ora.ox.ac.uk/objects/uuid:d917f38e-64cb-4b09-9ccf-b081fe68b187.

Abstract:
This thesis is concerned with the problem of visual recognition in art - such as finding the objects (e.g. cars, cows and cathedrals) present in a painting, or identifying the subject of an oil portrait. Solving this problem is extremely beneficial to art historians, who are often interested in determining when an object first appeared in a painting or how the portrayal of an object has evolved over time. It allows them to avoid the unenviable task of finding paintings for study manually. However, visual recognition of art is a challenging problem, in part due to the lack of annotation in art. A solution is to train recognition models on natural, photographic images. These models have to overcome a domain shift when applied to art. Firstly, a thorough evaluation of the domain shift problem is conducted for the task of image classification in paintings; the performance of natural image-trained and painting-trained classifiers on a fixed set of paintings is compared for both shallow (Fisher Vectors) and deep image representations (Convolutional Neural Networks - CNNs) to examine the performance gap across domains. Then, we show that this performance gap can be ameliorated by classifying regions using detectors. We next consider the problem of annotating gods and animals on classical Greek vases, starting from a large dataset of images of vases with associated brief text descriptions. To solve this, we develop a weakly supervised learning approach to solve the correspondence problem between the descriptions and unknown image regions. Then, we study the problem of matching photos of a person to paintings of that person, in order to retrieve similar paintings given a query photo. We show that performance at this task can be improved substantially by learning with a combination of photos and paintings - either by learning a linear projection matrix common across facial identities, or by fine-tuning a CNN. Finally, we present several applications of this research. These include a system that learns object classifiers on-the-fly from images crawled off the web, and uses these to find a variety of objects in very large datasets of art. We show that this research has resulted in the discovery of over 250,000 new object annotations across 93,000 paintings on the public Art UK website.
40

Kashyap, Karan. "Learning digits via joint audio-visual representations." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/113143.

Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 59-60).
Our goal is to explore models for language learning in the manner that humans learn languages as children. Namely, children do not have intermediary text transcriptions in correlating visual and audio inputs from the environment; rather, they directly make connections between what they see and what they hear, sometimes even across languages! In this thesis, we present weakly-supervised models for learning representations of numerical digits between two modalities: speech and images. We experiment with architectures of convolutional neural networks taking in spoken utterances of numerical digits and images of handwritten digits as inputs. In nearly all cases we randomly initialize network weights (without pre-training) and evaluate the model's ability to return a matching image for a spoken input or to identify the number of overlapping digits between an utterance and an image. We also provide some visuals as evidence that our models are truly learning correspondences between the two modalities.
by Karan Kashyap.
M. Eng.
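The retrieval evaluation mentioned in the abstract above (returning a matching image for a spoken input) can be illustrated with a minimal sketch. It assumes that both modalities have already been mapped into a shared embedding space; the embedding values and the function names `cosine` and `retrieve` are hypothetical and are not taken from the thesis.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(audio_embedding, image_embeddings):
    """Return the index of the image embedding closest to the audio embedding."""
    scores = [cosine(audio_embedding, img) for img in image_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical 2-D embeddings standing in for network outputs.
images = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
spoken = [0.1, 0.9]
best = retrieve(spoken, images)
```

Cosine similarity is only one possible scoring choice; a dot product or a learned similarity would slot into `retrieve` in the same way.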
41

Gilja, Vikash. "Learning and applying model-based visual context." Thesis, Massachusetts Institute of Technology, 2004. http://hdl.handle.net/1721.1/33139.

Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004.
Includes bibliographical references (p. 53).
I believe that context's ability to reduce the ambiguity of an input signal makes it a vital constraint for understanding the real world. I specifically examine the role of context in vision and how a model-based approach can aid visual search and recognition. Through the implementation of a system capable of learning visual context models from an image database, I demonstrate the utility of the model-based approach. The system is capable of learning models for "water-horizon scenes" and "suburban street scenes" from a database of 745 images.
by Vikash Gilja.
M.Eng.
42

Woodley, Thomas Edward. "Visual tracking using offline and online learning." Thesis, University of Cambridge, 2010. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.608814.

43

Naha, Shujon. "Zero-shot Learning for Visual Recognition Problems." IEEE, 2015. http://hdl.handle.net/1993/31806.

Abstract:
In this thesis we discuss different aspects of zero-shot learning and propose solutions for three challenging visual recognition problems: 1) unknown object recognition from images, 2) novel action recognition from videos, and 3) unseen object segmentation. In all three problems, we have two different sets of classes: the “known classes”, which are used in the training phase, and the “unknown classes”, for which there is no training instance. Our proposed approach exploits the available semantic relationships between known and unknown object classes and uses them to transfer the appearance models from known object classes to unknown object classes in order to recognize unknown objects. We also propose an approach to recognize novel actions from videos by learning a joint model that links videos and text. Finally, we present a ranking-based approach for zero-shot object segmentation. We represent each unknown object class as a semantic ranking of all the known classes and use this semantic relationship to extend the segmentation model of known classes to segment unknown class objects.
October 2016
44

Rao, Anantha N. "Learning-based Visual Odometry - A Transformer Approach." University of Cincinnati / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1627658636420617.

45

Horn, Robert R. "Visual attention and information in observational learning." Thesis, Liverpool John Moores University, 2003. http://researchonline.ljmu.ac.uk/5624/.

46

White, Alan Daniel. "Visual-motor learning in minimally invasive surgery." Thesis, University of Leeds, 2016. http://etheses.whiterose.ac.uk/17321/.

Abstract:
The purpose of this thesis was to develop an in-depth understanding of motor control in surgery. This was achieved by applying current theories of sensorimotor learning and developing a novel experimental approach. A survey of expert opinion and a review of the existing literature identified several issues related to human performance and minimally invasive surgery (MIS). The approach of this thesis combined existing surgical training tools with state-of-the-art technology and adapted rigorous experimental psychology techniques (grounded in the principles of sensorimotor learning) within a controlled laboratory environment. Existing technology was incorporated into surgical scenarios via the Kinematic Assessment Tool (KAT) - an experimentally validated, powerful and portable system capable of providing accurate and repeatable measures of visual-motor performance. The KAT was first established as an appropriate means of assessing visual-motor performance; subsequently, it was shown to be valid for assessing MIS performance. Following this, the system was used to investigate whether the principles of ‘structural learning’ could be applied to MIS. The final experiment investigated whether a standardised, repeatable laparoscopic warm-up benefits MIS performance. These experiments demonstrated that the KAT system, combined with other existing technologies, can be used to investigate visual-motor performance. The results suggested that learning the control dynamics of the surgical instruments and variability in training are beneficial when presented with novel but similar tasks. These findings are consistent with structural learning theory. This thesis should inform current thinking on MIS training and performance and the future development of simulators, with more emphasis on introducing variability within tasks during training. Further investigation of the role of structural learning in MIS is required.
47

Hanwell, David. "Weakly supervised learning of visual semantic attributes." Thesis, University of Bristol, 2014. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.687063.

Abstract:
There are at present many billions of images on the internet, only a fraction of which are labelled according to their semantic content. To automatically provide labels for the rest, models of visual semantic concepts must be created. Such models are traditionally trained using images which have been manually acquired, segmented, and labelled. In this thesis, we submit that such models can be learned automatically using those few images which have already been labelled, either directly by their creators, or indirectly by their associated text. Such imagery can be acquired easily, cheaply, and in large quantities, using web image searches. Though there has been some work towards learning from such weakly labelled data, all methods yet proposed require more than a minimum of human effort. In this thesis we put forth a number of methods for reliably learning models of visual semantic attributes using only the raw, unadulterated results of web image searches. The proposed methods do not require any human input beyond specifying the names of the attributes to be learned. We also present means of identifying and localising learned attributes in challenging, real-world images. Our methods are of a probabilistic nature, and make extensive use of multivariate Gaussian mixture models to represent both data and learned models. The contributions of this thesis also include several tools for acquiring and comparing these distributions, including a novel clustering algorithm. We apply our weakly supervised learning methods to the training of models of a variety of visual semantic attributes including colour and pattern terms. Detection and localization of the learned attributes in unseen real-world images is demonstrated, and both quantitative and qualitative results are presented. We compare against other work, including both general methods of weakly supervised learning, and more attribute-specific methods.
We apply our learning methods to the training sets of previous works, and assess their performance on the test sets used by other authors. Our results show that our methods give better results than the current state of the art.
48

Hussain, Sabit ul. "Machine Learning Methods for Visual Object Detection." Thesis, Grenoble, 2011. http://www.theses.fr/2011GRENM070/document.

Abstract:
The aim of this thesis is to develop more effective practical methods for detecting instances of everyday object classes in images. We present a family of detectors that incorporate three types of powerful visual features, Histograms of Oriented Gradients (HOG), Local Binary Patterns (LBP) and Local Ternary Patterns (LTP), into efficient discriminative methods of the Latent SVM type, under two dimensionality reduction regimes: Partial Least Squares (PLS) and feature selection by SVM weight truncation. On several important datasets, notably PASCAL VOC2006 and VOC2007, INRIA Person and ETH Zurich, we show that our methods improve on the state of the art. Our main contributions are as follows. First, we study the LTP visual feature for object detection. We show that its overall performance is better than that of the well-established HOG and LBP features, because it encodes both the local texture of the object and its global shape while being resistant to variations in lighting. Thanks to these strengths, LTP works as well for classes characterised mainly by their structure as for those characterised by their textures. In addition, we show that the HOG, LBP and LTP features are complementary, so that an extended feature set incorporating all three further improves performance. Second, because high-performing visual feature sets are of fairly high dimension, we propose two dimensionality reduction methods to improve their speed and reduce their memory usage.
The first, based on Partial Least Squares projection, significantly reduces the training time of linear detectors, with no reduction in accuracy and no loss of run-time speed. The second, based on feature selection by SVM weight truncation, lets us reduce the number of active features by an order of magnitude with a minimal reduction, or even a small increase, in detector accuracy. Despite its simplicity, this feature selection method outperforms all the other approaches we tested. Finally, we describe our work in progress on a new kind of visual feature, Local Quantized Patterns (LQP). LQP generalises the existing LBP/LTP features by introducing a vector quantisation step. This gives flexibility and power analogous to those of "bag of words" visual recognition approaches, which are based on quantising considerably larger local image regions, without losing the simplicity and speed that characterise current local pattern approaches, because the results of the quantisation can be precomputed and stored in a table. LQP allows a considerable increase in the size of the feature's local support, and hence in its discriminative power. Our experiments indicate that it performs best of all the visual features tested, including HOG, LBP and LTP.
The goal of this thesis is to develop better practical methods for detecting common object classes in real world images. We present a family of object detectors that combine Histogram of Oriented Gradient (HOG), Local Binary Pattern (LBP) and Local Ternary Pattern (LTP) features with efficient Latent SVM classifiers and effective dimensionality reduction and sparsification schemes to give state-of-the-art performance on several important datasets including PASCAL VOC2006 and VOC2007, INRIA Person and ETHZ. The three main contributions are as follows. Firstly, we pioneer the use of Local Ternary Pattern features for object detection, showing that LTP gives better overall performance than HOG and LBP, because it captures both rich local texture and object shape information while being resistant to variations in lighting conditions. It thus works well both for classes that are recognized mainly by their structure and ones that are recognized mainly by their textures. We also show that HOG, LBP and LTP complement one another, so that an extended feature set that incorporates all three of them gives further improvements in performance. Secondly, in order to tackle the speed and memory usage problems associated with high-dimensional modern feature sets, we propose two effective dimensionality reduction techniques. The first, feature projection using Partial Least Squares, allows detectors to be trained more rapidly with negligible loss of accuracy and no loss of run time speed for linear detectors. The second, feature selection using SVM weight truncation, allows active feature sets to be reduced in size by almost an order of magnitude with little or no loss, and often a small gain, in detector accuracy. Despite its simplicity, this feature selection scheme outperforms all of the other sparsity enforcing methods that we have tested. 
Lastly, we describe work in progress on Local Quantized Patterns (LQP), a generalized form of local pattern features that uses lookup table based vector quantization to provide local pattern style pixel neighbourhood codings that have the speed of LBP/LTP and some of the flexibility and power of traditional visual word representations. Our experiments show that LQP outperforms all of the other feature sets tested, including HOG, LBP and LTP.
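To make the Local Ternary Pattern feature concrete, here is a minimal sketch of how a single 3x3 neighbourhood can be coded, following the commonly used LTP definition in which each neighbour is labelled +1, 0 or -1 relative to the centre pixel and the result is split into an "upper" and a "lower" binary code. The tolerance `t`, the neighbour ordering and the function name `ltp_codes` are assumptions chosen for illustration; this is not the thesis's implementation.

```python
def ltp_codes(patch, t=5):
    """Return (upper, lower) 8-bit codes for a 3x3 grayscale patch.

    A neighbour at least t above the centre sets a bit in `upper` (ternary +1),
    a neighbour at least t below the centre sets a bit in `lower` (ternary -1),
    and neighbours inside the tolerance band set no bit (ternary 0).
    """
    center = patch[1][1]
    # Clockwise from the top-left corner: an arbitrary but fixed ordering.
    neighbours = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                  patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    upper = lower = 0
    for bit, value in enumerate(neighbours):
        if value >= center + t:
            upper |= 1 << bit
        elif value <= center - t:
            lower |= 1 << bit
    return upper, lower

patch = [[10, 50, 52],
         [60, 50, 40],
         [49, 90, 55]]
codes = ltp_codes(patch, t=5)  # (176, 9)
```

Because the codes depend only on differences from the centre pixel, adding a constant brightness offset to the whole patch leaves them unchanged, which is one simple way to see the resistance to lighting variation mentioned in the abstract. In a detector, histograms of such codes over image cells would form the feature vector.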
49

Campanholo, Guizilini Vitor. "Non-Parametric Learning for Monocular Visual Odometry." Thesis, The University of Sydney, 2013. http://hdl.handle.net/2123/9903.

Abstract:
This thesis addresses the problem of incremental localization from visual information, a scenario commonly known as visual odometry. Current visual odometry algorithms are heavily dependent on camera calibration, using a pre-established geometric model to provide the transformation between input (optical flow estimates) and output (vehicle motion estimates) information. A novel approach to visual odometry is proposed in this thesis where the need for camera calibration, or even for a geometric model, is circumvented by the use of machine learning principles and techniques. A non-parametric Bayesian regression technique, the Gaussian Process (GP), is used to select the most probable transformation function hypothesis from input to output, based on training data collected prior to and during navigation. Other than eliminating the need for a geometric model and traditional camera calibration, this approach also allows for scale recovery even in a monocular configuration, and provides a natural treatment of uncertainties due to the probabilistic nature of GPs. Several extensions to the traditional GP framework are introduced and discussed in depth, and they constitute the core of the contributions of this thesis to the machine learning and robotics community. The proposed framework is tested in a wide variety of scenarios, ranging from urban and off-road ground vehicles to unconstrained 3D unmanned aircraft. The results show a significant improvement over traditional visual odometry algorithms, and also surpass results obtained using other sensors, such as laser scanners and IMUs. The incorporation of these results into a SLAM scenario, using an Exact Sparse Information Filter (ESIF), is shown to decrease global uncertainty by exploiting revisited areas of the environment. Finally, a technique for the automatic segmentation of dynamic objects is presented, as a way to increase the robustness of image information and further improve visual odometry results.
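The idea of learning the input-to-output mapping with a Gaussian Process can be illustrated with a minimal sketch of standard GP regression. The squared-exponential kernel, the noise level, the toy one-dimensional data and all function names here are assumptions for illustration; in the setting described above the inputs would be optical flow features and the outputs vehicle motion estimates, and the thesis's own extensions to the GP framework are not reproduced.

```python
import math

def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two feature vectors."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq / (2.0 * length ** 2))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def gp_predict(X, y, x_new, noise=1e-6, length=1.0):
    """Return the GP posterior mean and variance at x_new."""
    K = [[rbf(a, b, length) + (noise if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    k_star = [rbf(a, x_new, length) for a in X]
    alpha = solve(K, y)                      # K^{-1} y
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)                     # K^{-1} k_*
    var = rbf(x_new, x_new, length) - sum(k * w for k, w in zip(k_star, v))
    return mean, max(var, 0.0)

# Toy data: a 1-D "flow" feature mapped to a forward velocity.
X = [[0.0], [1.0], [2.0]]
y = [0.0, 0.5, 1.0]
mean_near, var_near = gp_predict(X, y, [1.0])
mean_far, var_far = gp_predict(X, y, [10.0])
```

The predictive variance is small near the training inputs and grows towards the prior variance far from them, which is the kind of built-in uncertainty estimate the abstract refers to.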
50

Liu, Li. "Learning discriminative feature representations for visual categorization." Thesis, University of Sheffield, 2015. http://etheses.whiterose.ac.uk/8239/.

Abstract:
Learning discriminative feature representations has attracted a great deal of attention due to its potential value and wide usage in a variety of areas, such as image/video recognition and retrieval, human activity analysis, intelligent surveillance and human-computer interaction. In this thesis we first introduce a new boosted key-frame selection scheme for action recognition. Specifically, we propose to select a subset of key poses for the representation of each action via AdaBoost, and a new classifier, namely WLNBNN, is then developed for final classification. The experimental results of the proposed method are 0.6% - 13.2% better than previous work. After that, a domain-adaptive learning approach based on multiobjective genetic programming (MOGP) is developed for image classification. In this method, a set of primitive 2-D operators are randomly combined to construct feature descriptors through MOGP evolution and then evaluated by two objective fitness criteria, i.e., the classification error and the tree complexity. The (near-)optimal feature descriptor can then be obtained. The proposed approach achieves 0.9% - 25.9% better performance compared with state-of-the-art methods. Moreover, effective dimensionality reduction algorithms have also been widely used for obtaining better representations. In this thesis, we propose a novel linear unsupervised algorithm, termed Discriminative Partition Sparsity Analysis (DPSA), which explicitly considers the different probabilistic distributions that exist over the data points while simultaneously preserving the natural locality relationship among the data. All of the above methods have been systematically evaluated on several public datasets, showing their accurate and robust performance (0.44% - 6.69% better than previous results) for action and image categorization.
Targeting efficient image classification, we also introduce a novel unsupervised framework termed evolutionary compact embedding (ECE), which can automatically learn task-specific binary hash codes. It is regarded as an optimization algorithm that combines genetic programming (GP) and a boosting trick. The experimental results show that ECE significantly outperforms other methods, by 1.58% - 2.19%, on classification tasks. In addition, a supervised framework, bilinear local feature hashing (BLFH), is proposed to learn highly discriminative binary codes on the local descriptors for large-scale image similarity search. We formulate it as a nonconvex optimization problem that seeks orthogonal projection matrices for hashing, which can successfully preserve the pairwise similarity between different local features and simultaneously take image-to-class (I2C) distances into consideration. BLFH produces outstanding results (0.017% - 0.149% better) compared to state-of-the-art hashing techniques.