To see the other types of publications on this topic, follow the link: Computer vision, object detection, action recognition.

Dissertations / Theses on the topic 'Computer vision, object detection, action recognition'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Computer vision, object detection, action recognition.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

Anwer, Rao Muhammad. "Color for Object Detection and Action Recognition." Doctoral thesis, Universitat Autònoma de Barcelona, 2013. http://hdl.handle.net/10803/120224.

Full text
Abstract:
Recognizing object categories in real-world images is a challenging problem in computer vision. The deformable part-based framework is currently the most successful approach for object detection. Generally, HOG descriptors are used for image representation within the part-based framework. For action recognition, the bag-of-words framework has been shown to provide promising results; within it, local image patches are described by the SIFT descriptor. In contrast to object detection and action recognition, combining color and shape has been shown to provide the best performance for object and scene recognition. In the first part of this thesis, we analyze the problem of person detection in still images. Standard person detection approaches rely on intensity-based features for image representation while ignoring color. Channel-based description is one of the most commonly used approaches in object recognition, which inspires us to evaluate incorporating color information using channel-based fusion for the task of person detection. In the second part of the thesis, we investigate the problem of object detection in still images. Due to its high dimensionality, channel-based fusion increases the computational cost. Moreover, channel-based fusion has been found to obtain inferior results for object categories where one of the visual cues varies significantly. On the other hand, late fusion is known to provide improved results for a wide range of object categories. A consequence of the late fusion strategy is the need for a pure color descriptor. Therefore, we propose to use color attributes as an explicit color representation for object detection. Color attributes are compact and computationally efficient. Consequently, color attributes are combined with traditional shape features, providing excellent results for the object detection task. Finally, we focus on the problem of action detection and classification in still images. We investigate the potential of color for action classification and detection, evaluate different fusion approaches for combining color and shape information for action recognition, and perform an analysis to validate the contribution of color. Our results clearly demonstrate that combining color and shape information significantly improves the performance of both action classification and detection in still images.
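As a minimal sketch of the late-fusion idea described in this abstract — computing a pure color descriptor separately from the shape features and concatenating the two — the following Python snippet pairs a HOG descriptor with a per-channel color histogram. The window size, bin counts and the histogram itself are illustrative stand-ins, not the thesis's actual color attributes.

```python
import numpy as np
from skimage.feature import hog

def color_histogram(window, bins=8):
    # A pure color descriptor: per-channel histograms, L1-normalized.
    hists = [np.histogram(window[..., c], bins=bins, range=(0, 255))[0]
             for c in range(window.shape[-1])]
    h = np.concatenate(hists).astype(float)
    return h / (h.sum() + 1e-8)

def fused_descriptor(window):
    # Late fusion: shape (HOG on luminance) and color are computed
    # separately and concatenated, keeping each part low-dimensional.
    gray = window.mean(axis=-1)
    shape_feat = hog(gray, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    return np.concatenate([shape_feat, color_histogram(window)])

window = (np.random.rand(64, 64, 3) * 255).astype(np.uint8)
print(fused_descriptor(window).shape)
```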
APA, Harvard, Vancouver, ISO, and other styles
2

Friberg, Oscar. "Recognizing Semantics in Human Actions with Object Detection." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-212579.

Full text
Abstract:
Two-stream convolutional neural networks are currently one of the most successful approaches for human action recognition. They separate spatial and temporal information into a spatial stream and a temporal stream: the spatial stream accepts a single RGB frame, while the temporal stream accepts a sequence of optical flow. There have been attempts to extend the two-stream framework further, for instance with a third network for auxiliary information, which is the main focus of this thesis. We seek to extend the two-stream convolutional neural network by introducing a semantic stream based on object detection systems. Two contributions are made in this thesis: first, we show that this semantic stream can provide slight improvements over two-stream convolutional neural networks for human action recognition on standard benchmarks. Secondly, we explore divergence enhancement techniques that force the new semantic stream to complement the spatial and temporal streams by modifying the loss function during training. Slight gains are seen using these divergence enhancement techniques.
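A schematic PyTorch sketch of the kind of loss modification the abstract describes: a per-stream classification term plus a penalty that pushes the semantic stream's prediction away from the other two so it learns complementary evidence. The KL-based divergence term and its weight are assumptions for illustration, not the thesis's exact formulation.

```python
import torch
import torch.nn.functional as F

def three_stream_loss(spatial_logits, temporal_logits, semantic_logits,
                      labels, div_weight=0.1):
    # Standard cross-entropy on each stream.
    ce = (F.cross_entropy(spatial_logits, labels)
          + F.cross_entropy(temporal_logits, labels)
          + F.cross_entropy(semantic_logits, labels))
    # Divergence enhancement (illustrative): subtracting the KL term
    # rewards the semantic stream for differing from the averaged
    # two-stream prediction.
    log_p_sem = F.log_softmax(semantic_logits, dim=1)
    p_avg = F.softmax((spatial_logits + temporal_logits) / 2, dim=1)
    kl = F.kl_div(log_p_sem, p_avg, reduction='batchmean')
    return ce - div_weight * kl

logits = [torch.randn(8, 10) for _ in range(3)]
labels = torch.randint(0, 10, (8,))
print(three_stream_loss(*logits, labels))
```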
APA, Harvard, Vancouver, ISO, and other styles
3

Kalogeiton, Vasiliki. "Localizing spatially and temporally objects and actions in videos." Thesis, University of Edinburgh, 2018. http://hdl.handle.net/1842/28984.

Full text
Abstract:
The rise of deep learning has facilitated remarkable progress in video understanding. This thesis addresses three important tasks of video understanding: video object detection, joint object and action detection, and spatio-temporal action localization. Object class detection is one of the most important challenges in computer vision. Object detectors are usually trained on bounding-boxes from still images. Recently, video has been used as an alternative source of data. Yet, training an object detector on one domain (either still images or videos) and testing on the other results in a significant performance gap compared to training and testing on the same domain. In the first part of this thesis, we examine the reasons behind this performance gap. We define and evaluate several domain shift factors: spatial location accuracy, appearance diversity, image quality, aspect distribution, and object size and camera framing. We examine the impact of these factors by comparing detection performance before and after cancelling them out. The results show that all five factors affect the performance of the detectors and that their combined effect explains the performance gap. While most existing approaches for detection in videos focus on objects or human actions separately, in the second part of this thesis we aim at detecting non-human-centric actions, i.e., objects performing actions, such as cat eating or dog jumping. We introduce an end-to-end multitask objective that jointly learns object-action relationships. We compare it with different training objectives, validate its effectiveness for detecting object-action pairs in videos, and show that both tasks of object and action detection benefit from this joint learning. In experiments on the A2D dataset [Xu et al., 2015], we obtain state-of-the-art results on segmentation of object-action pairs. In the third part, we are the first to propose an action tubelet detector that leverages the temporal continuity of videos instead of operating at the frame level, as state-of-the-art approaches do. In the same way that modern detectors rely on anchor boxes, our tubelet detector is based on anchor cuboids: it takes a sequence of frames as input and outputs tubelets, i.e., sequences of bounding boxes with associated scores. Our tubelet detector outperforms all state-of-the-art methods on the UCF-Sports [Rodriguez et al., 2008], J-HMDB [Jhuang et al., 2013a], and UCF-101 [Soomro et al., 2012] action localization datasets, especially at high overlap thresholds. The improvement in detection performance is explained by both more accurate scores and more precise localization.
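To make the anchor-cuboid idea concrete, here is a tiny data-structure sketch: an anchor cuboid is one anchor box replicated across the K input frames, and a tubelet is the per-frame sequence of regressed boxes with a single score. Shapes and names are illustrative only.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Tubelet:
    boxes: np.ndarray  # (K, 4): one (x1, y1, x2, y2) box per frame
    score: float       # one classification score for the whole tubelet

def anchor_cuboid(box, num_frames):
    # An anchor cuboid replicates one anchor box across K frames;
    # the detector regresses per-frame offsets from it.
    return np.tile(np.asarray(box, dtype=float), (num_frames, 1))

print(anchor_cuboid((10, 10, 50, 80), num_frames=6).shape)  # (6, 4)
```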
APA, Harvard, Vancouver, ISO, and other styles
4

Ranalli, Lorenzo. "Studio ed implementazione di un modello di Action Recognition. Classificazione delle azioni di gioco e della tipologia di colpi durante un match di Tennis." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021.

Find full text
Abstract:
Machine Learning and sport are increasingly cementing their partnership. Whether in individual sports, team sports, or sports at more or less professional levels, a smart component is ever more present, emerging both in refereeing and in virtual coaching. It is precisely in the Virtual Coaching field that the idea of IConsulting sits: with mAIcoach it seeks to redefine the rules of tennis training, assisting the athlete and guiding them in the correct execution of movements. More specifically, the idea is to convey a mathematical method through a smart system for evaluating the tennis player. Users can submit videos of their training sessions and receive advice and constructive criticism in order to improve their posture and strokes.
APA, Harvard, Vancouver, ISO, and other styles
5

Liu, Chang. "Human motion detection and action recognition." HKBU Institutional Repository, 2010. http://repository.hkbu.edu.hk/etd_ra/1108.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Ta, Anh Phuong. "Inexact graph matching techniques : application to object detection and human action recognition." Lyon, INSA, 2010. http://theses.insa-lyon.fr/publication/2010ISAL0099/these.pdf.

Full text
Abstract:
Object detection and human action recognition are two active fields of research in computer vision, with applications ranging from robotics, video surveillance, medical image analysis and human-computer interaction to content-based video annotation and retrieval. Building robust recognition systems remains a very challenging task because of variations within action and object classes, different possible viewpoints, illumination changes, moving cameras, complex dynamic backgrounds and occlusions. In this thesis, we deal with object and activity recognition problems. Despite differences in the applications' goals, the associated fundamental problems share numerous properties, for instance the necessity of handling non-rigid transformations. Describing a model object or a video by a set of local features, we formulate the recognition problem as a graph matching problem, where nodes represent local features and edges represent spatial and/or spatio-temporal relationships between them. Inexact matching of valued graphs is a well-known NP-hard problem, so we concentrate on finding approximate solutions. To this end, the graph matching problem is formulated as an energy minimization problem. Based on this energy function, we propose two different solutions for the two applications: object detection in images and activity recognition in video sequences. We also propose new features to improve the conventional bag-of-words model, which is widely used in computer vision. Experiments on both standard datasets and our own datasets demonstrate that our methods compare well with the recent state of the art in both domains.
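A toy sketch of the energy-minimization view of inexact graph matching described above: a unary term compares local descriptors, a pairwise term scores spatial relations, and iterated conditional modes (one plausible approximate minimizer, chosen here for brevity) greedily improves the assignment.

```python
import numpy as np

def matching_energy(assign, unary, pairwise):
    # assign[i] = index of the scene feature matched to model node i.
    e = sum(unary[i, assign[i]] for i in range(len(assign)))
    for i in range(len(assign)):
        for j in range(i + 1, len(assign)):
            e += pairwise[i, j, assign[i], assign[j]]
    return e

def icm_match(unary, pairwise, iters=10):
    # Iterated conditional modes: update one node's match at a time,
    # keeping the others fixed, until iters sweeps are done.
    n, m = unary.shape
    assign = list(unary.argmin(axis=1))
    for _ in range(iters):
        for i in range(n):
            assign[i] = min(range(m),
                            key=lambda k: matching_energy(
                                assign[:i] + [k] + assign[i + 1:],
                                unary, pairwise))
    return assign

rng = np.random.default_rng(0)
unary = rng.random((4, 6))                 # descriptor distances
pairwise = 0.1 * rng.random((4, 4, 6, 6))  # spatial-relation costs
print(icm_match(unary, pairwise))
```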
APA, Harvard, Vancouver, ISO, and other styles
7

Dittmar, George William. "Object Detection and Recognition in Natural Settings." PDXScholar, 2013. https://pdxscholar.library.pdx.edu/open_access_etds/926.

Full text
Abstract:
Much research as of late has focused on biologically inspired vision models that are based on our understanding of how the visual cortex processes information. One prominent example of such a system is HMAX [17]. HMAX attempts to simulate the biological process for object recognition in cortex based on the model proposed by Hubel & Wiesel [10]. This thesis investigates the ability of an HMAX-like system (GLIMPSE [20]) to perform object-detection in cluttered natural scenes. I evaluate these results using the StreetScenes database from MIT [1, 8]. This thesis addresses three questions: (1) Can the GLIMPSE-based object detection system replicate the results on object-detection reported by Bileschi using HMAX? (2) Which features computed by GLIMPSE lead to the best object-detection performance? (3) What effect does elimination of clutter in the training sets have on the performance of our system? As part of this thesis, I built an object detection and recognition system using GLIMPSE [20] and demonstrate that it approximately replicates the results reported in Bileschi's thesis. In addition, I found that extracting and combining features from GLIMPSE using different layers of the HMAX model gives the best overall invariance to position, scale and translation for recognition tasks, but comes with a much higher computational overhead. Further contributions include the creation of modified training and test sets based on the StreetScenes database, with removed clutter in the training data and extending the annotations for the detection task to cover more objects of interest that were not in the original annotations of the database.
APA, Harvard, Vancouver, ISO, and other styles
8

Higgs, David Robert. "Parts-based object detection using multiple views /." Link to online version, 2005. https://ritdml.rit.edu/dspace/handle/1850/1000.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Pan, Xiang. "Approaches for edge detection, pose determination and object representation in computer vision." Thesis, Heriot-Watt University, 1994. http://hdl.handle.net/10399/1378.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Tonge, Ashwini Kishor. "Object Recognition Using Scale-Invariant Chordiogram." Thesis, University of North Texas, 2017. https://digital.library.unt.edu/ark:/67531/metadc984116/.

Full text
Abstract:
This thesis describes an approach for object recognition using the chordiogram shape-based descriptor. Global shape representations are highly susceptible to clutter generated by the background or other irrelevant objects in real-world images. To overcome this problem, we extract a precise object shape using superpixel segmentation, perceptual grouping, and connected components. The chordiogram descriptor is based on the geometric relationships of chords generated from pairs of boundary points of an object. It captures holistic properties of the shape and has proven suitable for object detection and digit recognition; additionally, it is translation invariant and robust to shape deformations. In spite of these properties, the chordiogram is not scale-invariant. To this end, we propose scale-invariant chordiogram descriptors and aim to achieve similar performance before and after applying scale invariance. Our experiments show that we achieve similar performance with and without scale invariance for silhouettes and real-world object images. We also show experiments at different scales to confirm that we obtain scale invariance for the chordiogram.
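A toy version of the scale normalization discussed here: build a 2D histogram over chord lengths and orientations from pairs of boundary points, dividing lengths by the mean chord length so that a rescaled shape yields the same descriptor. Bin counts and the normalizer are assumptions; the real chordiogram also bins the boundary normals.

```python
import numpy as np

def chordiogram(boundary, length_bins=8, angle_bins=8):
    # All chords between pairs of boundary points.
    d = boundary[None, :, :] - boundary[:, None, :]
    iu = np.triu_indices(len(boundary), k=1)
    dx, dy = d[..., 0][iu], d[..., 1][iu]
    lengths = np.hypot(dx, dy)
    angles = np.arctan2(dy, dx)
    # Scale invariance: normalize by the mean chord length (assumption).
    lengths = lengths / (lengths.mean() + 1e-8)
    h, _, _ = np.histogram2d(lengths, angles,
                             bins=(length_bins, angle_bins),
                             range=[[0, 3], [-np.pi, np.pi]])
    return (h / h.sum()).ravel()

theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
print(np.allclose(chordiogram(circle), chordiogram(5 * circle)))  # True
```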
APA, Harvard, Vancouver, ISO, and other styles
11

Case, Isaac. "Automatic object detection and tracking in video /." Online version of thesis, 2010. http://hdl.handle.net/1850/12332.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

Clark, Daniel S. "Object detection and tracking using a parts-based approach /." Link to online version, 2005. https://ritdml.rit.edu/dspace/handle/1850/1167.

Full text
APA, Harvard, Vancouver, ISO, and other styles
13

Naha, Shujon. "Zero-shot Learning for Visual Recognition Problems." IEEE, 2015. http://hdl.handle.net/1993/31806.

Full text
Abstract:
In this thesis we discuss different aspects of zero-shot learning and propose solutions for three challenging visual recognition problems: 1) unknown object recognition from images, 2) novel action recognition from videos, and 3) unseen object segmentation. In all three problems, we have two different sets of classes: the "known classes", which are used in the training phase, and the "unknown classes", for which there is no training instance. Our proposed approach exploits the available semantic relationships between known and unknown object classes and uses them to transfer the appearance models from known to unknown object classes in order to recognize unknown objects. We also propose an approach to recognize novel actions from videos by learning a joint model that links videos and text. Finally, we present a ranking-based approach for zero-shot object segmentation, where each unknown object class is represented as a semantic ranking of all the known classes, and this semantic relationship is used to extend the segmentation model of known classes to segment unknown-class objects.
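The transfer step described above can be caricatured in a few lines: score an unknown class by combining the known-class classifier outputs, weighted by semantic similarity to the unknown class. The numbers and the weighting scheme are purely illustrative.

```python
import numpy as np

def zero_shot_score(known_scores, semantic_sim):
    # known_scores: classifier outputs for the known classes on one image.
    # semantic_sim: similarity of each known class to the unknown class
    # (e.g. from word embeddings), used as transfer weights.
    w = semantic_sim / semantic_sim.sum()
    return float(np.dot(w, known_scores))

known_scores = np.array([0.9, 0.1, 0.4])   # e.g. horse, car, cow detectors
semantic_sim = np.array([0.8, 0.05, 0.6])  # similarity to "zebra"
print(zero_shot_score(known_scores, semantic_sim))
```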
APA, Harvard, Vancouver, ISO, and other styles
14

Liu, X. (Xin). "Human motion detection and gesture recognition using computer vision methods." Doctoral thesis, Oulun yliopisto, 2019. http://urn.fi/urn:isbn:9789526222011.

Full text
Abstract:
Gestures are present in most daily human activities, and automatic gesture analysis is a significant topic whose goal is to make interaction between humans and computers as natural as communication between humans. From a computer vision perspective, a gesture analysis system is typically composed of two stages: a low-level stage for human motion detection and a high-level stage for understanding human gestures. This thesis therefore contributes to research on gesture analysis from two aspects: 1) detection: human motion segmentation from video sequences, and 2) understanding: gesture cue extraction and recognition. In the first part of this thesis, two human motion detection methods based on sparse signal recovery are presented. In real videos the foreground (human motion) pixels are usually not randomly distributed but exhibit group properties in both the spatial and temporal domains. Based on this observation, a spatio-temporal group sparsity recovery model is proposed, which explicitly considers the foreground pixels' group clustering priors of spatial coherence and temporal contiguity. Moreover, a pixel should be treated as a multi-channel signal: if a pixel equals its adjacent ones, all three RGB coefficients should be equal. Motivated by this observation, a multi-channel fused lasso regularizer is developed to exploit the smoothness of multi-channel signals. In the second part of this thesis, two human gesture recognition methods are presented to resolve the issue of temporal dynamics, which is crucial to the interpretation of observed gestures. In the first study, a gesture skeletal sequence is characterized by a trajectory on a Riemannian manifold, and a time-warping invariant metric on the manifold is proposed. Furthermore, a sparse coding scheme for skeletal trajectories is presented that explicitly considers the labelling information, with the aim of enforcing the discriminative power of the dictionary. In the second work, based on the observation that a gesture is a time series with distinctly defined phases, a low-rank matrix decomposition model is proposed to build temporal compositions of gestures. In this way, a more appropriate alignment of hidden states for a hidden Markov model can be achieved.
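As a point of reference, the multi-channel fused lasso objective described above can be written, in one plausible form (the exact weighting and neighborhood structure are assumptions, not taken from the thesis):

```latex
\min_{X}\; \tfrac{1}{2}\,\lVert Y - X \rVert_F^2
  \;+\; \lambda_1 \sum_{i} \lVert x_i \rVert_2
  \;+\; \lambda_2 \sum_{c=1}^{3} \sum_{(i,j) \in \mathcal{N}} \lvert x_{c,i} - x_{c,j} \rvert
```

where Y holds the observations, x_i collects the three RGB coefficients of pixel i (so the group norm switches whole pixels on or off together), and N is the set of spatially or temporally neighboring pixel pairs whose differences the fused term smooths.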
APA, Harvard, Vancouver, ISO, and other styles
15

Prokaj, Jan. "DETECTING CURVED OBJECTS AGAINST CLUTTERED BACKGROUNDS." Master's thesis, University of Central Florida, 2008. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/2847.

Full text
Abstract:
Detecting curved objects against cluttered backgrounds is a hard problem in computer vision. We present new low-level and mid-level features designed to function in these environments. The low-level features are fast to compute because they employ an integral image approach, which makes them especially useful in real-time applications. The mid-level features are built from low-level features and are optimized for curved object detection. The usefulness of these features is tested by building an object detection algorithm on top of them. Detection is accomplished by transforming the mid-level features into weak classifiers, which then produce a strong classifier using AdaBoost. The resulting strong classifier is tested on the problem of detecting heads with shoulders. On a database of over 500 images of people, cropped to contain head and shoulders and with a diverse set of backgrounds, the detection rate is 90%, while the false positive rate on a database of 500 negative images is less than 2%.
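The integral-image trick that makes these low-level features fast is worth spelling out: after one cumulative-sum pass, any rectangle sum costs four lookups. A minimal sketch:

```python
import numpy as np

def integral_image(img):
    # Cumulative sum over both axes; ii[y, x] = sum of img[:y, :x].
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, top, left, height, width):
    # Any rectangle sum in O(1) from four integral-image corners.
    b, r = top + height, left + width
    return ii[b, r] - ii[top, r] - ii[b, left] + ii[top, left]

img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 2, 2), img[1:3, 1:3].sum())  # equal
```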
M.S., School of Electrical Engineering and Computer Science
APA, Harvard, Vancouver, ISO, and other styles
16

Iacono, Massimiliano. "Object detection and recognition with event driven cameras." Doctoral thesis, Università degli studi di Genova, 2020. http://hdl.handle.net/11567/1005981.

Full text
Abstract:
This thesis presents the study, analysis and implementation of algorithms to perform object detection and recognition using an event-based camera. This sensor represents a novel paradigm which opens a wide range of possibilities for future developments of computer vision. In particular, it produces a fast, compressed, illumination-invariant output, which can be exploited for robotic tasks, where fast dynamics and significant illumination changes are frequent. The experiments are carried out on the neuromorphic version of the iCub humanoid platform. The robot is equipped with a novel dual camera setup mounted directly in the robot's eyes, used to generate data with a moving camera. The motion causes the presence of background clutter in the event stream. In this scenario the detection problem is addressed with an attention mechanism, specifically designed to respond to the presence of objects while discarding clutter. The proposed implementation takes advantage of the nature of the data to simplify the original proto-object saliency model which inspired this work. Subsequently, the recognition task is first tackled with a feasibility study, to demonstrate that the event stream carries sufficient information to classify objects, and then with the implementation of a spiking neural network. The feasibility study provides the proof of concept that events are informative enough in the context of object classification, whereas the spiking implementation improves the results by employing an architecture specifically designed to process event data. The spiking network is trained with a three-factor local learning rule which overcomes the weight transport, update locking and non-locality problems. The presented results prove that both detection and classification can be carried out in the target application using the event data.
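For readers unfamiliar with event cameras, the sketch below collapses an event stream (timestamp, x, y, polarity) into a single decayed frame — one common, simple event representation. It is context only; the resolution and decay constant are assumptions, and this is not the attention or spiking model of the thesis.

```python
import numpy as np

def events_to_frame(events, shape=(240, 304), decay=0.03, t_ref=None):
    # Accumulate events with exponential temporal decay: recent events
    # contribute more, ON and OFF polarities contribute with opposite sign.
    frame = np.zeros(shape)
    t_ref = events[-1][0] if t_ref is None else t_ref
    for t, x, y, p in events:
        frame[y, x] += (1 if p else -1) * np.exp(-(t_ref - t) / decay)
    return frame

events = [(0.010, 5, 7, 1), (0.020, 5, 8, 0), (0.025, 6, 7, 1)]
print(events_to_frame(events).sum())
```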
APA, Harvard, Vancouver, ISO, and other styles
17

Irhebhude, Martins. "Object detection, recognition and re-identification in video footage." Thesis, Loughborough University, 2015. https://dspace.lboro.ac.uk/2134/19600.

Full text
Abstract:
There have been a significant number of security concerns in recent times; as a result, security cameras have been installed to monitor activities and to prevent crimes in most public places. These analyses are performed either through video analytics or through forensic analysis based on human observation. To this end, within the research context of this thesis, a proactive machine-vision-based military recognition system has been developed to help monitor activities in the military environment. The proposed object detection, recognition and re-identification systems are presented in this thesis. A novel technique for military personnel recognition is presented first. Initially, detected camouflaged personnel are segmented using a grabcut segmentation algorithm. Since a camouflaged person's uniform generally appears similar at both the top and the bottom of the body, an image patch is extracted from the segmented foreground image and used as the region of interest. Subsequently, colour and texture features are extracted from each patch and used for classification. A second approach to personnel recognition is proposed through recognition of the badge on the cap of a military person. A feature matching metric based on Speeded Up Robust Features (SURF) extracted from the badge on a person's cap enables recognition of the person's arm of service. A state-of-the-art technique for recognising vehicle types irrespective of their view angle is also presented. Vehicles are initially detected and segmented using a Gaussian Mixture Model (GMM) based foreground/background segmentation algorithm. A Canny Edge Detection (CED) stage, followed by morphological operations, is used as a pre-processing step to enhance foreground vehicle detection and segmentation. Subsequently, Region, Histogram of Oriented Gradients (HOG) and Local Binary Pattern (LBP) features are extracted from the refined foreground vehicle object and used for vehicle type recognition. Two different datasets, with front/rear and angled views, are used and combined for testing the proposed technique. For night-time video analytics and forensics, the thesis presents a novel approach to pedestrian detection and vehicle type recognition based on a novel feature acquisition technique named CENTROG. Thermal images containing pedestrians and vehicular objects are used to analyse the performance of the proposed algorithms. The video is initially segmented using a GMM-based foreground object segmentation algorithm, and a CED-based pre-processing step is used to enhance segmentation accuracy prior to using census transforms for initial feature extraction. HOG features are then extracted from the census-transformed images and used for detection and recognition of human and vehicular objects in thermal images. Finally, a novel technique for people re-identification is proposed based on low-level colour features and mid-level attributes. The low-level colour histogram bin values are normalised to the range 0 to 1. A publicly available dataset (VIPeR) and a self-constructed dataset are used in the experiments, conducted with 7 clothing attributes and low-level colour histogram features. The 7 attributes are detected using features extracted from 5 different regions of a detected human object using an SVM classifier. The low-level colour features are extracted from the same 5 regions, which are obtained by human object segmentation and subsequent body-part sub-division. People are re-identified by computing the Euclidean distance between a probe and the gallery image sets. The experiments conducted using the SVM classifier and Euclidean distance prove that the proposed techniques attain all of the aforementioned goals. The colour and texture features proposed for camouflaged military personnel recognition surpass state-of-the-art methods. Similarly, experiments prove that combined features perform best when recognising vehicles in different views after initial training based on multiple views. In the same vein, the proposed CENTROG technique performs better than the state-of-the-art CENTRIST technique for both pedestrian detection and vehicle type recognition at night-time using thermal images. Finally, we show that the proposed 7 mid-level attributes and the low-level features result in improved accuracy for people re-identification.
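The GMM-based foreground segmentation with morphological clean-up that recurs throughout this pipeline can be sketched with OpenCV's MOG2 implementation; the history, threshold and kernel size below are arbitrary placeholder settings.

```python
import cv2
import numpy as np

# GMM background subtractor (OpenCV's MOG2 implementation).
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=32)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

def foreground_mask(frame):
    mask = subtractor.apply(frame)
    # Morphological opening/closing to remove speckle and fill holes,
    # a simple stand-in for the pre-processing stage described above.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask

frame = np.random.randint(0, 255, (240, 320, 3), dtype=np.uint8)
print(foreground_mask(frame).shape)
```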
APA, Harvard, Vancouver, ISO, and other styles
18

Garcia, Rui Pedro Figueiredo. "Object recognition for a service robot." Master's thesis, Universidade de Aveiro, 2015. http://hdl.handle.net/10773/17393.

Full text
Abstract:
Master's degree in Computer and Telematics Engineering
The continuous evolution of technology and the fast development of robotic applications have made it possible to create autonomous robots that can assist or even replace humans in daily routines and monotonous jobs. Nowadays, with the aging of the world population, it is expected that service robots will increasingly be used to assist elderly or disabled people. For this, a service robot has to be capable of avoiding obstacles while navigating in known and unknown environments, recognizing and manipulating objects, and understanding commands from humans. The objective of this dissertation is the development of a vision system, capable of detecting and recognizing household objects, for the service robot CAMBADA@Home. The proposed approach implements two methods for object detection, the first based on color histograms and the second using feature detection algorithms (SIFT and SURF). It uses depth and color information, where the 3D data is used to detect objects that rest on horizontal planes. Experimental results obtained with the CAMBADA@Home robot are presented and discussed in order to evaluate the robustness of the proposed system.
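A minimal sketch of the feature-based recognition route mentioned above: SIFT keypoint matching with Lowe's ratio test, assuming OpenCV ≥ 4.4 (where SIFT lives in the main module). The 0.75 ratio is the conventional value, not necessarily the one used in the dissertation.

```python
import cv2

def count_sift_matches(img_model, img_scene, ratio=0.75):
    sift = cv2.SIFT_create()
    _, des1 = sift.detectAndCompute(img_model, None)
    _, des2 = sift.detectAndCompute(img_scene, None)
    if des1 is None or des2 is None:
        return 0
    matcher = cv2.BFMatcher()
    good = []
    for pair in matcher.knnMatch(des1, des2, k=2):
        # Lowe's ratio test: keep a match only if it is clearly better
        # than the second-best candidate.
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return len(good)
```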
APA, Harvard, Vancouver, ISO, and other styles
19

Olafsson, Björgvin. "Partially Observable Markov Decision Processes for Faster Object Recognition." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-198632.

Full text
Abstract:
Object recognition in the real world is a big challenge in the field of computer vision. Given the potentially enormous size of the search space, it is essential to make intelligent decisions about where in the visual field to gather information, so as to reduce the computational resources needed. In this report a POMDP (Partially Observable Markov Decision Process) learning framework, using a policy gradient method and information rewards as a training signal, has been implemented and used to train fixation policies that aim to maximize the information gathered in each fixation. The purpose of such policies is to make object recognition faster by reducing the number of fixations needed. The trained policies are evaluated by simulation and compared with several fixed policies. Finally, it is shown that the framework can train policies that outperform the fixed policies for certain observation models.
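One plausible reading of "information rewards with a policy gradient method" is sketched below: the reward is the entropy reduction of the class belief after a fixation, and a linear-softmax policy over candidate fixations is updated with REINFORCE. All specifics here are assumptions for illustration.

```python
import numpy as np

def info_reward(belief_before, belief_after):
    # Information reward: reduction in entropy of the class belief.
    ent = lambda p: float(-np.sum(p * np.log(p + 1e-12)))
    return ent(belief_before) - ent(belief_after)

def reinforce_step(theta, features, action, reward, lr=0.01):
    # Linear-softmax fixation policy over candidate locations;
    # one REINFORCE ascent step on the log-probability of the action.
    logits = features @ theta
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad_logp = features[action] - p @ features
    return theta + lr * reward * grad_logp

rng = np.random.default_rng(0)
features = rng.random((5, 8))   # 5 candidate fixations, 8-dim features
theta = np.zeros(8)
r = info_reward(np.full(4, 0.25), np.array([0.7, 0.1, 0.1, 0.1]))
theta = reinforce_step(theta, features, action=2, reward=r)
print(r, theta[:3])
```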
APA, Harvard, Vancouver, ISO, and other styles
20

Solmon, Joanna Browne. "Using GIST Features to Constrain Search in Object Detection." PDXScholar, 2014. https://pdxscholar.library.pdx.edu/open_access_etds/1957.

Full text
Abstract:
This thesis investigates the application of GIST features [13] to the problem of object detection in images. Object detection refers to locating instances of a given object category in an image. It is contrasted with object recognition, which simply decides whether an image contains an object, regardless of the object's location. In much of the computer vision literature, object detection uses a "sliding window" approach to finding objects in an image. This requires moving windows of various sizes across an image and running a trained classifier on the visual features of each window, a brute-force method that can be time-consuming. I investigate whether global, easily computed GIST features can be used to classify the size and location of objects in the image, to help reduce the number of windows searched before the object is found. Using K-means clustering and Support Vector Machines to classify GIST feature vectors, I find that object size and vertical location can be classified with 73-80% accuracy. These classifications can be used to constrain the search locations and window sizes explored by object detection methods.
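A compact sklearn sketch of the described pipeline: cluster GIST vectors with K-means, append the cluster index, and train an SVM to predict a coarse location class. The feature dimensionality and labels here are random placeholders standing in for real GIST vectors and annotations.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
gist = rng.random((200, 512))             # placeholder GIST vectors
location_class = rng.integers(0, 3, 200)  # e.g. top / middle / bottom

# Cluster the GIST space, then train an SVM on the cluster-augmented
# features to predict the coarse object location (illustrative only).
clusters = KMeans(n_clusters=8, n_init=10).fit_predict(gist)
X = np.column_stack([gist, clusters])
clf = SVC().fit(X, location_class)
print(clf.predict(X[:5]))
```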
APA, Harvard, Vancouver, ISO, and other styles
21

Taurone, Francesco. "3D Object Recognition from a Single Image via Patch Detection by a Deep CNN." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019. http://amslaurea.unibo.it/18669/.

Full text
Abstract:
This thesis describes the development of a new technique for recognizing the 3D pose of an object from a single image. The whole project is based on a CNN that recognizes patches on the object, which we use to estimate the pose given an a priori model. The positions of the patches, together with knowledge of their coordinates in the model, make the estimation of the pose possible through the solution of a PnP problem. The CNN chosen for this project is Yolo. To build the training dataset for the network, a new approach is used: instead of labeling each training image individually, as in standard supervised learning, the initial coordinates of the patches are propagated to all the other images using the camera pose of each picture.
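Once the detected 2D patch centers are paired with their known 3D model coordinates, the pose follows from a PnP solve, as sketched below with OpenCV. The point coordinates and camera intrinsics are placeholders, not values from the thesis.

```python
import cv2
import numpy as np

# 3D patch coordinates in the object model frame (placeholders) and
# their 2D centers as returned by the patch detector.
object_pts = np.array([[0, 0, 0], [0.1, 0, 0], [0.1, 0.1, 0], [0, 0.1, 0]],
                      dtype=np.float32)
image_pts = np.array([[320, 240], [400, 242], [398, 322], [318, 320]],
                     dtype=np.float32)
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float32)

# Solve the PnP problem: recover the object's rotation and translation.
ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None)
print(ok, rvec.ravel(), tvec.ravel())
```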
APA, Harvard, Vancouver, ISO, and other styles
22

Li, Ying. "Efficient and Robust Video Understanding for Human-robot Interaction and Detection." The Ohio State University, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=osu152207324664654.

Full text
APA, Harvard, Vancouver, ISO, and other styles
23

Solini, Arianna. "Applicazione di Deep Learning e Computer Vision ad un Caso d'uso aziendale: Progettazione, Risoluzione ed Analisi." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021.

Find full text
Abstract:
In computer vision, Machine Learning (ML) has been a topic of discussion for more than ten years now, with the goal of creating autonomous systems able to build approximate models of three-dimensional reality starting from two-dimensional images. Thanks to this capability, images can be interpreted and understood, emulating human sight. Many researchers have created neural networks able to compete on large datasets of millions of images; as a consequence, image classification performance has kept improving, along with the ability to identify the most suitable framework for each situation, obtaining results that are as performant, fast and accurate as possible. Numerous companies around the world make use of Machine Learning and computer vision, ranging from quality control to direct assistance for people working on repetitive and often tiring tasks. This thesis work was carried out during an internship at Injenia (an Italian IT company and Google partner) within an industrial project commissioned to Injenia by an Italian multi-utility. The project involved the use of one or more ML models in the computer vision domain and, to this end, an investigation was carried out on several fronts to guide the choices made during the development process. Part of the results of the investigation provided information useful for optimizing the ML model used. Another part was used for fine-tuning an already pre-trained ML model, thus applying the transfer learning principle to the image dataset provided by the multi-utility. The purpose of the thesis is therefore to present the development and application of Machine Learning, Deep Learning and computer vision techniques to a concrete business use case.
APA, Harvard, Vancouver, ISO, and other styles
24

Thaung, Ludwig. "Advanced Data Augmentation : With Generative Adversarial Networks and Computer-Aided Design." Thesis, Linköpings universitet, Datorseende, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-170886.

Full text
Abstract:
CNN-based (Convolutional Neural Network) visual object detectors often reach human-level accuracy but need to be trained with large amounts of manually annotated data. Collecting and annotating this data can be time-consuming and financially expensive. Using generative models to augment the data can help minimize the amount of data required and increase detection performance. Many state-of-the-art generative models are Generative Adversarial Networks (GANs). This thesis investigates if and how one can utilize image data to generate new data through GANs to train a YOLO-based (You Only Look Once) object detector, and how CAD (Computer-Aided Design) models can aid in this process. In the experiments, different GAN models are trained and evaluated by visual inspection or with the Fréchet Inception Distance (FID) metric. The data provided by Ericsson Research consists of images of antenna and baseband equipment along with annotations and segmentations. Ericsson Research supplied the YOLO detector, and no modifications were made to it. Finally, the YOLO detector is trained on data generated by the chosen model and evaluated by Average Precision (AP). The results show that the generative models designed in this work can produce RGB images of high quality. However, the quality drops if binary segmentation masks are to be generated as well. The experiments with CAD input data did not result in images that could be used for training the detector. The GAN designed in this work is able to successfully replace objects in images with the style of other objects. The results show that training the YOLO detector with GAN-modified data leads to the same detection performance as training with real data. The results also show that the shapes and backgrounds of the antennas contributed more to detection performance than their style and colour.
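The FID metric mentioned above compares the Gaussian statistics of real and generated feature sets; a small numpy/scipy sketch, with random placeholder features in place of Inception activations:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_fake):
    # Frechet Inception Distance between two feature sets:
    # ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)).
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # strip numerical noise from sqrtm
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2 * covmean))

rng = np.random.default_rng(0)
print(fid(rng.normal(size=(500, 64)), rng.normal(0.1, 1, size=(500, 64))))
```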
APA, Harvard, Vancouver, ISO, and other styles
25

Nicora, Elena. "Efficient Projections for Salient Motion Detection and Representation." Doctoral thesis, Università degli studi di Genova, 2022. http://hdl.handle.net/11567/1091835.

Full text
Abstract:
Motion perception is one of the first abilities developed by our cognitive systems. From the earliest days of life, we are inclined to focus our attention on moving objects in order to gather information about what is happening around us, without having to process all the visual stimuli captured by our eyes. This ability is related to the notion of Visual Saliency. It is based on the concept of finding areas of the scene significantly different from the surroundings, and it helps both biological and computational systems reduce the amount of incoming information, which would otherwise be extremely expensive even to process in parallel. Measuring and understanding motion has gained increasing importance in several Artificial Intelligence applications over the last decades. In Computer Vision, the general problem of motion understanding is often broken down into several tasks, each specialized on a different motion-oriented goal. In recent years Deep Learning solutions have established a sort of monopoly, especially in image and video processing, reaching outstanding results for the task of interest but providing poor generalization and interpretability. Furthermore, these methods come with major drawbacks in terms of time and computational complexity, requiring huge amounts of data to learn from. Hence their use might not be suited for all approachable tasks, in particular for pipelines composed of various steps. Robotics, assisted living and video surveillance are just some examples of application domains in need of alternative algorithmic solutions promoting portability, real-time computation and the use of limited quantities of data. The aim of this thesis is to study approaches that couple effectiveness and efficiency, ultimately promoting overall sustainability. In this direction we investigate the potential of a family of efficient filters, the Gray-Code Kernels, for addressing Visual Saliency estimation with a focus on motion information. Our implementation relies on 3D kernels applied to overlapping blocks of frames and is able to gather meaningful spatio-temporal information with very light computation. Through a single set of extracted features we tackle three different motion-oriented goals: motion saliency detection, video object segmentation and motion representation. Additionally, the three intermediate results are exploited to address the problem of Human Action Recognition. To summarise, this thesis focuses on:
• The efficient computation of a set of features highlighting spatio-temporal information
• The design of global representations able to compactly describe motion cues
• The development of a framework that addresses increasingly higher-level tasks of motion understanding
The developed framework has been tested on two well-known Computer Vision tasks: Video Object Segmentation and Action Classification. We compared the motion detection and segmentation abilities of our method with classical approaches of similar complexity. In the experimental analysis we evaluate our method on publicly available datasets and show that it is able to effectively and efficiently identify the portion of the image where the motion is occurring, providing tolerance to a variety of scene conditions and complexities. We propose a comparison with classical methods for change detection, outperforming Optical Flow and Background Subtraction algorithms. By adding appearance information to our motion-based segmentation we manage to reach, under appropriate conditions, results comparable to more complex state-of-the-art approaches. Lastly, we tested the motion representation ability of our method by employing it in traditional and Deep Learning action recognition scenarios.
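As loose context for the 3D filtering described above, the snippet below convolves a block of frames with a single spatio-temporal kernel. This is a naive stand-in: the point of the Gray-Code Kernel framework is precisely that responses to a whole family of such kernels can be computed far more cheaply by sharing computation between successive filters.

```python
import numpy as np
from scipy.ndimage import convolve

# A block of frames as a 3D volume: (frames, height, width).
block = np.random.rand(8, 64, 64)
# A crude +1/-1 temporal-change kernel (illustrative, not a GCK).
kernel = np.ones((4, 4, 4))
kernel[2:] = -1
saliency = np.abs(convolve(block, kernel, mode='nearest'))
print(saliency.shape)
```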
APA, Harvard, Vancouver, ISO, and other styles
26

Piemontese, Cristiano. "Progettazione e implementazione di una applicazione didattica interattiva per il riconoscimento di oggetti basata sull'algoritmo SIFT." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amslaurea.unibo.it/10883/.

Full text
Abstract:
This work introduces the field of Computer Vision and how the SIFT algorithm fits into its landscape. SIFT itself is also described, along with the various stages it consists of and an application to the object recognition problem. Finally, a Python implementation of SIFT, created to obtain an interactive educational application, is presented, and examples of this application are shown.
APA, Harvard, Vancouver, ISO, and other styles
27

Anguzza, Umberto. "A method to develop a computer-vision based system for the automatic dairy cow identification and behaviour detection in free stall barns." Doctoral thesis, Università di Catania, 2013. http://hdl.handle.net/10761/1334.

Full text
Abstract:
In this thesis, a method to develop a computer-vision-based system (CVBS) for automatic dairy cow identification and behaviour detection in free-stall barns is proposed. Two different methodologies based on digital image processing were proposed in order to achieve dairy cow identification and behaviour detection, respectively. Suitable algorithms among those used in computer vision were chosen and adapted to the specific characteristics of the breeding environment under study. The trial was carried out during the years 2011 and 2012 in a dairy cow free-stall barn located in the municipality of Vittoria in the province of Ragusa. A multi-camera video-recording system was designed in order to obtain sequences of panoramic top-view images. The two methodologies proposed for dairy cow identification and behaviour detection were implemented in a software component of the CVBS and tested. Finally, the CVBS was validated by comparing the detection and identification results with those generated by an operator through visual recognition of cows in sequences of panoramic top-view images. This comparison allowed the computation of accuracy indices. Detection of the dairy cows' behavioural activities in the barn achieved a Cow Detection Percentage (CDP) index greater than 86% and a Quality Percentage (QP) index greater than 75%. With regard to cow identification, the CVBS achieved a CDP > 90% and a QP > 85%.
APA, Harvard, Vancouver, ISO, and other styles
28

Sharma, Vinay. "Simultaneous object detection and segmentation using top-down and bottom-up processing." Columbus, Ohio : Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1196372113.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Maurice, Camille. "Reconnaissance d'actions humaines dans des vidéos, en particulier lors d'interaction avec des objets." Thesis, Toulouse 3, 2020. http://www.theses.fr/2020TOU30188.

Full text
Abstract:
In this thesis we study the recognition of actions of daily life. Typically, different actions take place in the same place and involve various objects. This problem is difficult because of the variety and resemblance of some actions and the clutter in the background. Many computer vision approaches study this problem, and their performance often depends on the setting of certain hyper-parameters, for example, for deep learning approaches, the initialization of the learning rate or the size of the mini-batch. Based on this observation, we begin with a comparative study of hyper-parameter optimization tools from the literature applied to a computer vision problem. We then propose a first Bayesian approach for online action recognition based on high-level 3D primitives: the observation of the human skeleton and surrounding objects. The parameters to be set are optimized with the optimization tool that emerged from our comparative study. The performance of this first approach is compared to a state-of-the-art deep learning network, and a certain complementarity emerges that we propose to exploit through a fusion mechanism. Finally, following recent advances in graph convolutional networks, we propose a light and modular approach based on the construction of spatio-temporal graphs of the skeleton and objects. The different approaches are evaluated, in raw performance and with respect to under-represented actions, on several public datasets of daily-life action sequences. Our approaches show interesting results compared to the literature, especially regarding imbalanced data and under-represented classes in the datasets.
APA, Harvard, Vancouver, ISO, and other styles
30

Lin, Chung-Ching. "Detecting and tracking moving objects from a moving platform." Diss., Georgia Institute of Technology, 2012. http://hdl.handle.net/1853/49014.

Full text
Abstract:
Detecting and tracking moving objects are important topics in computer vision research. Classical methods perform well in applications with stationary cameras. However, these techniques are not suitable for moving cameras, because the unconstrained nature of realistic environments and sudden camera movements make cues to object positions rather fickle. A major difficulty is that every pixel moves and new background keeps appearing when a handheld or car-mounted camera moves. In this dissertation, a novel method for estimating camera motion parameters is discussed first. Based on the estimated camera motion parameters, two detection algorithms are developed using Bayes' rule and belief propagation. Next, an MCMC-based feature-guided particle filtering method is presented to track detected moving objects. In addition, two detection algorithms that do not use camera motion parameters are discussed. These two approaches require no pre-defined class or model to be trained in advance. The experimental results demonstrate robust detection and tracking performance across object sizes and positions.
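A common baseline for the ego-motion problem described here, sketched below with OpenCV under the assumption of a roughly planar, mostly static background, compensates camera motion with a homography before differencing consecutive frames; it illustrates the general idea rather than Lin's Bayesian formulation.

    import cv2
    import numpy as np

    def motion_mask(prev_gray, curr_gray):
        # Track sparse corners from the previous frame into the current one.
        pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                      qualityLevel=0.01, minDistance=7)
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)
        good_prev = pts[status.ravel() == 1]
        good_next = nxt[status.ravel() == 1]
        # The dominant motion is attributed to the camera (mostly static scene).
        H, _ = cv2.findHomography(good_prev, good_next, cv2.RANSAC, 3.0)
        h, w = curr_gray.shape
        stabilized = cv2.warpPerspective(prev_gray, H, (w, h))
        # Residual differences are candidate moving objects.
        diff = cv2.absdiff(curr_gray, stabilized)
        _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
        return mask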
APA, Harvard, Vancouver, ISO, and other styles
31

Petit, Antoine. "Robust visual detection and tracking of complex objects : applications to space autonomous rendez-vous and proximity operations." Phd thesis, Université Rennes 1, 2013. http://tel.archives-ouvertes.fr/tel-00931604.

Full text
Abstract:
In this thesis, we address the issue of fully localizing a known object through computer vision, using a monocular camera, which is a central problem in robotics. Particular attention is paid to space robotics applications, with the aim of providing a unified visual localization system for autonomous navigation during space rendezvous and proximity operations. Two main challenges of the problem are tackled: initially detecting the targeted object and then tracking it frame-by-frame, providing the complete pose between the camera and the object, given the 3D CAD model of the object. For detection, the pose estimation process is based on the segmentation of the moving object and on an efficient probabilistic edge-based matching and alignment procedure between a set of synthetic views of the object and a sequence of initial images. For the tracking phase, pose estimation is handled through a 3D model-based tracking algorithm, for which we propose three different types of visual features, pertinently representing the object by its edges, its silhouette, and a set of interest points. The reliability of the localization process is evaluated by propagating the uncertainty from the errors of the visual features. This uncertainty also feeds a linear Kalman filter on the camera velocity parameters. Qualitative and quantitative experiments have been performed on various synthetic and real data, with challenging imaging conditions, showing the efficiency and benefits of the different contributions and their compliance with space rendezvous applications.
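The linear Kalman filter mentioned in the abstract can be illustrated with a generic predict/update cycle; the matrices below are placeholders to be filled with the actual dynamics, and R would carry the uncertainty propagated from the visual-feature errors. A minimal NumPy sketch:

    import numpy as np

    def kalman_step(x, P, z, R, F, Q, H):
        """One predict/update cycle. x, P: state and covariance; z, R:
        measurement and its covariance (here, propagated from the errors of
        the visual features); F, Q: dynamics and process noise; H: measurement
        matrix mapping state to measurement space."""
        x = F @ x                                  # predict
        P = F @ P @ F.T + Q
        y = z - H @ x                              # innovation
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)             # Kalman gain
        x = x + K @ y                              # update
        P = (np.eye(len(x)) - K @ H) @ P
        return x, P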
APA, Harvard, Vancouver, ISO, and other styles
32

Azizpour, Hossein. "Visual Representations and Models: From Latent SVM to Deep Learning." Doctoral thesis, KTH, Datorseende och robotik, CVAP, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-192289.

Full text
Abstract:
Two important components of a visual recognition system are representation and model. Both involve the selection and learning of features that are indicative for recognition and the discarding of features that are uninformative. This thesis proposes different techniques within the frameworks of two learning systems for representation and modeling: latent support vector machines (latent SVMs) and deep learning. First, we propose various approaches to group the positive samples into clusters of visually similar instances. Given a fixed representation, the sampled space of the positive distribution is usually structured. The proposed clustering techniques include a novel similarity measure based on exemplar learning, an approach for using additional annotation, and an augmentation of the latent SVM that automatically finds clusters whose members can be reliably distinguished from the background class. In another effort, a strongly supervised DPM is suggested to study how these models can benefit from privileged information. The extra information comes in the form of semantic part annotations (i.e., their presence and location), which are used to constrain the DPM's latent variables during or prior to the optimization of the latent SVM. Its effectiveness is demonstrated on the task of animal detection. Finally, we generalize the formulation of discriminative latent variable models, including DPMs, to incorporate a new set of latent variables representing the structure or properties of negative samples; we therefore term them negative latent variables. We show that this generalization affects state-of-the-art techniques and helps visual recognition by explicitly searching for counter-evidence of an object's presence. Following the resurgence of deep networks, the last works of this thesis focus on deep learning in order to produce a generic representation for visual recognition. A Convolutional Network (ConvNet) is trained on a large annotated image classification dataset, ImageNet, with ~1.3 million images. The activations at each layer of the trained ConvNet can then be treated as the representation of an input image. We show that such a representation is surprisingly effective for various recognition tasks, making it clearly superior to all the handcrafted features previously used in visual recognition (such as HOG in our first works on DPM). We further investigate how this representation can be improved for a task at hand, and propose various factors, applied before or after training the representation, which improve the efficacy of the ConvNet representation. These factors are analyzed on 16 datasets from various subfields of visual recognition.
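Treating the activations of an ImageNet-trained ConvNet as an off-the-shelf representation can be sketched with a recent torchvision (the pretrained-weights API below assumes torchvision >= 0.13, and ResNet-50 stands in for the network used in the thesis):

    import torch
    import torchvision.models as models
    import torchvision.transforms as T

    # ImageNet-pretrained backbone; drop the classifier and keep the
    # penultimate activations as a generic image representation.
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    backbone.fc = torch.nn.Identity()
    backbone.eval()

    preprocess = T.Compose([
        T.Resize(256), T.CenterCrop(224), T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    @torch.no_grad()
    def represent(pil_image):
        x = preprocess(pil_image).unsqueeze(0)    # 1 x 3 x 224 x 224
        return backbone(x).squeeze(0)             # 2048-d feature vector

The resulting vectors would then feed a simple task-specific classifier, e.g. a linear SVM, as in the generic-representation experiments the abstract describes.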

APA, Harvard, Vancouver, ISO, and other styles
33

Harzallah, Hedi. "Contribution à la détection et à la reconnaissance d'objets dans les images." Phd thesis, Université de Grenoble, 2011. http://tel.archives-ouvertes.fr/tel-00628027/en/.

Full text
Abstract:
This thesis addresses the problem of object recognition in video images, and more particularly that of object localization. It was conducted in the context of a scientific collaboration between INRIA Rhône-Alpes and MBDA France; particular attention was therefore paid to the applicability of the proposed approaches to infrared images. The proposed localization method relies on a sliding window with a two-stage cascade which, despite its simplicity, combines speed and accuracy. The first stage is a filtering stage that rejects most false positives by means of a linear SVM classifier. The second stage eliminates the false detections left by the first stage with a slower but more powerful non-linear SVM classifier. Windows are represented by HOG and bag-of-words descriptors. The second contribution of the thesis is a method for combining object localization and image categorization. This allows, on the one hand, the image context to be taken into account when localizing objects, and on the other hand, the geometric structure of objects to be exploited when categorizing images. This method improves performance on both tasks and produces detectors and classifiers whose performance exceeds the state of the art. Finally, we consider the problem of localizing visually similar object categories and propose to decompose the object localization task into two steps: a first detection step finds the objects without identifying their category, while a second identification step predicts the category of the object. We show that this limits confusion between classes, the main problem observed for visually similar object categories. The thesis places strong emphasis on experimental validation, conducted on the PASCAL VOC benchmark as well as on image datasets created specifically for the thesis.
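The two-stage cascade lends itself to a compact sketch: a fast linear SVM rejects most windows, and a slower non-linear SVM re-scores the survivors. The scikit-learn version below is an illustration under assumed thresholds, with the window descriptors (HOG plus bag-of-words) computed elsewhere:

    import numpy as np
    from sklearn.svm import LinearSVC, SVC

    stage1 = LinearSVC(C=1.0)                 # fast filtering stage
    stage2 = SVC(kernel="rbf", C=10.0)        # slower, more accurate stage

    def train_cascade(X, y, t1=-0.5):
        """X: window descriptors (e.g. HOG + bag-of-words), y: labels."""
        stage1.fit(X, y)
        keep = stage1.decision_function(X) > t1   # windows that pass stage 1
        stage2.fit(X[keep], y[keep])

    def detect(windows, t1=-0.5, t2=0.0):
        s1 = stage1.decision_function(windows)
        survivors = np.where(s1 > t1)[0]          # cheap rejection of most windows
        if survivors.size == 0:
            return survivors
        s2 = stage2.decision_function(windows[survivors])
        return survivors[s2 > t2]                 # final detections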
APA, Harvard, Vancouver, ISO, and other styles
34

Abou, Bakr Nachwa. "Reconnaissance et modélisation des actions de manipulation." Thesis, Université Grenoble Alpes, 2020. http://www.theses.fr/2020GRALM010.

Full text
Abstract:
This thesis addresses the problem of recognition, modelling and description of human activities. We describe results on three problems: (1) the use of transfer learning for simultaneous visual recognition of objects and object states, (2) the recognition of manipulation actions from state transitions, and (3) the interpretation of a series of actions and states as events in a predefined story to construct a narrative description. These results have been developed using food preparation activities as an experimental domain. We start by recognising food classes, such as tomatoes and lettuce, and food states, such as sliced and diced, during meal preparation. We adapt the VGG network architecture to jointly learn the representations of food items and food states using transfer learning. We model actions as the transformation of object states, and use recognised object properties (state and type) to detect the corresponding manipulation actions by tracking object transformations in the video. An experimental performance evaluation of this approach is provided using the 50 Salads and EPIC-Kitchens datasets. We use the resulting action descriptions to construct narrative descriptions of the complex activities observed in videos of the 50 Salads dataset.
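Joint learning of object class and state via transfer learning can be sketched as a shared backbone with two classification heads; the PyTorch code below is an assumed simplification (class counts are placeholders, and the original work adapted VGG differently in detail):

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class ObjectStateNet(nn.Module):
        """Shared VGG-16 backbone with one head per label set."""
        def __init__(self, n_objects=20, n_states=7):
            super().__init__()
            vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
            self.features = vgg.features          # transferred convolutional layers
            self.pool = nn.AdaptiveAvgPool2d((7, 7))
            self.object_head = nn.Linear(512 * 7 * 7, n_objects)
            self.state_head = nn.Linear(512 * 7 * 7, n_states)

        def forward(self, x):                     # x: B x 3 x 224 x 224
            f = torch.flatten(self.pool(self.features(x)), 1)
            return self.object_head(f), self.state_head(f)

    def joint_loss(object_logits, state_logits, object_y, state_y):
        # Sum of cross-entropies couples the two tasks through the shared backbone.
        ce = nn.functional.cross_entropy
        return ce(object_logits, object_y) + ce(state_logits, state_y)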
APA, Harvard, Vancouver, ISO, and other styles
35

Li, Yunming. "Machine vision algorithms for mining equipment automation." Thesis, Queensland University of Technology, 2000.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
36

Azaza, Aymen. "Context, motion and semantic information for computational saliency." Doctoral thesis, Universitat Autònoma de Barcelona, 2018. http://hdl.handle.net/10803/664359.

Full text
Abstract:
The main objective of this thesis is to highlight the salient object in an image or in a video sequence. We address three important --- but in our opinion insufficiently investigated --- aspects of saliency detection. Firstly, we extend previous research on saliency that explicitly models the information provided by the context, and show the importance of explicit context modelling for saliency estimation. Several important works on saliency are based on the use of object proposals. However, these methods focus on the saliency of the object proposal itself and ignore the context. To introduce context into such saliency approaches, we couple every object proposal with its direct context. This allows us to evaluate the importance of the immediate surround (context) for its saliency. We propose several saliency features computed from the context proposals, including features based on omni-directional and horizontal context continuity. Secondly, we investigate the use of top-down methods (high-level semantic information) for the task of saliency prediction, since most computational methods are bottom-up or include only a few semantic classes. We propose to consider a wider group of object classes, which represent important semantic information that we exploit in our saliency prediction approach. Thirdly, we develop a method to detect video saliency by computing saliency from supervoxels and optical flow. In addition, we apply the context features developed in this thesis to video saliency detection; the method combines shape and motion features with our proposed context features. To summarize, we show that extending object proposals with their direct context improves saliency detection in both image and video data, we evaluate the importance of semantic information in saliency estimation, and we propose a new motion feature to detect saliency in video data. The three proposed novelties are evaluated on standard saliency benchmark datasets and are shown to improve upon the state-of-the-art.
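Coupling a proposal with its direct context can be sketched as a box expansion plus a surround mask over which context features are pooled; the expansion factor below is an assumption for illustration:

    import numpy as np

    def context_proposal(box, image_shape, factor=1.8):
        """Expand a proposal (x1, y1, x2, y2) to its direct context box."""
        x1, y1, x2, y2 = box
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        w, h = (x2 - x1) * factor, (y2 - y1) * factor
        H, W = image_shape[:2]
        return (max(0, int(cx - w / 2)), max(0, int(cy - h / 2)),
                min(W, int(cx + w / 2)), min(H, int(cy + h / 2)))

    def surround_mask(box, context_box, image_shape):
        """Binary mask of the context ring (context box minus the proposal)."""
        mask = np.zeros(image_shape[:2], dtype=bool)
        cx1, cy1, cx2, cy2 = context_box
        x1, y1, x2, y2 = box
        mask[cy1:cy2, cx1:cx2] = True
        mask[y1:y2, x1:x2] = False
        return mask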
APA, Harvard, Vancouver, ISO, and other styles
37

Peloušek, Jan. "Sledování obličejových rysů v reálném čase." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2011. http://www.nusl.cz/ntk/nusl-218823.

Full text
Abstract:
This thesis considers the problem of object recognition in digital images, in particular the recognition of the human face and its components. It describes the basics of computer vision, the Viola-Jones object detector, its implementation using the OpenCV libraries, and the test results. The thesis also describes an accurate facial feature detection system based on the Active Shape Models algorithm, together with the related classifier training mechanism, including the software implementation.
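The Viola-Jones detector the thesis implements is exposed directly by OpenCV; a minimal usage sketch (the cascade file ships with the opencv-python package, though its location can vary across installations):

    import cv2

    # Haar cascade bundled with OpenCV.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def detect_faces(bgr_image):
        gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
        gray = cv2.equalizeHist(gray)   # normalize illumination
        # Returns a list of (x, y, w, h) rectangles.
        return cascade.detectMultiScale(gray, scaleFactor=1.1,
                                        minNeighbors=5, minSize=(30, 30))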
APA, Harvard, Vancouver, ISO, and other styles
38

Lamberti, Lorenzo. "A deep learning solution for industrial OCR applications." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019. http://amslaurea.unibo.it/19777/.

Full text
Abstract:
This thesis describes a project developed during a six-month internship in the Machine Vision Laboratory of Datalogic, based in Pasadena, California. The project aims to develop a deep learning system as a possible solution for industrial optical character recognition applications. In particular, the focus falls on a specific algorithm called You Only Look Once (YOLO), a general-purpose object detector based on convolutional neural networks that currently offers state-of-the-art performance in terms of the trade-off between speed and accuracy. This algorithm is well known for reaching impressive processing speeds, but its intrinsic structure makes it struggle to detect small objects clustered together, which unfortunately matches our scenario: we are trying to read alphanumerical codes by detecting each single character and then reconstructing the final string. The final goal of this thesis is to overcome this drawback and push the accuracy of a general object detection convolutional neural network to its limits, in order to meet the demanding requirements of industrial OCR applications. To accomplish this, YOLO's unique detection approach was first mastered in its original framework, Darknet, written in C and CUDA; then all the code was translated into the Python programming language for better flexibility, which also allowed the deployment of a custom architecture. Four different datasets with increasing complexity were used as case studies, and the final performance was surprising: the accuracy varies between 99.75% and 99.97% with a processing time of 15 ms for 1000×1000 images, largely outperforming in speed the deep learning solution previously deployed by Datalogic. On the downside, the training phase usually requires a very large amount of data and time, and YOLO also showed some memorization behaviour when not enough variability is present in the training data.
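Reconstructing the final string from per-character detections reduces to ordering boxes geometrically; a minimal sketch for single-line codes (the overlap-suppression rule below is an assumption, real pipelines use full non-maximum suppression):

    def reconstruct_code(detections):
        """detections: list of (x, y, w, h, char, confidence) per character.
        Assumes a single line of text; multi-line codes would first need
        grouping by vertical position."""
        # Keep the most confident detection among boxes at the same x position.
        detections = sorted(detections, key=lambda d: -d[5])
        kept = []
        for d in detections:
            if all(abs(d[0] - k[0]) > 0.5 * min(d[2], k[2]) for k in kept):
                kept.append(d)
        # Read left to right.
        kept.sort(key=lambda d: d[0])
        return "".join(d[4] for d in kept)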
APA, Harvard, Vancouver, ISO, and other styles
39

Lee, Yeongseon. "Bayesian 3D multiple people tracking using multiple indoor cameras and microphones." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2009. http://hdl.handle.net/1853/29668.

Full text
Abstract:
Thesis (Ph.D)--Electrical and Computer Engineering, Georgia Institute of Technology, 2009.
Committee Chair: Russell M. Mersereau; Committee Member: Biing Hwang (Fred) Juang; Committee Member: Christopher E. Heil; Committee Member: Georgia Vachtsevanos; Committee Member: James H. McClellan. Part of the SMARTech Electronic Thesis and Dissertation Collection.
APA, Harvard, Vancouver, ISO, and other styles
40

Becattini, Federico. "Object and action annotation in visual media beyond categories." Doctoral thesis, 2018. http://hdl.handle.net/2158/1121033.

Full text
Abstract:
The constant growth of applications involving artificial intelligence and machine learning is an important cue for an imminent large-scale diffusion of intelligent agents in our society, intended both as robotic agents and as software modules. In particular, developments in computer vision are now more than ever of primary importance in order to provide these agents with a degree of awareness of the environment they act in. In this thesis we work toward this goal, first tackling the need to detect objects in images at a fine-grained instance level. Moving to the video domain, we then learn to discover unknown entities and to model the behavior of an important subset of them: humans. The first part of this thesis is dedicated to the image domain, for which we propose a taxonomy-based technique to speed up an ensemble of instance-based Exemplar-SVM classifiers. Exemplar-SVMs have been used in the literature to tackle object detection tasks while transferring semantic labels to the detections at a cost linear in the number of training samples. Our proposed method allows us to employ these classifiers with a sub-logarithmic dependence, resulting in speed gains of up to 100x for large ensembles. We also demonstrate the application of similar image analysis techniques in a real scenario: the development of an Android app for the Museo Novecento in Florence, Italy, which is able to recognize paintings in the museum and transfer their artistic styles to personal photos. Transitioning to videos, we then propose an approach for discovering objects in an unsupervised fashion by exploiting the temporal consistency of frame-wise object proposals. Almost without relying on the visual content of the frames, we are able to generate spatio-temporal tracks that contain generic objects and that can be used as a preliminary step in processing a video sequence. Lastly, driven by the intuition that humans should be the focus of attention in video understanding, we introduce the problem of modeling the progress of human actions at frame-level granularity. Besides knowing when someone is performing an action and where that person is in every frame, we believe that predicting how far an ongoing action has progressed will provide important benefits to an intelligent agent interacting with the surrounding environment and with the human performing the action. To this end we propose ProgressNet, a Recurrent Neural Network based model that jointly predicts the spatio-temporal extent of an action and how far it has progressed during its execution. Experiments on the challenging UCF101 and J-HMDB datasets demonstrate the effectiveness of our method.
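The action-progress idea can be sketched as a recurrent head regressing a per-frame value in [0, 1] from frame features; this is a simplified stand-in for ProgressNet, with the feature extractor and dimensions assumed:

    import torch
    import torch.nn as nn

    class ProgressHead(nn.Module):
        """Per-frame action progress in [0, 1] from a sequence of frame features."""
        def __init__(self, feature_dim=2048, hidden=256):
            super().__init__()
            self.lstm = nn.LSTM(feature_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, 1)

        def forward(self, features):              # features: B x T x feature_dim
            h, _ = self.lstm(features)
            return torch.sigmoid(self.out(h)).squeeze(-1)   # B x T

    def progress_loss(pred):
        # Supervise with a linear 0 -> 1 ramp over trimmed action clips.
        target = torch.linspace(0, 1, pred.shape[1]).expand_as(pred)
        return nn.functional.mse_loss(pred, target)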
APA, Harvard, Vancouver, ISO, and other styles
41

Moria, Kawther. "Computer vision-based detection of fire and violent actions performed by individuals in videos acquired with handheld devices." Thesis, 2016. http://hdl.handle.net/1828/7423.

Full text
Abstract:
Advances in social networks and multimedia technologies greatly facilitate the recording and sharing of video data on violent social and/or political events via Internet. These video data are a rich source of information in terms of identifying the individuals responsible for damaging public and private property through violent behavior. Any abnormal, violent individual behavior could trigger a cascade of undesirable events, such as vandalism and damage to stores and public facilities. When such incidents occur, investigators usually need to analyze thousands of hours of videos recorded using handheld devices in order to identify suspects. The exhaustive manual investigation of these video data is highly time- and resource-consuming. Automated detection techniques of abnormal events and actions based on computer vision would offer a more efficient solution to this problem. The first contribution described in this thesis consists of a novel method for fire detection in riot videos acquired with handheld cameras and smart-phones. This is a typical example of computer vision in the wild, where we have no control over the data acquisition process, and where the quality of the video data varies considerably. The proposed spatial model is based on the Mixtures of Gaussians model and exploits color adjacency in the visible spectrum of incandescence. The experimental results demonstrate that using this spatial model in concert with motion cues leads to highly accurate results for fire detection in noisy, complex scenes of rioting crowds. The second contribution consists in a method for detecting abnormal, violent actions that are performed by individual subjects and witnessed by passive crowds. The problem of abnormal individual behavior, such as a fight, witnessed by passive bystanders gathered into a crowd has not been studied before. We show that the presence of a passive, standing crowd is an important indicator that an abnormal action might occur. Thus, detecting the standing crowd improves the performance of detecting the abnormal action. The proposed method performs crowd detection first, followed by the detection of abnormal motion events. Our main theoretical contribution consists in linking crowd detection to abnormal, violent actions, as well as in defining novel sets of features that characterize static crowds and abnormal individual actions in both spatial and spatio-temporal domains. Experimental results are computed on a custom dataset, the Vancouver Riot Dataset, that we generated using amateur video footage acquired with handheld devices and uploaded on public social network sites. Our approach achieves good precision and recall values, which validates our system's reliability of localizing the crowds and the abnormal actions. To summarize, this thesis focuses on the detection of two types of abnormal events occurring in violent street movements. The data are gathered by passive participants to these movements using handheld devices. Although our data sets are drawn from one single social movement (the Vancouver 2011 Stanley Cup riot) we are confident that our approaches would generalize well and would be helpful to forensic activities performed in the context of other similar violent occasions.
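The recipe of combining a color model with motion cues for fire detection can be sketched with OpenCV; the HSV thresholds below are illustrative assumptions, standing in for the Mixtures-of-Gaussians color model the thesis actually proposes:

    import cv2
    import numpy as np

    bg_subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

    def fire_mask(bgr_frame):
        hsv = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2HSV)
        # Crude incandescence range: red-to-yellow hue, saturated, bright.
        color = cv2.inRange(hsv, (0, 80, 180), (35, 255, 255))
        motion = bg_subtractor.apply(bgr_frame)   # moving pixels
        mask = cv2.bitwise_and(color, motion)
        # Clean up small speckles.
        kernel = np.ones((5, 5), np.uint8)
        return cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)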
APA, Harvard, Vancouver, ISO, and other styles
42

Kim, Jaechul. "Region detection and matching for object recognition." 2013. http://hdl.handle.net/2152/21261.

Full text
Abstract:
In this thesis, I explore region detection and consider its impact on image matching for exemplar-based object recognition. Detecting regions is important to provide semantically meaningful spatial cues in images. Matching establishes similarity between visual entities, which is crucial for recognition. My thesis starts by detecting regions in both local and object level. Then, I leverage geometric cues of the detected regions to improve image matching for the ultimate goal of object recognition. More specifically, my thesis considers four key questions: 1) how can we extract distinctively-shaped local regions that also ensure repeatability for robust matching? 2) how can object-level shape inform bottom-up image segmentation? 3) how should the spatial layout imposed by segmented regions influence image matching for exemplar-based recognition? and 4) how can we exploit regions to improve the accuracy and speed of dense image matching? I propose novel algorithms to tackle these issues, addressing region-based visual perception from low-level local region extraction, to mid-level object segmentation, to high-level region-based matching and recognition. First, I propose a Boundary Preserving Local Region (BPLR) detector to extract local shapes. My approach defines a novel spanning-tree based image representation whose structure reflects shape cues combined from multiple segmentations, which in turn provide multiple initial hypotheses of the object boundaries. Unlike traditional local region detectors that rely on local cues like color and texture, BPLRs explicitly exploit the segmentation that encodes global object shape. Thus, they respect object boundaries more robustly and reduce noisy regions that straddle object boundaries. The resulting detector yields a dense set of local regions that are both distinctive in shape as well as repeatable for robust matching. Second, building on the strength of the BPLR regions, I develop an approach for object-level segmentation. The key insight of the approach is that objects shapes are (at least partially) shared among different object categories--for example, among different animals, among different vehicles, or even among seemingly different objects. This shape sharing phenomenon allows us to use partial shape matching via BPLR-detected regions to predict global object shape of possibly unfamiliar objects in new images. Unlike existing top-down methods, my approach requires no category-specific knowledge on the object to be segmented. In addition, because it relies on exemplar-based matching to generate shape hypotheses, my approach overcomes the viewpoint sensitivity of existing methods by allowing shape exemplars to span arbitrary poses and classes. For the ultimate goal of region-based recognition, not only is it important to detect good regions, but we must also be able to match them reliably. A matching establishes similarity between visual entities (images, objects or scenes), which is fundamental for visual recognition. Thus, in the third major component of this thesis, I explore how to leverage geometric cues of the segmented regions for accurate image matching. To this end, I propose a segmentation-guided local feature matching strategy, in which segmentation suggests spatial layout among the matched local features within each region. To encode such spatial structures, I devise a string representation whose 1D nature enables efficient computation to enforce geometric constraints. 
The method is applied to exemplar-based object classification to demonstrate the impact of my segmentation-driven matching approach. Finally, building on the idea of regions for geometric regularization in image matching, I consider how a hierarchy of nested image regions can be used to constrain dense image feature matches at multiple scales simultaneously. Moving beyond individual regions, the last part of my thesis studies how to exploit regions' inherent hierarchical structure to improve image matching. To this end, I propose a deformable spatial pyramid graphical model for image matching. The proposed model considers multiple spatial extents at once, from an entire image to grid cells to every single pixel, and strikes a balance between robust regularization by larger spatial supports on the one hand and accurate localization by finer regions on the other. Further, the pyramid model is suitable for fast coarse-to-fine hierarchical optimization. I apply the method to pixel label transfer tasks for semantic image segmentation, improving upon the state-of-the-art in both accuracy and speed. Throughout, I provide extensive evaluations on challenging benchmark datasets, validating the effectiveness of my approach. In contrast to traditional texture-based object recognition, my region-based approach enables the use of strong geometric cues, such as shape and spatial layout, that advance the state-of-the-art of object recognition. I also show that regions' inherent hierarchical structure allows fast image matching for scalable recognition. The outcome realizes the promising potential of region-based visual perception. In addition, all my code for the local shape detector, object segmentation, and image matching is publicly available, which I hope will serve as a useful new addition to vision researchers' toolboxes.
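The 1D string idea for encoding spatial layout within a region can be sketched as projecting features onto the region's principal axis and comparing the resulting label sequences with an edit distance; a simplified NumPy illustration, not the thesis's exact encoding:

    import numpy as np

    def to_string(points, labels):
        """Order quantized feature labels along the region's principal axis."""
        pts = np.asarray(points, dtype=float)
        pts -= pts.mean(axis=0)
        # Principal axis via the leading eigenvector of the covariance.
        _, vecs = np.linalg.eigh(np.cov(pts.T))
        order = np.argsort(pts @ vecs[:, -1])
        return [labels[i] for i in order]

    def edit_distance(a, b):
        """Classic dynamic-programming edit distance between two sequences."""
        dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
        dp[:, 0] = np.arange(len(a) + 1)
        dp[0, :] = np.arange(len(b) + 1)
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                dp[i, j] = min(dp[i - 1, j] + 1, dp[i, j - 1] + 1,
                               dp[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
        return dp[len(a), len(b)]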
APA, Harvard, Vancouver, ISO, and other styles
43

Mohammed, Hussein Adnan. "Object detection and recognition in complex scenes." Master's thesis, 2014. http://hdl.handle.net/10400.1/8368.

Full text
Abstract:
Master's dissertation in Informatics Engineering, Faculdade de Ciências e Tecnologia, Universidade do Algarve, 2014
Contour-based object detection and recognition in complex scenes is one of the most difficult problems in computer vision. Object contours in complex scenes can be fragmented, occluded and deformed. Instances of the same class can have a wide range of variations. Clutter and background edges can provide more than 90% of all image edges. Nevertheless, our biological vision system is able to perform this task effortlessly. On the other hand, the performance of state-of-the-art computer vision algorithms is still limited in terms of both speed and accuracy. The work in this thesis presents a simple, efficient and biologically motivated method for contour-based object detection and recognition in complex scenes. Edge segments are extracted from training and testing images using a simple contour-following algorithm at each pixel. Then a descriptor is calculated for each segment using Shape Context, including an offset distance relative to the centre of the object. A Bayesian criterion is used to determine the discriminative power of each segment in a query image by means of a nearest-neighbour lookup, and the most discriminative segments vote for potential bounding boxes. The generated hypotheses are validated using the k nearest-neighbour method in order to eliminate false object detections. Furthermore, meaningful model segments are extracted by finding edge fragments that appear frequently in training images of the same class. Only 2% of the training segments are employed in the models. These models are used as a second approach to validate the hypotheses, using a distance-based measure based on nearest-neighbour lookups of each segment of the hypotheses. A review of shape coding in the visual cortex of primates is provided, describing the shape-related roles of each region in the ventral pathway of the visual cortex. A further step towards a fully biological model for contour-based object detection and recognition is taken by implementing a model for meaningful segment extraction and binding on the basis of two biological principles: proximity and alignment. Evaluation on a challenging benchmark is performed for both the k nearest-neighbour and model-segment validation methods. Recall rates of the proposed method are compared to the results of recent state-of-the-art algorithms at 0.3 and 0.4 false positive detections per image.
Erasmus Mundus action 2, Lot IIY 2011 Scholarship Program.
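The vote-casting step described in the abstract above, in which discriminative segments hypothesize object locations via nearest-neighbour lookups, can be sketched with scikit-learn; the descriptors would be the Shape Context vectors paired with centre offsets, and the distance threshold is an assumption:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def build_index(train_desc):
        """train_desc: N x D segment descriptors from training images."""
        return NearestNeighbors(n_neighbors=1).fit(train_desc)

    def vote_for_centres(index, train_offset, query_desc, query_pos, max_dist=0.5):
        """Each discriminative query segment casts a vote for an object centre.
        train_offset: N x 2 vectors from each training segment to its object's
        centre; query_pos: positions of the query segments in the image."""
        dist, idx = index.kneighbors(query_desc)
        votes = []
        for d, i, pos in zip(dist.ravel(), idx.ravel(), query_pos):
            if d < max_dist:                      # discriminative enough
                votes.append(pos + train_offset[i])
        return np.array(votes)                    # cluster these into hypotheses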
APA, Harvard, Vancouver, ISO, and other styles
44

"Intelligent surveillance system employing object detection, recognition, segmentation, and object-based coding." 2013. http://library.cuhk.edu.hk/record=b5879094.

Full text
Abstract:
Surveillance is the process of monitoring the behaviour, activities, or changing information, usually of people, for the purpose of managing, directing or protecting, by means of electronic equipment such as closed-circuit television (CCTV) cameras, or by interception of electronically transmitted information from a distance, such as Internet traffic or phone calls. Some potential surveillance applications are homeland security, anti-crime, traffic control, and monitoring children, the elderly and patients at a distance. Surveillance technology provides a shield against terrorism and abnormal events, and cheap modern electronics makes it possible to implement with CCTV cameras. But unless the feeds from those cameras are constantly monitored, they only provide an illusion of security. Finding enough observers to watch thousands of screens is simply impractical, yet modern automated systems can solve the problem with a surprising degree of intelligence.
Surveillance with intelligence is necessary and important to accurately manage the information from millions of sensors around the clock. Generally, intelligent surveillance includes: 1. information acquisition, e.g., with a single camera or the collaboration of multiple cameras, or thermal and depth cameras; 2. video analysis, e.g., object detection, recognition, tracking, re-identification and segmentation; 3. storage and transmission, e.g., coding, classification and footage. In this thesis, we build an intelligent surveillance system in which three cameras work collaboratively to estimate the position of the object of interest (OOI) in 3D space, investigate it and track it. In order to identify the OOI, a Cascade Head-Shoulder Detector is proposed to find the face region for recognition. The object can be segmented out and compressed by arbitrarily shaped object coding (ASOC).
In the first part, we discuss how to make multiple cameras work together. In our system, two stationary cameras, like human eyes, focus on the whole scene of the surveillance region to observe abnormal events. If an alarm is triggered by an abnormal event, a PTZ camera is assigned to deal with it, for example by tracking or investigating the object. With calibrated cameras, the 3D information of the object can be estimated and communicated among the three cameras.
In the second part, the cascade head-shoulder detector (CHSD) is proposed to detect the frontal head-shoulder region in surveillance videos. High-level object analysis is performed on the detected region, e.g., recognition and abnormal behaviour analysis. In the detector, we propose a cascading structure that fuses two powerful features, the Haar-like feature and the HOG feature, which have been used to detect faces and pedestrians efficiently. With the Haar-like feature, CHSD can reject most non-head-shoulder regions in the earlier stages with limited computation. The detected region can be used for recognition and segmentation.
In the third part, the face region is extracted from the detected head-shoulder region by training a body model. Continuously adaptive mean shift (CAMshift) is proposed to refine the face region. Face recognition is a very challenging problem in the surveillance environment because the face image suffers from the concurrence of multiple factors, such as a varying pose with out-of-focus blurring under non-uniform lighting conditions. Based on these observations, we propose a face recognition method using the overlapping local phase feature (OLPF) and an adaptive Gaussian mixture model (AGMM). The OLPF feature is not only invariant to blurring but also robust to pose variations, and AGMM can robustly model various faces. Experiments conducted on standard datasets and real data demonstrate that the proposed method consistently outperforms state-of-the-art face recognition methods.
In the fourth part, we propose an automatic human body segmentation system. We first initialize a graph cut using the detected face/body and optimize the graph by max-flow/min-cut. A coarse-to-fine segmentation strategy is then employed to deal with imperfectly detected objects. Background contrast removal (BCR) and self-adaptive initialization level set (SAILS) are proposed to solve the tough problems that exist in the general graph cut model, such as errors occurring at object boundaries with high contrast and similar colors in the object and background. Experimental results demonstrate that our body segmentation system works very well on live videos and standard sequences with complex backgrounds.
In the last part, we concentrate on how to intelligently compress the video content. In recent decades, video coding research has achieved great progress, such as in H.264/AVC and the next-generation HEVC, whose compression performance significantly exceeds previous standards by more than 50%. But compared with MPEG-4, the capability of coding arbitrarily shaped objects is absent from the subsequent standards. Despite the provision of slice group structures and flexible macroblock ordering (FMO) in the current H.264/AVC, it cannot deal with arbitrarily shaped regions accurately and efficiently. To address this limitation of H.264/AVC, we propose arbitrarily shaped object coding (ASOC) based on the H.264/AVC framework, which includes binary alpha coding, motion compensation and texture coding. In our ASOC, we adopt (1) an improved binary alpha coding with a novel motion estimation to facilitate binary alpha block prediction, (2) an arbitrarily shaped integer transform derived from the 4×4 ICT in H.264/AVC to code texture, and (3) associated coding techniques to make ASOC more compatible with the new framework. We extend ASOC to HD video and evaluate it objectively and subjectively. Experimental results prove that our ASOC significantly outperforms previous object-coding methods and performs close to H.264/AVC.
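The graph-cut initialization from a detected body region maps naturally onto OpenCV's GrabCut, which can stand in for the thesis's custom graph cut in a sketch:

    import cv2
    import numpy as np

    def segment_person(bgr_image, body_box, iterations=5):
        """body_box: (x, y, w, h) rectangle from the body/head-shoulder detector."""
        mask = np.zeros(bgr_image.shape[:2], np.uint8)
        bgd_model = np.zeros((1, 65), np.float64)   # internal GMM state
        fgd_model = np.zeros((1, 65), np.float64)
        cv2.grabCut(bgr_image, mask, body_box, bgd_model, fgd_model,
                    iterations, cv2.GC_INIT_WITH_RECT)
        # Definite and probable foreground pixels form the person mask.
        person = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)
        return person.astype(np.uint8) * 255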
Liu, Qiang.
"November 2012."
Thesis (Ph.D.)--Chinese University of Hong Kong, 2013.
Includes bibliographical references (leaves 123-135).
Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Abstracts in English and Chinese.
Dedication
Acknowledgments
Abstract
Publications
Nomenclature
Contents
List of Figures
List of Tables
Chapter 1 --- Introduction
Chapter 1.1 --- Motivation and objectives
Chapter 1.2 --- A brief review of camera calibration
Chapter 1.3 --- Object detection
Chapter 1.3.1 --- Face detection
Chapter 1.3.2 --- Pedestrian detection
Chapter 1.4 --- Recognition
Chapter 1.5 --- Segmentation
Chapter 1.5.1 --- Thresholding-based methods
Chapter 1.5.2 --- Clustering-based methods
Chapter 1.5.3 --- Histogram-based methods
Chapter 1.5.4 --- Region-growing methods
Chapter 1.5.5 --- Level set methods
Chapter 1.5.6 --- Graph cut methods
Chapter 1.5.7 --- Neural network-based methods
Chapter 1.6 --- Object-based video coding
Chapter 1.7 --- Organization of thesis
Chapter 2 --- Cameras Calibration
Chapter 2.1 --- Introduction
Chapter 2.2 --- Basic Equations
Chapter 2.2.1 --- Parameters of Camera Model
Chapter 2.2.2 --- Two-view homography induced by a Plane
Chapter 2.3 --- Pair-wise pose estimation
Chapter 2.3.1 --- Homography estimation
Chapter 2.3.2 --- Calculation of n and λ
Chapter 2.3.3 --- (R,t) Estimation
Chapter 2.4 --- Distortion analysis and correction
Chapter 2.5 --- Feature detection and matching
Chapter 2.6 --- 3D point estimation and evaluation
Chapter 2.7 --- Conclusion
Chapter 3 --- Cascade Head-Shoulder Detector
Chapter 3.1 --- Introduction
Chapter 3.2 --- Cascade head-shoulder detection
Chapter 3.2.1 --- Initial feature rejecter
Chapter 3.2.2 --- Haar-like rejecter
Chapter 3.2.3 --- HOG feature classifier
Chapter 3.2.4 --- Cascade of classifiers
Chapter 3.3 --- Experimental results and analysis
Chapter 3.3.1 --- CHSD training
Chapter 3.4 --- Conclusion
Chapter 4 --- A Robust Face Recognition in Surveillance
Chapter 4.1 --- Introduction
Chapter 4.2 --- Cascade head-shoulder detection
Chapter 4.2.1 --- Body model training
Chapter 4.2.2 --- Face region refinement
Chapter 4.3 --- Face recognition
Chapter 4.3.1 --- Overlapping local phase feature (OLPF)
Chapter 4.3.2 --- Fixed Gaussian Mixture Model (FGMM)
Chapter 4.3.3 --- Adaptive Gaussian mixture model
Chapter 4.4 --- Experimental verification
Chapter 4.4.1 --- Preprocessing
Chapter 4.4.2 --- Face recognition
Chapter 4.5 --- Conclusion
Chapter 5 --- Human Body Segmentation
Chapter 5.1 --- Introduction
Chapter 5.2 --- Proposed automatic human body segmentation system
Chapter 5.2.1 --- Automatic human body detection
Chapter 5.2.2 --- Object Segmentation
Chapter 5.2.3 --- Self-adaptive initialization level set
Chapter 5.2.4 --- Object Updating
Chapter 5.3 --- Experimental results
Chapter 5.3.1 --- Evaluation using real-time videos and standard sequences
Chapter 5.3.2 --- Comparison with Other Methods
Chapter 5.3.3 --- Computational complexity analysis
Chapter 5.3.4 --- Extensions
Chapter 5.4 --- Conclusion
Chapter 6 --- Arbitrarily Shaped Object Coding
Chapter 6.1 --- Introduction
Chapter 6.2 --- Arbitrarily shaped object coding
Chapter 6.2.1 --- Shape coding
Chapter 6.2.2 --- Lossy alpha coding
Chapter 6.2.3 --- Motion compensation
Chapter 6.2.4 --- Texture coding
Chapter 6.3 --- Performance evaluation
Chapter 6.3.1 --- Objective evaluations
Chapter 6.3.2 --- Extension on HD sequences
Chapter 6.3.3 --- Subjective evaluations
Chapter 6.4 --- Conclusions
Chapter 7 --- Conclusions and future work
Chapter 7.1 --- Contributions
Chapter 7.1.1 --- 3D object positioning
Chapter 7.1.2 --- Automatic human body detection
Chapter 7.1.3 --- Human face recognition
Chapter 7.1.4 --- Automatic human body segmentation
Chapter 7.1.5 --- Arbitrarily shaped object coding
Chapter 7.2 --- Future work
Bibliography
APA, Harvard, Vancouver, ISO, and other styles
45

Hwang, Sung Ju. "Reading between the lines : object localization using implicit cues from image tags." Thesis, 2010. http://hdl.handle.net/2152/ETD-UT-2010-05-1514.

Full text
Abstract:
Current uses of tagged images typically exploit only the most explicit information: the link between the nouns named and the objects present somewhere in the image. We propose to leverage “unspoken” cues that rest within an ordered list of image tags so as to improve object localization. We define three novel implicit features from an image’s tags—the relative prominence of each object as signified by its order of mention, the scale constraints implied by unnamed objects, and the loose spatial links hinted by the proximity of names on the list. By learning a conditional density over the localization parameters (position and scale) given these cues, we show how to improve both accuracy and efficiency when detecting the tagged objects. We validate our approach with 25 object categories from the PASCAL VOC and LabelMe datasets, and demonstrate its effectiveness relative to both traditional sliding windows as well as a visual context baseline.
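The conditional density over localization parameters given tag cues can be approximated in a sketch by a kernel density estimate over (position, scale, order-of-mention) and used to re-score candidate windows; this is a simplification of the approach, with the blending weight as an assumption:

    import numpy as np
    from sklearn.neighbors import KernelDensity

    def fit_localization_prior(rows, bandwidth=0.15):
        """rows: training samples [centre_x, centre_y, log_scale, tag_rank],
        with positions and scales normalized to the image size."""
        return KernelDensity(bandwidth=bandwidth).fit(np.asarray(rows))

    def rescore(prior, detector_scores, candidates, tag_rank, alpha=0.5):
        """Blend detector scores with the tag-conditioned localization prior.
        candidates: list of (centre_x, centre_y, scale); alpha is assumed."""
        feats = np.array([[cx, cy, np.log(s), tag_rank]
                          for cx, cy, s in candidates])
        log_prior = prior.score_samples(feats)
        return np.asarray(detector_scores) + alpha * log_prior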
APA, Harvard, Vancouver, ISO, and other styles
46

Russa, Hélder Filipe de Sousa. "Computer Vision: Object recognition with deep learning applied to fashion items detection in images." Master's thesis, 2017. https://repositorio-aberto.up.pt/handle/10216/107862.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Russa, Hélder Filipe de Sousa. "Computer Vision: Object recognition with deep learning applied to fashion items detection in images." Dissertação, 2017. https://repositorio-aberto.up.pt/handle/10216/107862.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

"Computer Vision from Spatial-Multiplexing Cameras at Low Measurement Rates." Doctoral diss., 2017. http://hdl.handle.net/2286/R.I.45490.

Full text
Abstract:
In UAVs and parking lots, it is typical to first collect an enormous number of pixels using conventional imagers. This is followed by the use of expensive compression methods that throw away redundant data, after which the compressed data is transmitted to a ground station. The past decade has seen the emergence of novel imagers called spatial-multiplexing cameras, which offer compression at the sensing level itself by providing arbitrary linear measurements of the scene instead of pixel-based sampling. In this dissertation, I discuss various approaches for effective information extraction from spatial-multiplexing measurements and present the trade-offs between reliability of performance and the computational/storage load of the system. In the first part, I present a reconstruction-free approach to high-level inference in computer vision, wherein I consider the specific case of activity analysis and show that, using correlation filters, one can perform effective action recognition and localization directly from a class of spatial-multiplexing cameras called compressive cameras, even at very low measurement rates of 1%. In the second part, I outline a deep-learning-based, non-iterative, real-time algorithm to reconstruct images from compressively sensed (CS) measurements, which can outperform traditional iterative CS reconstruction algorithms in terms of reconstruction quality and time complexity, especially at low measurement rates. To overcome the limitations of compressive cameras, which operate with random measurements and are not tuned to any particular task, in the third part of the dissertation I propose a method to design spatial-multiplexing measurements that are tuned to facilitate the easy extraction of features useful in computer vision tasks like object tracking. The work presented in the dissertation provides sufficient evidence for high-level inference in computer vision at extremely low measurement rates, and hence allows us to consider the possibility of revamping current-day computer vision systems.
Doctoral Dissertation Electrical Engineering 2017
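Spatial-multiplexing measurements can be simulated as y = Φx with a random Φ at the stated measurement rate; the dense-matrix sketch below is illustrative and only practical for small patches, since real systems use structured measurement operators:

    import numpy as np

    def compressive_measure(patch, rate=0.01, seed=0):
        """Simulate spatial-multiplexing measurements y = Phi @ x at a given
        measurement rate (fraction of the pixel count)."""
        x = patch.astype(np.float64).ravel()
        n = x.size
        m = max(1, int(rate * n))                        # e.g. 1% of the pixels
        rng = np.random.default_rng(seed)
        phi = rng.standard_normal((m, n)) / np.sqrt(m)   # random patterns
        return phi @ x, phi   # downstream inference can operate directly on y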
APA, Harvard, Vancouver, ISO, and other styles
49

Larsson, Stefan, and Filip Mellqvist. "Automatic Number Plate Recognition for Android." Thesis, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kau:diva-72573.

Full text
Abstract:
This thesis describes how we utilize machine learning and image preprocessing to create a system that can extract a license plate number by taking a picture of a car with an Android smartphone. This project was provided by ÅF on behalf of one of their customers, who wanted to make their employees' workflow more efficient. The two main techniques of this project are object detection, to detect license plates, and optical character recognition, to then read them. In between are several image preprocessing steps that make the images as readable as possible, mainly skewing and color-distorting the image. The object detection consists of a convolutional neural network using the You Only Look Once (YOLO) technique, trained by us using Darkflow. When using our final product to read license plates of expected quality in our evaluation phase, we found that 94.8% of them were read correctly. Without our image preprocessing, this figure dropped to only 7.95%.
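The preprocessing between detection and OCR, deskewing and binarizing the cropped plate, can be sketched with OpenCV; the angle convention of minAreaRect varies across OpenCV versions, so the correction below is an assumption to verify:

    import cv2
    import numpy as np

    def deskew_and_binarize(plate_bgr):
        """Binarize a cropped plate and rotate it upright before OCR."""
        gray = cv2.cvtColor(plate_bgr, cv2.COLOR_BGR2GRAY)
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
        angle = cv2.minAreaRect(coords)[-1]
        # Assumes the classic (-90, 0] angle convention of minAreaRect.
        angle = -(90 + angle) if angle < -45 else -angle
        h, w = gray.shape
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        return cv2.warpAffine(binary, M, (w, h), flags=cv2.INTER_CUBIC,
                              borderMode=cv2.BORDER_REPLICATE)

The deskewed binary image would then be passed to an OCR engine; the OCR call itself is outside this sketch.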
APA, Harvard, Vancouver, ISO, and other styles
50

BALLAN, LAMBERTO. "Object and event recognition in multimedia archives using local visual features." Doctoral thesis, 2011. http://hdl.handle.net/2158/485661.

Full text
APA, Harvard, Vancouver, ISO, and other styles