Dissertations / Theses on the topic 'Deep Learning, Computer Vision, Object Detection'

Consult the top 50 dissertations / theses for your research on the topic 'Deep Learning, Computer Vision, Object Detection.'

You can also download the full text of each publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Kohmann, Erich. "Tecniche di deep learning per l'object detection." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2019. http://amslaurea.unibo.it/19637/.

Full text
Abstract:
Object detection is one of the main problems in computer vision. In recent years, with the advent of neural networks and deep learning, remarkable progress has been made in methods for tackling this problem. This thesis provides a survey of the main deep-learning-based object detection models, illustrating their fundamental characteristics and the elements that distinguish them from earlier models. After an introductory overview of deep learning and neural networks in general, the thesis presents the models whose innovative techniques brought significant improvements, both in the precision and accuracy of the predictions and in resource consumption. The second part of the work focuses on YOLO and its successors. YOLO is a model based on convolutional neural networks in which the localization and classification of objects in an image were treated for the first time as a single regression problem. This change of perspective introduced by the authors of YOLO opened the way to a new approach to object detection, facilitating the subsequent development of increasingly precise and performant models.
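The single-regression formulation the abstract credits to YOLO can be illustrated with a short sketch of how a YOLO-style output grid is decoded into detections. This is a simplified illustration under assumed grid and box counts, not code from the thesis:

```python
import numpy as np

def decode_yolo_grid(pred, S=7, B=2, C=20, conf_thresh=0.25):
    """Decode a YOLO-style output tensor of shape (S, S, B*5 + C).

    Each cell predicts B boxes (x, y, w, h, confidence) plus C class scores;
    detection is a single regression pass, with no separate proposal stage.
    """
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[B * 5:]
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                if conf < conf_thresh:
                    continue
                # x, y are offsets within the cell; w, h are relative to the image
                cx = (col + x) / S
                cy = (row + y) / S
                cls = int(np.argmax(class_probs))
                detections.append((cx, cy, w, h, conf * class_probs[cls], cls))
    return detections
```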
2

Andersson, Dickfors Robin, and Nick Grannas. "OBJECT DETECTION USING DEEP LEARNING ON METAL CHIPS IN MANUFACTURING." Thesis, Mälardalens högskola, Akademin för innovation, design och teknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-55068.

Full text
Abstract:
When designing cutting tools for the turning industry, providing optimal cutting parameters is important both for the client and for the company's own research. By examining the metal chips that form in the turning process, operators can recommend optimal cutting parameters. Instead of manually classifying the metal chips that come from the turning process, an automated approach to chip detection and classification is preferred. This thesis evaluates whether such an approach is possible using either a Convolutional Neural Network (CNN) or a CNN feature extractor coupled with machine learning (ML). The thesis started with a research phase reviewing existing state-of-the-art CNNs, image processing and ML algorithms. From this research, we implemented our own object detection algorithm and chose to implement two CNNs, AlexNet and VGG16. A third CNN was designed and implemented with our specific task in mind. The three models were tested against each other, both as standalone image classifiers and as feature extractors coupled with an ML algorithm. Because the chips were inside a machine, different angles and light setups had to be tested to determine which setup provided the optimal image for classification. A top view of the cutting area was found to be the optimal angle, with light focused both below the cutting area and in the chip disposal tray. The smaller proposed CNN, with three convolutional layers, three pooling layers and two dense layers, was found to rival both AlexNet and VGG16, both as a standalone classifier and as a feature extractor. The proposed model was designed with a limited system in mind and is therefore better suited for such systems while still achieving high accuracy. The classification accuracy of the proposed model as a standalone classifier was 92.03%, compared to 92.20% for the state-of-the-art classifier AlexNet and 91.88% for VGG16. When used as feature extractors, all three models paired best with the Random Forest algorithm, although the accuracy differences between the feature extractors are small. The proposed feature extractor combined with Random Forest had an accuracy of 82.56%, compared to 81.93% for AlexNet and 79.14% for VGG16.
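The feature-extractor-plus-classifier setup this thesis evaluates can be sketched as follows: a pretrained CNN backbone produces fixed feature vectors, on which a Random Forest is trained. A minimal illustration using Keras and scikit-learn; the dataset variables are hypothetical placeholders, not the thesis's data:

```python
import numpy as np
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from sklearn.ensemble import RandomForestClassifier

# A truncated VGG16 acts as a fixed feature extractor; the classifier is
# trained on the pooled CNN features instead of raw pixels.
extractor = VGG16(weights="imagenet", include_top=False, pooling="avg")

def extract_features(images):
    """images: float array of shape (n, 224, 224, 3) with values in [0, 255]."""
    return extractor.predict(preprocess_input(images), verbose=0)

# Hypothetical usage with a labelled set of chip images X_train / y_train:
# features = extract_features(X_train)
# clf = RandomForestClassifier(n_estimators=200).fit(features, y_train)
```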
3

Arefiyan, Khalilabad Seyyed Mostafa. "Deep Learning Models for Context-Aware Object Detection." Thesis, Virginia Tech, 2017. http://hdl.handle.net/10919/88387.

Full text
Abstract:
In this thesis, we present ContextNet, a novel general object detection framework for incorporating context cues into a detection pipeline. Current deep learning methods for object detection exploit state-of-the-art image recognition networks to classify a given region-of-interest (ROI) into predefined classes and regress a bounding box around it, without using any information about the corresponding scene. ContextNet is based on the intuitive idea that cues about the general scene (e.g., kitchen or library) change the priors about the presence or absence of some object classes. We provide a general means of integrating this notion into the decision process about a given ROI by using a network pretrained on scene recognition datasets in parallel with a pretrained network that extracts object-level features for the corresponding ROI. Using comprehensive experiments on PASCAL VOC 2007, we demonstrate the effectiveness of our design choices; the resulting system outperforms the baseline in most object classes and reaches 57.5 mAP (mean Average Precision) on the PASCAL VOC 2007 test set, compared with 55.6 mAP for the baseline.
4

Bartoli, Giacomo. "Edge AI: Deep Learning techniques for Computer Vision applied to embedded systems." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2018. http://amslaurea.unibo.it/16820/.

Full text
Abstract:
In the last decade, Machine Learning techniques have been used in different fields, ranging from finance to healthcare and even marketing. Among these techniques, those adopting a Deep Learning approach have proved able to outperform humans in tasks such as object detection, image classification and speech recognition. This thesis introduces the concept of Edge AI: the possibility of building learning models capable of making inferences locally, without any dependence on expensive servers or cloud services. The first case study we consider is based on the Google AIY Vision Kit, an intelligent camera equipped with a graphics board to optimize Computer Vision algorithms. We then test the performance of CORe50, a dataset for continuous object recognition, on embedded systems. The techniques developed in these chapters are finally used to solve a challenge within the Audi Autonomous Driving Cup 2018, where a mobile car equipped with a camera, sensors and a graphics board must recognize pedestrians and stop before hitting them.
5

Espis, Andrea. "Object detection and semantic segmentation for assisted data labeling." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2022.

Find full text
Abstract:
The automation of data labeling tasks is a solution to the errors and time costs of human labeling. In this thesis work, CenterNet, DeepLabV3 and K-Means applied to the RGB color space are deployed to build a pipeline for assisted data labeling: a semi-automatic process that iteratively improves the quality of the annotations. The proposed pipeline flagged a total of 1,547 wrong or missing annotations when applied to a dataset originally containing 8,300 annotations. Moreover, the quality of each annotation was drastically improved while saving more than 600 hours of work. The same models were also used to address a real-time tire inspection task concerning the detection of markers on the surface of tires. According to the experiments, the combination of DeepLabV3 output and post-processing based on the area and shape of the predicted blobs achieves a maximum mean Precision of 0.992 with a mean Recall of 0.982, and a maximum mean Recall of 0.998 with a mean Precision of 0.960.
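The area- and shape-based post-processing of predicted blobs mentioned above can be sketched with OpenCV contour analysis; the thresholds below are hypothetical placeholders, not values from the thesis:

```python
import cv2
import numpy as np

def filter_blobs(mask, min_area=50.0, min_circularity=0.6):
    """Keep only predicted blobs whose area and shape pass simple checks.

    mask: binary uint8 segmentation output (e.g. thresholded DeepLabV3 logits).
    Circularity = 4*pi*area / perimeter^2 equals 1 for a perfect disc.
    """
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    kept = np.zeros_like(mask)
    for c in contours:
        area = cv2.contourArea(c)
        perimeter = cv2.arcLength(c, closed=True)
        if area < min_area or perimeter == 0:
            continue
        circularity = 4 * np.pi * area / (perimeter ** 2)
        if circularity >= min_circularity:
            cv2.drawContours(kept, [c], -1, 255, thickness=-1)  # keep this blob
    return kept
```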
6

Norrstig, Andreas. "Visual Object Detection using Convolutional Neural Networks in a Virtual Environment." Thesis, Linköpings universitet, Datorseende, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-156609.

Full text
Abstract:
Visual object detection is a popular computer vision task that has been intensively investigated using deep learning on real data. However, data from virtual environments have not received the same attention. A virtual environment enables generating data for locations that are not easily reachable for data collection, e.g. aerial environments. In this thesis, we study the problem of object detection in virtual environments, more specifically an aerial virtual environment. We use a simulator to generate a synthetic data set of 16 different types of vehicles captured from an airplane. To study the performance of existing methods in virtual environments, we train and evaluate two state-of-the-art detectors on the generated data set. Experiments show that both detectors, You Only Look Once version 3 (YOLOv3) and the Single Shot MultiBox Detector (SSD), reach performance comparable to that previously reported in the literature on real data sets. In addition, we investigate different fusion techniques between detectors trained on two different subsets of the data set, in this case one containing cars with fixed colors and one containing cars with varying colors. Experiments show that it is possible to train multiple instances of the detector on different subsets of the data set and combine these detectors to boost performance.
7

Dickens, James. "Depth-Aware Deep Learning Networks for Object Detection and Image Segmentation." Thesis, Université d'Ottawa / University of Ottawa, 2021. http://hdl.handle.net/10393/42619.

Full text
Abstract:
The rise of convolutional neural networks (CNNs) in computer vision has occurred in tandem with advances in depth sensing technology. Depth cameras yield two-dimensional arrays that store, at each pixel, the distance of objects and surfaces in the scene from the sensor; aligned with a regular color image, these form so-called RGBD images. Inspired by prior models in the literature, this work develops a suite of RGBD CNN models to tackle the challenging tasks of object detection, instance segmentation, and semantic segmentation. Prominent architectures for object detection and image segmentation are modified to incorporate dual-backbone approaches that input RGB and depth images, combining features from both modalities through novel fusion modules. For each task, the models developed are competitive with state-of-the-art RGBD architectures. In particular, the proposed RGBD object detection approach achieves 53.5% mAP on the SUN RGBD 19-class object detection benchmark, while the proposed RGBD semantic segmentation architecture yields 69.4% accuracy on the SUN RGBD 37-class semantic segmentation benchmark. An original 13-class RGBD instance segmentation benchmark is introduced for the SUN RGBD dataset, on which the proposed model achieves 38.4% mAP. Additionally, an original depth-aware panoptic segmentation model is developed, trained, and tested on new benchmarks conceived for the NYUDv2 and SUN RGBD datasets. These benchmarks offer researchers a baseline for the task of RGBD panoptic segmentation on these datasets, where the novel depth-aware model outperforms a comparable RGB counterpart.
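The dual-backbone idea can be sketched in a few lines of PyTorch: two modality-specific backbones produce feature maps that a small module fuses into one representation. The concatenate-then-mix design below is an assumption for illustration, not the thesis's actual fusion module:

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Fuse per-modality feature maps by channel concatenation + 1x1 conv.

    A placeholder for a fusion module; real designs may use attention or
    gated mixing instead of a plain convolution.
    """
    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat, depth_feat):
        return self.mix(torch.cat([rgb_feat, depth_feat], dim=1))

# Single conv layers stand in for full RGB and depth backbones.
rgb_backbone = nn.Conv2d(3, 64, 3, padding=1)
depth_backbone = nn.Conv2d(1, 64, 3, padding=1)
fusion = ConcatFusion(64)

rgb = torch.randn(1, 3, 128, 128)
depth = torch.randn(1, 1, 128, 128)
fused = fusion(rgb_backbone(rgb), depth_backbone(depth))  # (1, 64, 128, 128)
```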
8

Solini, Arianna. "Applicazione di Deep Learning e Computer Vision ad un Caso d'uso aziendale: Progettazione, Risoluzione ed Analisi." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021.

Find full text
Abstract:
In computer vision, Machine Learning (ML) has been discussed for more than a decade, with the goal of creating autonomous systems able to build approximate models of three-dimensional reality starting from two-dimensional images. This capability makes it possible to interpret and understand images, emulating human vision. Many researchers have built neural networks that compete on large datasets of millions of images; as a consequence, image classification performance has improved continuously, along with the ability to identify the most suitable framework for each situation, obtaining results that are as performant, fast and accurate as possible. Numerous companies around the world use Machine Learning and computer vision, ranging from quality control to direct assistance for people working on repetitive and often tiring activities. This thesis work was carried out during an internship at Injenia (an Italian IT company and Google partner) within an industrial project commissioned to Injenia by an Italian multi-utility. The project required the use of one or more computer vision ML models, and an investigation was carried out on several fronts to guide the choices made during development. Part of the investigation's results provided information useful for optimizing the ML model used. Another part was used for fine-tuning a pre-trained ML model, thereby applying transfer learning to the dataset of images provided by the multi-utility. The purpose of the thesis is therefore to present the development and application of Machine Learning, Deep Learning and computer vision techniques to a concrete industrial use case.
9

Cuan, Bonan. "Deep similarity metric learning for multiple object tracking." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSEI065.

Full text
Abstract:
Multiple object tracking, i.e. simultaneously tracking multiple objects in a scene, is an important but challenging visual task. Objects should be accurately detected and distinguished from each other to avoid erroneous trajectories. Since remarkable progress has been made in the object detection field, "tracking-by-detection" approaches are widely adopted in multiple object tracking research. Objects are detected in advance, and tracking reduces to an association problem: linking detections of the same object through frames into trajectories. Most tracking algorithms employ both motion and appearance models for data association. For multiple object tracking problems in which many objects of the same category are present, a fine-grained discriminative appearance model is paramount. We therefore propose an appearance-based re-identification model using deep similarity metric learning to deal with multiple object tracking in mono-camera videos. Two main contributions are reported in this dissertation. First, a deep Siamese network is employed to learn an end-to-end mapping from input images to a discriminative embedding space. Different metric learning configurations using various metrics, loss functions, deep network structures, etc., are investigated in order to determine the best re-identification model for tracking. In addition, with an intuitive and simple classification design, the proposed model achieves satisfactory re-identification results, comparable to state-of-the-art approaches using triplet losses. Our approach is easy and fast to train, and the learned embedding can be readily transferred to tracking tasks. Second, we integrate the proposed re-identification model into multiple object tracking as appearance guidance for detection association. For each object to be tracked in a video, we establish an identity-related appearance model based on the learned embedding for re-identification. Similarities among detected object instances are exploited for identity classification. The collaboration and interference between appearance and motion models are also investigated, and an online appearance-motion model coupling is proposed to further improve tracking performance. Experiments on the Multiple Object Tracking Challenge benchmark prove the effectiveness of our modifications, with state-of-the-art tracking accuracy.
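The core of such a re-identification model is a shared embedding network trained so that crops of the same identity land close together in the embedding space. A minimal PyTorch sketch with a toy architecture and a contrastive loss (one of several loss families such work compares; the design here is illustrative, not the dissertation's network):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Tiny stand-in for one Siamese branch: image -> L2-normalized embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, x):
        z = self.fc(self.conv(x).flatten(1))
        return F.normalize(z, dim=1)  # unit-norm embeddings

def contrastive_loss(z1, z2, same, margin=0.5):
    """same: 1.0 if the pair shows one identity, 0.0 otherwise.

    Pulls same-identity pairs together, pushes others at least `margin` apart.
    """
    d = (z1 - z2).pow(2).sum(1).sqrt()
    return (same * d.pow(2) + (1 - same) * F.relu(margin - d).pow(2)).mean()
```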
10

Chen, Zhe. "Augmented Context Modelling Neural Networks." Thesis, The University of Sydney, 2019. http://hdl.handle.net/2123/20654.

Full text
Abstract:
Contexts provide beneficial information for machine-based image understanding tasks. However, existing context modelling methods still cannot fully exploit contexts, especially for object recognition and detection. In this thesis, we develop augmented context modelling neural networks to better utilize contexts for different object recognition and detection tasks. Our contributions are two-fold: 1) we introduce neural networks to better model instance-level visual relationships; 2) we introduce neural network-based algorithms to better utilize contexts from 3D information and synthesized data. In particular, to augment the modelling of instance-level visual relationships, we propose a context refinement network and an encapsulated context modelling network for object detection. In the context refinement study, we propose to improve the modeling of visual relationships by introducing overlap scores and confidence scores of different regions. In addition, in the encapsulated context modelling study, we boost the context modelling performance by exploiting the more powerful capsule-based neural networks. To augment the modeling of contexts from different sources, we propose novel neural networks to better utilize 3D information and synthesis-based contexts. For the modelling of 3D information, we mainly investigate the modelling of LiDAR data for road detection and the depth data for instance segmentation, respectively. In road detection, we develop a progressive LiDAR adaptation algorithm to improve the fusion of 3D LiDAR data and 2D image data. Regarding instance segmentation, we model depth data as context to help tackle the low-resolution annotation-based training problem. Moreover, to improve the modelling of synthesis-based contexts, we devise a shape translation-based pedestrian generation framework to help improve the pedestrian detection performance.
11

Gustafsson, Fredrik, and Erik Linder-Norén. "Automotive 3D Object Detection Without Target Domain Annotations." Thesis, Linköpings universitet, Datorseende, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-148585.

Full text
Abstract:
In this thesis we study a perception problem in the context of autonomous driving. Specifically, we study the computer vision problem of 3D object detection, in which objects should be detected from various sensor data and their position in the 3D world should be estimated. We also study the application of Generative Adversarial Networks in domain adaptation techniques, aiming to improve the 3D object detection model's ability to transfer between different domains. The state-of-the-art Frustum-PointNet architecture for LiDAR-based 3D object detection was implemented and found to closely match its reported performance when trained and evaluated on the KITTI dataset. The architecture was also found to transfer reasonably well from the synthetic SYN dataset to KITTI, and is thus believed to be usable in a semi-automatic 3D bounding box annotation process. The Frustum-PointNet architecture was also extended to explicitly utilize image features, which surprisingly degraded its detection performance. Furthermore, an image-only 3D object detection model was designed and implemented, which was found to compare quite favourably with current state-of-the-art in terms of detection performance. Additionally, the PixelDA approach was adopted and successfully applied to the MNIST to MNIST-M domain adaptation problem, which validated the idea that unsupervised domain adaptation using Generative Adversarial Networks can improve the performance of a task network for a dataset lacking ground truth annotations. Surprisingly, the approach did however not significantly improve upon the performance of the image-based 3D object detection models when trained on the SYN dataset and evaluated on KITTI.
12

Ogier, du Terrail Jean. "Réseaux de neurones convolutionnels profonds pour la détection de petits véhicules en imagerie aérienne." Thesis, Normandie, 2018. http://www.theses.fr/2018NORMC276/document.

Full text
Abstract:
This manuscript is an attempt to tackle the problem of small vehicle detection in vertical aerial imagery through the use of deep learning algorithms. The specificities of the problem allow the use of innovative techniques leveraging the invariances and self-similarities of automobiles and planes seen from the sky. We start with a thorough study of single-shot detectors. Building on that, we examine the effect of adding multiple stages to the detection decision process. Finally, we address the domain adaptation problem in detection through the generation of increasingly realistic synthetic data and its use in training these detectors.
13

Taurone, Francesco. "3D Object Recognition from a Single Image via Patch Detection by a Deep CNN." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019. http://amslaurea.unibo.it/18669/.

Full text
Abstract:
This thesis describes the development of a new technique for recognizing the 3D pose of an object from a single image. The project is based on a CNN that recognizes patches on the object, which we use to estimate the pose given an a priori model. The positions of the patches, together with knowledge of their coordinates in the model, make it possible to estimate the pose by solving a PnP problem. The CNN chosen for this project is YOLO. To build the training dataset for the network, a new approach is used: instead of labeling each training image individually, as in standard supervised learning, the initial patch coordinates are propagated to all the other images using the known camera pose for each picture.
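Once patch centers are detected in the image and their model-frame coordinates are known, the pose follows from a standard PnP solve. A minimal sketch with OpenCV, assuming a calibrated camera with negligible lens distortion:

```python
import numpy as np
import cv2

def pose_from_patches(model_pts, image_pts, K):
    """Recover object pose from detected patch centers via PnP.

    model_pts: (N, 3) patch coordinates in the object model frame.
    image_pts: (N, 2) detected patch centers in pixels (N >= 4).
    K: (3, 3) camera intrinsic matrix; distortion assumed negligible.
    """
    ok, rvec, tvec = cv2.solvePnP(
        model_pts.astype(np.float64),
        image_pts.astype(np.float64),
        K, distCoeffs=None, flags=cv2.SOLVEPNP_ITERATIVE)
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix
    return R, tvec
```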
14

Al, Hakim Ezeddin. "3D YOLO: End-to-End 3D Object Detection Using Point Clouds." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-234242.

Full text
Abstract:
For safe and reliable driving, it is essential that an autonomous vehicle can accurately perceive the surrounding environment. Modern sensor technologies used for perception, such as LiDAR and RADAR, deliver a large set of 3D measurement points known as a point cloud. There is a huge need to interpret point cloud data to detect other road users, such as vehicles and pedestrians. Many research studies have proposed image-based models for 2D object detection. This thesis takes it a step further and aims to develop a LiDAR-based 3D object detection model that operates in real time, with emphasis on autonomous driving scenarios. We propose 3D YOLO, an extension of YOLO (You Only Look Once), one of the fastest state-of-the-art 2D object detectors for images. The proposed model takes point cloud data as input and outputs 3D bounding boxes with class scores in real time. Most existing 3D object detectors use hand-crafted features, while our model follows the end-to-end learning fashion, which removes manual feature engineering. The 3D YOLO pipeline consists of two networks: (a) the Feature Learning Network, an artificial neural network that transforms the input point cloud to a new feature space; (b) 3DNet, a novel convolutional neural network architecture based on YOLO that learns the shape description of the objects. Our experiments on the KITTI dataset show that 3D YOLO achieves high accuracy and outperforms state-of-the-art LiDAR-based models in efficiency. This makes it a suitable candidate for deployment in autonomous vehicles.
15

Fucili, Mattia. "3D object detection from point clouds with dense pose voters." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019. http://amslaurea.unibo.it/17616/.

Full text
Abstract:
Object recognition has always been a challenging task for computer vision. It finds application in many fields, mainly in industry, for example to allow a robot to locate the objects to grasp. In recent decades such tasks have found new ways of being accomplished thanks to the rediscovery of neural networks, in particular convolutional neural networks, which have achieved excellent results in many object recognition and classification applications. The trend now is to use such networks in the automotive industry as well, to try to make the dream of self-driving cars real. There are many important works on recognizing cars from images. In this thesis we present our convolutional neural network architecture for detecting cars and their position in space using only lidar input. Storing the information about the bounding box around the car at the point level ensures a good prediction even in situations where cars are occluded. Tests are performed on the dataset most widely used for car and pedestrian detection in autonomous driving applications.
16

Capuzzo, Davide. "3D StixelNet Deep Neural Network for 3D object detection stixel-based." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2020. http://amslaurea.unibo.it/22017/.

Full text
Abstract:
This thesis presents a deep learning algorithm for 3D object detection from point clouds in an outdoor environment. The algorithm is fed with stixels, an intermediate data representation generated from a point cloud or depth map. A stixel can be thought of as a small rectangle that starts from the base of the road and rises to the top of the obstacle, summarizing the vertical surface of an object. The goal of stixels is to compress the data coming from sensors to allow fast transmission without losing information. The stixel generation algorithm is a novel algorithm developed by myself that can be applied both to point clouds generated by lidar and to depth maps generated by stereo and mono cameras. The main steps to create this type of data are: eliminating the points that lie on the ground plane; creating an average matrix that summarizes the depth of groups of stixels; and creating stixels by merging all the cells that belong to the same object. The generated stixels reduce the number of points from 40,000 to 1,200 for lidar point clouds and from 480,000 to 1,200 for depth maps. To extract 3D information from stixels, this data is fed into a deep learning algorithm adapted to receive this type of input. The adaptation started from an existing neural network used for 3D object detection in indoor environments, modified to cope with the sparsity of the data and the large size of the scene. Despite the reduction in the number of data points, thanks to careful tuning the network created in this thesis achieves the state of the art for 3D object detection. This is a relevant result because it opens the way to the use of intermediate data representations and underlines that reducing the number of points does not mean reducing information if the data are compressed in a smart way.
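The column-wise compression described in the abstract can be sketched on a dense depth map: ground pixels are discarded, depths are averaged over narrow column bands, and each stixel grows upward while the depth stays consistent with the obstacle's base. This is a simplified illustration with hypothetical parameters, not the thesis's algorithm:

```python
import numpy as np

def depth_map_to_stixels(depth, ground_mask, col_width=8, max_gap=0.5):
    """Very simplified stixel extraction from a dense depth map.

    depth: (H, W) metric depth map; ground_mask: True where a pixel is road.
    Returns stixels as (left_column, top_row, bottom_row, mean_depth).
    """
    H, W = depth.shape
    stixels = []
    for c0 in range(0, W - col_width + 1, col_width):
        band = depth[:, c0:c0 + col_width]
        obstacle = ~ground_mask[:, c0:c0 + col_width]
        # average obstacle depth per image row inside this column band
        col_depth = np.full(H, np.nan)
        for r in range(H):
            vals = band[r][obstacle[r]]
            if vals.size:
                col_depth[r] = vals.mean()
        rows = np.where(~np.isnan(col_depth))[0]
        if rows.size == 0:
            continue
        bottom = rows.max()  # stixels grow from the road upward
        top = bottom
        while top > 0 and abs(col_depth[top - 1] - col_depth[bottom]) < max_gap:
            top -= 1
        stixels.append((c0, top, bottom, float(col_depth[top:bottom + 1].mean())))
    return stixels
```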
17

Azizpour, Hossein. "Visual Representations and Models: From Latent SVM to Deep Learning." Doctoral thesis, KTH, Datorseende och robotik, CVAP, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-192289.

Full text
Abstract:
Two important components of a visual recognition system are the representation and the model. Both involve the selection and learning of features that are indicative for recognition and the discarding of features that are uninformative. This thesis, in its general form, proposes different techniques within the frameworks of two learning systems for representation and modeling: latent support vector machines (latent SVMs) and deep learning. First, we propose various approaches to group the positive samples into clusters of visually similar instances. Given a fixed representation, the sampled space of the positive distribution is usually structured. The proposed clustering techniques include a novel similarity measure based on exemplar learning, an approach for using additional annotation, and an augmentation of the latent SVM to automatically find clusters whose members can be reliably distinguished from the background class. In another effort, a strongly supervised DPM is suggested to study how these models can benefit from privileged information. The extra information comes in the form of semantic part annotations (i.e. their presence and location), which are used to constrain the DPM's latent variables during or prior to the optimization of the latent SVM. Its effectiveness is demonstrated on the task of animal detection. Finally, we generalize the formulation of discriminative latent variable models, including DPMs, to incorporate a new set of latent variables representing the structure or properties of negative samples; we therefore term them negative latent variables. We show that this generalization affects state-of-the-art techniques and helps visual recognition by explicitly searching for counter-evidence of an object's presence. Following the resurgence of deep networks, the last works of this thesis focus on deep learning in order to produce a generic representation for visual recognition. A Convolutional Network (ConvNet) is trained on a large annotated image classification dataset, ImageNet, with ~1.3 million images. The activations at each layer of the trained ConvNet can then be treated as the representation of an input image. We show that such a representation is surprisingly effective for various recognition tasks, making it clearly superior to all the handcrafted features previously used in visual recognition (such as HOG in our first works on DPM). We further investigate ways to improve this representation for a task at hand, proposing various factors, applied before or after training the representation, that can improve the efficacy of the ConvNet representation. These factors are analyzed on 16 datasets from various subfields of visual recognition.

18

Kalogeiton, Vasiliki. "Localizing spatially and temporally objects and actions in videos." Thesis, University of Edinburgh, 2018. http://hdl.handle.net/1842/28984.

Full text
Abstract:
The rise of deep learning has facilitated remarkable progress in video understanding. This thesis addresses three important tasks of video understanding: video object detection, joint object and action detection, and spatio-temporal action localization. Object class detection is one of the most important challenges in computer vision. Object detectors are usually trained on bounding boxes from still images. Recently, video has been used as an alternative source of data. Yet, training an object detector on one domain (either still images or videos) and testing on the other results in a significant performance gap compared to training and testing on the same domain. In the first part of this thesis, we examine the reasons behind this performance gap. We define and evaluate several domain shift factors: spatial location accuracy, appearance diversity, image quality, aspect distribution, and object size and camera framing. We examine the impact of these factors by comparing detection performance before and after cancelling them out. The results show that all five factors affect the performance of the detectors and that their combined effect explains the performance gap. While most existing approaches for detection in videos focus on objects or human actions separately, in the second part of this thesis we aim at detecting non-human-centric actions, i.e., objects performing actions, such as a cat eating or a dog jumping. We introduce an end-to-end multitask objective that jointly learns object-action relationships. We compare it with different training objectives, validate its effectiveness for detecting object-action pairs in videos, and show that both the object and the action detection tasks benefit from this joint learning. In experiments on the A2D dataset [Xu et al., 2015], we obtain state-of-the-art results on segmentation of object-action pairs. In the third part, we are the first to propose an action tubelet detector that leverages the temporal continuity of videos instead of operating at the frame level, as state-of-the-art approaches do. In the same way that modern detectors rely on anchor boxes, our tubelet detector is based on anchor cuboids: it takes a sequence of frames as input and outputs tubelets, i.e., sequences of bounding boxes with associated scores. Our tubelet detector outperforms all the state of the art on the UCF-Sports [Rodriguez et al., 2008], J-HMDB [Jhuang et al., 2013a], and UCF-101 [Soomro et al., 2012] action localization datasets, especially at high overlap thresholds. The improvement in detection performance is explained by both more accurate scores and more precise localization.
19

Söderlund, Henrik. "Real-time Detection and Tracking of Moving Objects Using Deep Learning and Multi-threaded Kalman Filtering : A joint solution of 3D object detection and tracking for Autonomous Driving." Thesis, Umeå universitet, Institutionen för tillämpad fysik och elektronik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-160180.

Full text
Abstract:
Perception is the most essential function of autonomous drive systems for safe and reliable driving, and LiDAR sensors are vying to become an essential element of this task. In this thesis, we present a novel real-time solution for detection and tracking of moving objects which utilizes deep-learning-based 3D object detection. Moreover, we present a joint solution which utilizes the predictability of Kalman Filters to infer object properties and semantics for the object detection algorithm, resulting in a closed loop of object detection and object tracking. On one hand, we present YOLO++, a 3D object detection network operating on point clouds only, which expands YOLOv3, the latest contribution to standard real-time object detection for three-channel images. Our object detection solution is fast, processing images at 20 frames per second. Our experiments on the KITTI benchmark suite show that we achieve state-of-the-art efficiency, but with mediocre accuracy for car detection, comparable to the result of Tiny-YOLOv3 on the COCO dataset. The main advantage of YOLO++ is that it allows for fast detection of objects with rotated bounding boxes, something Tiny-YOLOv3 cannot do. YOLO++ also performs regression of the bounding box in all directions, allowing 3D bounding boxes to be extracted from a bird's-eye-view perspective. On the other hand, we present a Multi-threaded Kalman Filtering (MTKF) solution for multiple object tracking. Each unique observation is associated to a thread through a novel concurrent data association process, and each thread contains an Extended Kalman Filter used for predicting and estimating an associated object's state over time. Furthermore, a LiDAR odometry algorithm is used to obtain absolute information about the movement of objects, since the movement of objects is inherently relative to the sensor perceiving them. We obtain 33 state updates per second with as many threads as there are cores in our main workstation. Even though the joint solution has not been tested on a system with enough computational power, it is ready for deployment. Using YOLO++ in combination with MTKF, our real-time constraint of 10 frames per second is satisfied by a large margin. Finally, we show that our system can take advantage of the predicted semantic information from the Kalman Filters to enhance the inference process in our object detection architecture.
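The per-track filtering loop described above follows the standard Kalman predict/update cycle. A minimal linear, constant-velocity sketch (the thesis uses an Extended Kalman Filter with a richer 3D state; the noise values here are assumptions):

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal linear Kalman filter over the state (x, y, vx, vy)."""

    def __init__(self, x0, y0, dt=0.1):
        self.x = np.array([x0, y0, 0.0, 0.0])
        self.P = np.eye(4)                      # state covariance
        self.F = np.array([[1, 0, dt, 0],       # constant-velocity motion model
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],        # we only observe position
                           [0, 1, 0, 0]], dtype=float)
        self.Q = 0.01 * np.eye(4)               # process noise (assumed)
        self.R = 0.1 * np.eye(2)                # measurement noise (assumed)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                       # predicted position

    def update(self, z):
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)   # Kalman gain
        self.x = self.x + K @ (np.asarray(z) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
```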
20

Peng, Zeng. "Pedestrian Tracking by using Deep Neural Networks." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-302107.

Full text
Abstract:
This project aims at using deep learning to solve the pedestrian tracking problem for autonomous driving. The research area is in the domain of computer vision and deep learning. Multi-Object Tracking (MOT) aims at tracking multiple targets simultaneously in video data. The main application scenarios of MOT are security monitoring and autonomous driving. In these scenarios, we often need to track many targets at the same time, which is not possible with object detection or single-object tracking algorithms alone, given their lack of stability and usability; we therefore need to explore the area of multiple object tracking. The proposed method breaks MOT into different stages and utilizes the motion and appearance information of targets to track them in the video data. We used three different object detectors to detect the pedestrians in frames, a person re-identification model as appearance feature extractor, and a Kalman filter as motion predictor. Our proposed model achieves 47.6% MOT accuracy and a 53.2% IDF1 score, while the model without the person re-identification module reaches only 44.8% and 45.8%, respectively. Our experimental results indicate that a robust multiple object tracking algorithm can be built from split sub-tasks and improved by representative DNN-based appearance features.
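Associating Kalman-predicted tracks with new detections, weighing both motion and appearance, is typically cast as a bipartite assignment problem. A minimal sketch using the Hungarian solver from SciPy; the cost weighting and threshold are hypothetical:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_boxes, track_feats, det_boxes, det_feats,
              w_app=0.7, max_cost=0.8):
    """Match predicted tracks to detections with a blended cost.

    Boxes are numpy arrays (x1, y1, x2, y2); feats are L2-normalized
    appearance vectors. Cost = w_app * cosine distance + (1 - w_app) * (1 - IoU).
    """
    def iou(a, b):
        x1, y1 = np.maximum(a[:2], b[:2])
        x2, y2 = np.minimum(a[2:], b[2:])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    cost = np.zeros((len(track_boxes), len(det_boxes)))
    for i, (tb, tf) in enumerate(zip(track_boxes, track_feats)):
        for j, (db, df) in enumerate(zip(det_boxes, det_feats)):
            cost[i, j] = w_app * (1.0 - tf @ df) + (1 - w_app) * (1.0 - iou(tb, db))
    rows, cols = linear_sum_assignment(cost)  # optimal 1-to-1 assignment
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_cost]
```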
21

Papakis, Ioannis. "A Graph Convolutional Neural Network Based Approach for Object Tracking Using Augmented Detections With Optical Flow." Thesis, Virginia Tech, 2021. http://hdl.handle.net/10919/103372.

Full text
Abstract:
This thesis presents a novel method for online Multi-Object Tracking (MOT) using Graph Convolutional Neural Network (GCNN) based feature extraction and end-to-end feature matching for object association. The Graph based approach incorporates both appearance and geometry of objects at past frames as well as the current frame into the task of feature learning. This new paradigm enables the network to leverage the "contextual" information of the geometry of objects and allows us to model the interactions among the features of multiple objects. Another central innovation of the proposed framework is the use of the Sinkhorn algorithm for end-to-end learning of the associations among objects during model training. The network is trained to predict object associations by taking into account constraints specific to the MOT task. Additionally, in order to increase the sensitivity of the object detector, a new approach is presented that propagates previous frame detections into each new frame using optical flow. These are treated as added object proposals which are then classified as objects. A new traffic monitoring dataset is also provided, which includes naturalistic video footage from current infrastructure cameras in Virginia Beach City with a variety of vehicle density and environment conditions. Experimental evaluation demonstrates the efficacy of the proposed approaches on the provided dataset and the popular MOT Challenge Benchmark.
This thesis presents a novel method for Multi-Object Tracking (MOT) in videos, with the main goal of associating objects between frames. The proposed method is based on a Deep Neural Network Architecture operating on a Graph Structure. The Graph based approach makes it possible to use both appearance and geometry of detected objects to retrieve high level information about their characteristics and interaction. The framework includes the Sinkhorn algorithm, which can be embedded in the training phase to satisfy MOT constraints, such as the 1 to 1 matching between previous and new objects. Another approach is also proposed to improve the sensitivity of the object detector by using previous frame detections as a guide to detect objects in each new frame, resulting in less missed objects. Alongside the new methods, a new dataset is also provided which includes naturalistic video footage from current infrastructure cameras in Virginia Beach City with a variety of vehicle density and environment conditions. Experimental evaluation demonstrates the efficacy of the proposed approaches on the provided dataset and the popular MOT Challenge Benchmark.
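The Sinkhorn algorithm mentioned above turns a raw track-to-detection score matrix into a near doubly-stochastic soft assignment by alternating row and column normalizations, and it stays differentiable so the one-to-one matching constraint can be enforced during training. A minimal sketch:

```python
import torch

def sinkhorn(scores, n_iters=20, eps=1e-8):
    """Normalize a (tracks x detections) score matrix toward doubly stochastic.

    Alternating row/column normalization is differentiable, so it can sit at
    the end of a network trained end-to-end on association targets.
    """
    P = torch.exp(scores)
    for _ in range(n_iters):
        P = P / (P.sum(dim=1, keepdim=True) + eps)  # rows sum to 1
        P = P / (P.sum(dim=0, keepdim=True) + eps)  # columns sum to 1
    return P

# Toy usage: 3 tracks vs 4 detections; the soft assignment for each track
# can be read from P.argmax(dim=1).
# P = sinkhorn(torch.randn(3, 4))
```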
22

Mhalla, Ala. "Multi-object detection and tracking in video sequences." Thesis, Université Clermont Auvergne‎ (2017-2020), 2018. http://www.theses.fr/2018CLFAC084/document.

Full text
Abstract:
The work developed in this PhD thesis is focused on video sequence analysis. The latter consists of object detection, categorization and tracking. The development of reliable solutions for the analysis of video sequences opens new horizons for several applications such as intelligent transport systems, video surveillance and robotics. In this thesis, we put forward several contributions to deal with the problems of detecting and tracking multi-objects on video sequences. The proposed frameworks are based on deep learning networks and transfer learning approaches. In a first contribution, we tackle the problem of multi-object detection by putting forward a new transfer learning framework based on the formalism and the theory of a Sequential Monte Carlo (SMC) filter to automatically specialize a Deep Convolutional Neural Network (DCNN) detector towards a target scene. The suggested specialization framework is used in order to transfer the knowledge from the source and the target domain to the target scene and to estimate the unknown target distribution as a specialized dataset composed of samples from the target domain. These samples are selected according to the importance of their weights, which reflects the likelihood that they belong to the target distribution. The obtained specialized dataset allows training a specialized DCNN detector for a target scene without human intervention. In a second contribution, we propose an original multi-object tracking framework based on spatio-temporal strategies (interlacing/inverse interlacing) and an interlaced deep detector, which improves the performance of tracking-by-detection algorithms and helps to track objects in complex videos (occlusion, intersection, strong motion). In a third contribution, we provide an embedded system for traffic surveillance, which integrates an extension of the SMC framework so as to improve the detection accuracy in both day and night conditions and to specialize any DCNN detector for both mobile and stationary cameras. Throughout this report, we provide both quantitative and qualitative results. On several aspects related to video sequence analysis, this work outperforms the state-of-the-art detection and tracking frameworks. In addition, we have successfully implemented our frameworks in an embedded hardware platform for road traffic safety and monitoring.
23

Grossman, Mikael. "Proposal networks in object detection." Thesis, KTH, Matematisk statistik, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-241918.

Full text
Abstract:
Locating and extracting useful data from images is a task that has been revolutionized in the last decade, as computing power has risen to a level at which deep neural networks can be used with success. A type of neural network that uses the convolution operation, the convolutional neural network (CNN), is suited to image-related tasks. Using the convolution operation creates opportunities for the network to learn its own filters, which previously had to be hand-engineered. For locating objects in an image, the state-of-the-art Faster R-CNN model predicts objects in two parts. First, the region proposal network (RPN) extracts regions of the picture where an object is likely to be found. Second, a detector verifies the likelihood of an object being in that region. For this thesis, we review the current literature on artificial neural networks, object detection methods and proposal methods, and present our new way of generating proposals. By replacing the RPN with our network, the multiscale proposal network (MPN), we increase the average precision (AP) by 12% and reduce the computation time per image by 10%.
24

Moussallik, Laila. "Towards Condition-Based Maintenance of Catenary wires using computer vision : Deep Learning applications on eMaintenance & Industrial AI for railway industry." Thesis, Luleå tekniska universitet, Institutionen för samhällsbyggnad och naturresurser, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-83123.

Full text
Abstract:
Railways are a core element of a sustainable transport policy in several countries, as they are considered a safe, efficient and green mode of transportation. Owing to these advantages, there is growing demand for the railway industry to increase performance, capacity and availability, in addition to safely transporting goods and people at higher speeds. To meet this demand, large adjustments to the infrastructure and improvements to the maintenance process are required. Inspection activities are essential in establishing the required maintenance, and they must be performed periodically to reduce unexpected failures and prevent dangerous consequences. Maintenance of railway catenary systems is a critical task for warranting the safety of electrical railway operation. Usually, catenary inspection is performed manually by trained personnel. However, human-based inspection is slow, lacks objectivity, has a number of crucial disadvantages and can potentially lead to dangerous consequences. With the rapid progress of artificial intelligence, computer vision detection approaches are well placed to replace traditional manual methods during inspections. In this thesis, a strategy for monitoring the health of catenary wires is developed, including the various steps needed to detect anomalies in this component. Moreover, a solution for detecting different types of wires in the railway catenary system was implemented, in which a deep learning framework is developed by combining a Convolutional Neural Network (CNN) and a Region Proposal Network (RPN).
25

IACONO, MASSIMILIANO. "Object detection and recognition with event driven cameras." Doctoral thesis, Università degli studi di Genova, 2020. http://hdl.handle.net/11567/1005981.

Full text
Abstract:
This thesis presents the study, analysis and implementation of algorithms to perform object detection and recognition using an event-based camera. This sensor represents a novel paradigm which opens a wide range of possibilities for future developments of computer vision. In particular, it produces a fast, compressed, illumination-invariant output, which can be exploited for robotic tasks where fast dynamics and significant illumination changes are frequent. The experiments are carried out on the neuromorphic version of the iCub humanoid platform. The robot is equipped with a novel dual camera setup mounted directly in the robot's eyes, used to generate data with a moving camera. The motion causes the presence of background clutter in the event stream. In such a scenario, the detection problem has been addressed with an attention mechanism, specifically designed to respond to the presence of objects while discarding clutter. The proposed implementation takes advantage of the nature of the data to simplify the original proto-object saliency model which inspired this work. Subsequently, the recognition task was tackled first with a feasibility study to demonstrate that the event stream carries sufficient information to classify objects, and then with the implementation of a spiking neural network. The feasibility study provides the proof of concept that events are informative enough in the context of object classification, whereas the spiking implementation improves the results by employing an architecture specifically designed to process event data. The spiking network was trained with a three-factor local learning rule which overcomes the weight transport, update locking and non-locality problems. The presented results prove that both detection and classification can be carried out on the target application using the event data.
APA, Harvard, Vancouver, ISO, and other styles
26

Lamberti, Lorenzo. "A deep learning solution for industrial OCR applications." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019. http://amslaurea.unibo.it/19777/.

Full text
Abstract:
This thesis describes a project developed during a six-month internship in the Machine Vision Laboratory of Datalogic, based in Pasadena, California. The project aims to develop a deep learning system as a possible solution for industrial optical character recognition applications. In particular, the focus falls on a specific algorithm called You Only Look Once (YOLO), a general-purpose object detector based on convolutional neural networks that currently offers state-of-the-art performance in terms of the trade-off between speed and accuracy. This algorithm is well known for reaching impressive processing speeds, but its intrinsic structure makes it struggle to detect small objects clustered together, which unfortunately matches our scenario: we are trying to read alphanumerical codes by detecting each single character and then reconstructing the final string. The final goal of this thesis is to overcome this drawback and push the accuracy of a general object detection convolutional neural network to its limits, in order to meet the demanding requirements of industrial OCR applications. To accomplish this, YOLO's unique detection approach was first mastered in its original framework, called Darknet, written in C and CUDA; then all the code was translated into the Python programming language for better flexibility, which also allowed the deployment of a custom architecture. Four different datasets of increasing complexity were used as case studies, and the final performance reached was remarkable: the accuracy varies between 99.75% and 99.97% with a processing time of 15 ms for 1000×1000 images, largely outperforming in speed the deep learning solution previously deployed by Datalogic. On the downside, the training phase usually requires a very large amount of data and time, and YOLO also showed some memorization behaviour if not enough variability is given at training time.
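The string-reconstruction step mentioned above can be illustrated with a short sketch (our own illustration, not Datalogic's code): per-character detections are filtered by confidence, sorted left to right and concatenated. The threshold value is an assumption.

    def boxes_to_string(detections, min_conf=0.5):
        """detections: list of (x_center, y_center, w, h, char, confidence)."""
        kept = [d for d in detections if d[5] >= min_conf]  # drop weak detections
        kept.sort(key=lambda d: d[0])                       # left-to-right reading order
        return "".join(d[4] for d in kept)

    dets = [(120, 40, 18, 30, "7", 0.98), (96, 41, 18, 30, "A", 0.99)]
    print(boxes_to_string(dets))  # -> "A7"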
APA, Harvard, Vancouver, ISO, and other styles
27

Cottignoli, Lorenzo. "Strumento di Realtà Aumentata su Dispositivi Mobili per Labeling di Immagini Semi-Automatico." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019. http://amslaurea.unibo.it/17734/.

Full text
Abstract:
This thesis proposes the implementation of an innovative augmented reality system for mobile devices for semi-automatic image labeling. The goal of this thesis is to provide a simple and intuitive support tool capable of drastically reducing the time needed to generate the datasets required to train neural networks for object detection. To this end, an Android mobile application was developed that enables the semi-automatic creation of datasets. In addition, a desktop Python script was implemented that converts the previously generated dataset into the TensorFlow format and uses it to train a CNN.
APA, Harvard, Vancouver, ISO, and other styles
28

Arcidiacono, Claudio Salvatore. "An empirical study on synthetic image generation techniques for object detectors." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-235502.

Full text
Abstract:
Convolutional Neural Networks are a very powerful machine learning tool that has outperformed other techniques in image recognition tasks. The biggest drawback of this method is the massive amount of training data required, since producing training data for image recognition tasks is very labor intensive. To tackle this issue, different techniques have been proposed to generate synthetic training data automatically. These synthetic data generation techniques can be grouped in two categories: the first category generates synthetic images using computer graphics software and CAD models of the objects to recognize; the second category generates synthetic images by cutting the object from an image and pasting it onto another image. Since both techniques have their pros and cons, it would be interesting for industries to investigate the two approaches in more depth. A common use case in industrial scenarios is detecting and classifying objects inside an image. Different objects belonging to classes relevant in industrial scenarios are often indistinguishable (for example, they are all the same component). For these reasons, this thesis aims to answer the research question “Among the CAD model generation techniques, the cut-paste generation techniques and a combination of the two techniques, which technique is more suitable for generating images for training object detectors in industrial scenarios?”. In order to answer the research question, two synthetic image generation techniques, one from each category, are proposed. The proposed techniques are tailored for applications where all the objects belonging to the same class are indistinguishable, but they can also be extended to other applications. The two synthetic image generation techniques are compared by measuring the performance of an object detector trained using synthetic images on a test dataset of real images. The performance of the two synthetic data generation techniques when used for data augmentation has also been measured. The empirical results show that the CAD model generation technique works significantly better than the cut-paste generation technique when synthetic images are the only source of training data (61% better), whereas the two generation techniques perform equally well as data augmentation techniques. Moreover, the empirical results show that models trained using only synthetic images perform almost as well as the model trained using real images (7.4% worse) and that augmenting the dataset of real images with synthetic images improves the performance of the model (9.5% better).
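A minimal sketch of the cut-paste idea, under the assumption that a binary mask of the object is available, is given below (an illustration, not the thesis implementation); the bounding-box annotation is obtained for free from the paste location.

    import numpy as np

    def cut_paste(obj_img, obj_mask, background, rng=None):
        """Paste a masked object crop at a random location on a background image."""
        rng = rng or np.random.default_rng()
        h, w = obj_img.shape[:2]
        H, W = background.shape[:2]
        x0 = int(rng.integers(0, W - w))
        y0 = int(rng.integers(0, H - h))
        out = background.copy()
        region = out[y0:y0 + h, x0:x0 + w]
        m = obj_mask.astype(bool)
        region[m] = obj_img[m]           # copy only the object pixels
        bbox = (x0, y0, x0 + w, y0 + h)  # the annotation comes for free
        return out, bbox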
APA, Harvard, Vancouver, ISO, and other styles
29

Schennings, Jacob. "Deep Convolutional Neural Networks for Real-Time Single Frame Monocular Depth Estimation." Thesis, Uppsala universitet, Avdelningen för systemteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-336923.

Full text
Abstract:
Vision-based active safety systems, which estimate the depth of the objects ahead, have become increasingly common in modern vehicles for autonomous driving (AD) and advanced driver-assistance systems (ADAS). In this thesis a lightweight deep convolutional neural network performing real-time depth estimation on single monocular images is implemented and evaluated. Many of the vision-based automatic brake systems in modern vehicles only detect pre-trained object types such as pedestrians and vehicles. These systems fail to detect general objects such as road debris and roadside obstacles. In stereo vision systems the problem is resolved by calculating a disparity image from the stereo image pair to extract depth information. The distance to an object can also be determined using radar and LiDAR systems. By using this depth information the system performs the necessary actions to avoid collisions with objects that are determined to be too close. However, these systems are also more expensive than a regular mono camera system and are therefore not very common in the average consumer car. By implementing robust depth estimation in mono vision systems, the benefits of active safety systems could be made available to a larger segment of the vehicle fleet. This could drastically reduce traffic accidents related to human error and possibly save many lives. The network architecture evaluated in this thesis is more lightweight than other CNN architectures previously used for monocular depth estimation. The proposed architecture is therefore preferable for use on computationally lightweight systems. The network solves a supervised regression problem during the training procedure in order to produce a pixel-wise depth estimation map. The network was trained using sparse ground truth images with spatially incoherent and discontinuous data, and outputs a dense, spatially coherent and continuous depth map prediction. The spatially incoherent ground truth posed a problem of discontinuity that was addressed by a masked loss function with regularization. The network was able to predict a dense depth estimation on the KITTI dataset with close to state-of-the-art performance.
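The masked loss idea can be sketched as follows (illustrative only; the regularization term used in the thesis is omitted). Pixels without a ground-truth depth value, here assumed to be encoded as zeros, are simply excluded from the regression loss, so the sparse, discontinuous annotation does not penalize the dense prediction.

    import torch

    def masked_depth_loss(pred, target):
        """pred, target: (B, 1, H, W) tensors; target == 0 marks missing depth (assumed)."""
        mask = target > 0                  # only supervise annotated pixels
        diff = pred[mask] - target[mask]
        return torch.mean(diff ** 2)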
APA, Harvard, Vancouver, ISO, and other styles
30

Pitteri, Giorgia. "3D Object Pose Estimation in Industrial Context." Thesis, Bordeaux, 2020. http://www.theses.fr/2020BORD0202.

Full text
Abstract:
3D object detection and pose estimation are of primary importance for tasks such as robotic manipulation and augmented reality, and they have been the focus of intense research in recent years. Methods relying on depth data acquired by depth cameras are robust. Unfortunately, active depth sensors are power hungry, and sometimes they cannot be used at all; it is therefore often desirable to rely on color images. When training machine learning algorithms that aim at estimating an object's 6D pose from images, many challenges arise, especially in industrial contexts that require handling objects with symmetries and generalizing to unseen objects, i.e. objects never seen by the networks during training. In this thesis, we first analyse the link between the symmetries of a 3D object and its appearance in images. Our analysis explains why symmetrical objects can be a challenge when training machine learning algorithms to predict their 6D pose from images. We then propose an efficient and simple solution that relies on the normalization of the pose rotation. This approach is general and can be used with any 6D pose estimation algorithm. Then, we address the second main challenge: generalization to unseen objects. Many recent methods for 6D pose estimation are robust and accurate, but their success can be attributed to supervised machine learning approaches. For each new object, these methods have to be retrained on many different images of this object, which are not always available. Even if domain transfer methods allow for training such methods with synthetic images instead of real ones, at least to some extent, such training sessions take time, and it is highly desirable to avoid them in practice. We propose two methods to handle this problem. The first method relies only on the objects’ geometries and focuses on objects with prominent corners, which covers a large number of industrial objects. We first learn to detect object corners of various shapes in images and also to predict their 3D poses, by using training images of a small set of objects. To detect a new object in a given image, we first identify its corners from its CAD model; we also detect the corners visible in the image and predict their 3D poses. We then introduce a RANSAC-like algorithm that robustly and efficiently detects and estimates the object’s 3D pose by matching its corners on the CAD model with their detected counterparts in the image. The second method overcomes the limitations of the first one, as it does not require objects to have specific corners or the offline selection of corners on the CAD model. It combines deep learning and 3D geometry, and relies on an embedding of the local 3D geometry to match the CAD models to the input images. For points at the surface of objects, this embedding can be computed directly from the CAD model; for image locations, we learn to predict it from the image itself. This establishes correspondences between 3D points on the CAD model and 2D locations in the input images. However, many of these correspondences are ambiguous, as many points may have similar local geometries. We also show that we can use Mask-RCNN in a class-agnostic way to detect the new objects without retraining, and thus drastically limit the number of possible correspondences. We can then robustly estimate a 3D pose from these discriminative correspondences using a RANSAC-like algorithm.
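The final pose step of both methods, estimating a 6D pose from 3D-2D correspondences with a RANSAC-like scheme, can be sketched with OpenCV's RANSAC PnP solver (our illustration; the thesis's own algorithm may differ, and the intrinsics and threshold are assumptions):

    import cv2
    import numpy as np

    def pose_from_correspondences(pts3d, pts2d, K):
        """pts3d: (N, 3) CAD-model points; pts2d: (N, 2) matched image locations."""
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            np.asarray(pts3d, np.float32), np.asarray(pts2d, np.float32),
            K, distCoeffs=None, reprojectionError=3.0)  # pixel threshold (assumed)
        return (rvec, tvec, inliers) if ok else None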
APA, Harvard, Vancouver, ISO, and other styles
31

Grard, Matthieu. "Generic instance segmentation for object-oriented bin-picking." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSEC015.

Full text
Abstract:
Robotic random bin-picking is a fast-expanding industrial task that consists in robotizing the unloading of many object instances piled up in bulk, one at a time, for further processing such as kitting or part assembly. However, explicit object models are not always available in many bin-picking applications, especially in the food and automotive industries. Furthermore, object instances are often subject to intra-class variations, for example due to elastic deformations. Object pose estimation techniques, which require an explicit model and assume rigid transformations, are therefore not suitable in such contexts. The alternative approach, which consists in detecting grasps without an explicit notion of object, proves hardly efficient when the object geometry makes bulk instances prone to occlusion and entanglement. These approaches also typically rely on a multi-view scene reconstruction that may be unfeasible due to transparent and shiny textures, or that critically reduces the time frame for image processing in high-throughput robotic applications. In collaboration with Siléane, a French company in industrial robotics, we thus aim at developing a learning-based solution for localizing the most affordable instance of a pile from a single image, in open loop, without explicit object models. In the context of industrial bin-picking, our contribution is two-fold. First, we propose a novel fully convolutional network (FCN) for jointly delineating instances and inferring the spatial layout at their boundaries. Indeed, the state-of-the-art methods for such a task rely on two independent streams for boundaries and occlusions respectively, whereas occlusions often cause boundaries. Specifically, the mainstream approach, which consists in isolating instances in boxes before detecting boundaries and occlusions, fails in bin-picking scenarios as a rectangular region often includes several instances. By contrast, our box proposal-free architecture recovers fine instance boundaries, augmented with their occluding side, from a unified scene representation. As a result, the proposed network outperforms the two-stream baselines on synthetic data and public real-world datasets. Second, as FCNs require large training datasets that are not available in bin-picking applications, we propose a simulation-based pipeline for generating training images using physics and rendering engines. Specifically, piles of instances are simulated and rendered with their ground-truth annotations from sets of texture images and meshes to which multiple random deformations are applied. We show that the proposed synthetic data is plausible for real-world applications in the sense that it enables the learning of deep representations transferable to real data. Through extensive experiments on a real-world robotic setup, our synthetically trained network outperforms the industrial baseline while achieving real-time performance. The proposed approach thus establishes a new baseline for model-free object-oriented bin-picking.
APA, Harvard, Vancouver, ISO, and other styles
32

Rispoli, Luca. "Un approccio deep learning-based per il conteggio di persone tramite videocamere low-cost in un contesto Smart Campus." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2019. http://amslaurea.unibo.it/19567/.

Full text
Abstract:
Recent technological advances have driven a rapid evolution of so-called smart technologies, which are now integrated into a vast number of systems. The spread of these technologies has not been limited to computing devices and equipment, but has also involved other sectors, such as construction, giving rise to the concept of the "smart building". A smart building aims to offer its occupants a high level of comfort, creating an ecosystem in which the various electronic devices can operate and interact with each other in complete autonomy, while paying considerable attention to avoiding waste and reducing environmental impact as much as possible. The Cesena campus was built according to these principles and is the context in which the following project was developed: a scalable, low-cost system whose purpose is to monitor the occupancy of classrooms by counting people, using low-cost embedded devices and artificial intelligence algorithms. The system must be able to operate in full autonomy and must offer, according to parameters defined in the thesis, a certain degree of reliability and trustworthiness. The goal was achieved by connecting cameras to single-board computers on which artificial intelligence algorithms for person recognition were configured; the system also provides a web server application that allows users to consult the counts and receive notifications about possible malfunctions.
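The counting step can be sketched as follows (our illustration, not the thesis code): a COCO-pretrained lightweight detector, plausible for a single-board computer, is run on a frame and the boxes classified as "person" above a confidence threshold (an assumed value) are counted.

    import torch
    import torchvision

    model = torchvision.models.detection.ssdlite320_mobilenet_v3_large(pretrained=True).eval()
    PERSON = 1  # COCO label id for "person"

    def count_people(image_tensor, threshold=0.6):
        """image_tensor: (3, H, W) float tensor with values in [0, 1]."""
        with torch.no_grad():
            out = model([image_tensor])[0]
        keep = (out["labels"] == PERSON) & (out["scores"] > threshold)
        return int(keep.sum())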
APA, Harvard, Vancouver, ISO, and other styles
33

Suzano, Massa Francisco Vitor. "Mise en relation d'images et de modèles 3D avec des réseaux de neurones convolutifs." Thesis, Paris Est, 2017. http://www.theses.fr/2017PESC1198/document.

Full text
Abstract:
The recent availability of large catalogs of 3D models enables new possibilities for 3D reasoning on photographs. This thesis investigates the use of convolutional neural networks (CNNs) for relating 3D objects to 2D images. We first introduce two contributions that are used throughout this thesis: an automatic memory reduction library for deep CNNs, and a study of CNN features for cross-domain matching. In the first one, we develop a library built on top of Torch7 which automatically reduces up to 91% of the memory requirements for deploying a deep CNN. As a second point, we study the effectiveness of various CNN features extracted from a pre-trained network in the case of images from different modalities (real or synthetic images). We show that despite the large cross-domain difference between rendered views and photographs, it is possible to use some of these features for instance retrieval, with possible applications to image-based rendering. There has been a recent use of CNNs for the task of object viewpoint estimation, sometimes with very different design choices. We present these approaches in a unified framework and we analyse the key factors that affect performance. We propose a joint training method that combines both detection and viewpoint estimation, which performs better than considering the viewpoint estimation separately. We also study the impact of formulating viewpoint estimation either as a discrete or a continuous task, we quantify the benefits of deeper architectures and we demonstrate that using synthetic data is beneficial. With all these elements combined, we improve over previous state-of-the-art results on the Pascal3D+ dataset by approximately 5% in mean average viewpoint precision. In the instance retrieval study, the image of the object is given and the goal is to identify, among a number of 3D models, which object it is. We extend this work to object detection, where instead we are given a 3D model (or a set of 3D models) and we are asked to locate and align the model in the image. We show that simply using CNN features is not enough for this task, and we propose to learn a transformation that brings the features from the real images close to the features from the rendered views. We evaluate our approach both qualitatively and quantitatively on two standard datasets: the IKEAobject dataset, and the subset of the Pascal VOC 2012 dataset containing instances of the chair category, and we show state-of-the-art results on both of them.
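The feature transformation can be sketched as a small adapter network trained with a mean-squared-error loss on paired features (our illustration; the 4096-dimensional feature size and the architecture are assumptions):

    import torch
    import torch.nn as nn

    adapter = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))
    optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-4)

    def adaptation_step(real_feats, rendered_feats):
        """Paired (N, 4096) CNN features of real photos and rendered CAD views."""
        loss = nn.functional.mse_loss(adapter(real_feats), rendered_feats)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()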
APA, Harvard, Vancouver, ISO, and other styles
34

Carletti, Angelo. "Development of a machine learning algorithm for the automatic analysis of microscopy images in an in-vitro diagnostic platform." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021.

Find full text
Abstract:
In this thesis we present the development of machine learning algorithms for single-cell analysis in an in-vitro diagnostic platform for Cellply, a startup that operates in precision medicine. We researched the state of the art of deep learning for biomedical image analysis, and we analyzed the impact that convolutional neural networks have had on object detection tasks. We then compared neural networks currently used for cell detection, and we chose the one (StarDist) that performs efficient detection even in crowded-cell contexts. We trained models using the StarDist algorithm in the open-source platform ZeroCostDL4Mic, using code and GPUs in the Colab environment. We trained different models, intended for distinct applications, and we evaluated them using metrics such as precision and recall. These are our results:
• a model for single-channel brightfield images taken from samples of Covid patients, which guarantees a precision of about 0.98 and a recall of about 0.96;
• a model for multi-channel images (i.e. a stack of multiple images, each one highlighting different content) taken from experiments on natural killer cells, with precision and recall of about 0.81;
• a model for multi-channel images taken from samples of AML (Acute Myeloid Leukemia) patients, with precision and recall of about 0.73;
• a simpler model, trained to detect the main area (named "well") in which cells can be found, in order to discard whatever lies outside this area; this model has a precision of about 1 and a recall of about 0.98.
Finally, we wrote Python code that reads a text input file containing the information needed to run a specified trained model for cell detection, with certain parameters, on a given set of images from an experiment. The output of the code is a .csv file storing the measurements for every detected "object of interest" (i.e. cells or other particles). We also discuss future developments in this field.
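The batch-inference step can be sketched with the public StarDist API (a sketch under assumed file and model names, not Cellply's code): a trained model is loaded, run over a folder of images, and one CSV row is written per detected object.

    import csv
    import glob
    from csbdeep.utils import normalize
    from stardist.models import StarDist2D
    from tifffile import imread

    model = StarDist2D(None, name="covid_brightfield", basedir="models")  # assumed model name

    with open("detections.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image", "object_id", "y", "x", "probability"])
        for path in sorted(glob.glob("experiment/*.tif")):
            labels, details = model.predict_instances(normalize(imread(path)))
            for i, (pt, prob) in enumerate(zip(details["points"], details["prob"])):
                writer.writerow([path, i, float(pt[0]), float(pt[1]), float(prob)])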
APA, Harvard, Vancouver, ISO, and other styles
35

Sievert, Rolf. "Instance Segmentation of Multiclass Litter and Imbalanced Dataset Handling : A Deep Learning Model Comparison." Thesis, Linköpings universitet, Datorseende, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-175173.

Full text
Abstract:
Instance segmentation has great potential for improving the current state of littering by autonomously detecting and segmenting different categories of litter. With this information, litter could, for example, be geotagged to aid litter pickers or to give precise locational information to unmanned vehicles for autonomous litter collection. Land-based litter instance segmentation is a relatively unexplored field, and this study aims to compare the instance segmentation models Mask R-CNN and DetectoRS on the multiclass litter dataset Trash Annotations in Context (TACO), using the Common Objects in Context (COCO) precision and recall metrics. TACO is an imbalanced dataset, and therefore imbalanced-data handling is addressed, exercising a second-order-relation iterative stratified split, and additionally oversampling when training Mask R-CNN. Mask R-CNN without oversampling resulted in a segmentation mAP of 0.127, and with oversampling 0.163. DetectoRS achieved a segmentation mAP of 0.167, and improves the segmentation mAP of small objects most noticeably, by a factor of at least 2, which is important within the litter domain since small objects such as cigarettes are overrepresented. In contrast, oversampling with Mask R-CNN does not seem to improve the general precision on small and medium objects, but only improves the detection of large objects. It is concluded that DetectoRS improves results compared to Mask R-CNN, as does oversampling. However, using a dataset that cannot have an all-class representation for the train, validation, and test splits, together with an iterative stratification that does not guarantee all-class representation, makes it hard for future works to do exact comparisons to this study. Results are therefore approximate when considering all categories, since 12 categories are missing from the test set, 4 of which were impossible to split into train, validation, and test sets. Further image collection and annotation to mitigate the imbalance would most noticeably improve results, since results depend on class-averaged values. Oversampling with DetectoRS would also help improve results. There is also the option to combine the two datasets TACO and MJU-Waste to enforce training of more categories.
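As an illustration of oversampling for imbalanced detection data (our sketch, in the spirit of repeat-factor sampling rather than the thesis's exact procedure), each image can be repeated in an epoch according to its rarest category:

    from collections import Counter

    def repeat_factors(image_categories, target_fraction=0.01):
        """image_categories: one set of category ids per image."""
        n = len(image_categories)
        freq = Counter(c for cats in image_categories for c in cats)
        cat_rf = {c: max(1.0, (target_fraction / (freq[c] / n)) ** 0.5) for c in freq}
        # an image is repeated according to its rarest category
        return [max((cat_rf[c] for c in cats), default=1.0)
                for cats in image_categories]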
APA, Harvard, Vancouver, ISO, and other styles
36

Gao, Yuan. "Surround Vision Object Detection Using Deep Learning." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-231929.

Full text
Abstract:
The thesis first develops an object detection framework for front-view camera images in a surround vision dataset. With the goal of reducing the amount of annotated data as much as possible, various domain adaptation methods are applied to train on the other cameras' images starting from a pretrained baseline model. Relevant data analysis is performed to reveal useful information in the object distributions across all cameras. Regularization techniques involving dropout, weight decay and data augmentation are applied to lower the complexity of the training model. Also, ratio-reduction experiments are carried out to find the relationship between model performance and the amount of training data. It is shown that 30% of the training data for the left-rear and left-front view cameras can be removed without hurting model performance significantly. In addition, the thesis plots the errors regarding vehicle locations as heatmaps, which are useful for further study. Overall, the results of these extensive experiments indicate that the model trained by domain adaptation is effective, as expected.
APA, Harvard, Vancouver, ISO, and other styles
37

Mordan, Taylor. "Conception d'architectures profondes pour l'interprétation de données visuelles." Electronic Thesis or Diss., Sorbonne université, 2018. http://www.theses.fr/2018SORUS270.

Full text
Abstract:
Nowadays, images are ubiquitous through the use of smartphones and social media. It then becomes necessary to have automatic means of processing them, in order to analyze and interpret the large amount of available data. In this thesis, we are interested in object detection, i.e. the problem of identifying and localizing all objects present in an image. This can be seen as a first step toward a complete visual understanding of scenes. It is tackled with deep convolutional neural networks, under the Deep Learning paradigm. One drawback of this approach is the need for labeled data to learn from. Since precise annotations are time-consuming to produce, bigger datasets can be built with partial labels. We design global pooling functions to work with them and to recover latent information in two cases: learning spatially localized and part-based representations from image- and object-level supervisions respectively. We address the issue of efficiency in end-to-end learning of these representations by leveraging fully convolutional networks. Besides, exploiting additional annotations on available images can be an alternative to having more images, especially in the data-deficient regime. We formalize this problem as a specific kind of multi-task learning with a primary objective to focus on, and design a way to effectively learn from this auxiliary supervision under this framework
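The global pooling idea for image-level supervision can be sketched as follows (a minimal illustration, not the architectures proposed in the thesis): a fully convolutional trunk produces per-class score maps, and a global pooling function reduces them to image-level predictions so that a standard classification loss applies, while the maps themselves localize the evidence.

    import torch
    import torch.nn as nn

    class GlobalPoolClassifier(nn.Module):
        def __init__(self, backbone, feat_dim, num_classes):
            super().__init__()
            self.backbone = backbone                 # any fully convolutional trunk
            self.score = nn.Conv2d(feat_dim, num_classes, kernel_size=1)

        def forward(self, x):
            maps = self.score(self.backbone(x))      # (B, C, h, w) per-class score maps
            logits = maps.amax(dim=(2, 3))           # global max pooling (one choice among many)
            return logits, maps                      # maps give the spatial localization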
APA, Harvard, Vancouver, ISO, and other styles
38

Zhang, Kaige. "Deep Learning for Crack-Like Object Detection." DigitalCommons@USU, 2019. https://digitalcommons.usu.edu/etd/7616.

Full text
Abstract:
Cracks are common defects on the surfaces of man-made structures such as pavements, bridges, walls of nuclear power plants, ceilings of tunnels, etc. Timely discovery and repair of cracks are of great significance for keeping infrastructure healthy and preventing further damage. Traditionally, crack inspection was conducted manually, which was labor-intensive, time-consuming and costly. For example, statistics from the Central Intelligence Agency show that the world’s road network has reached 64,285,009 km in length, of which the United States has 6,586,610 km. Maintaining and upgrading such an immense road network is hugely costly. Thus, fully automatic crack detection has received increasing attention. With the development of artificial intelligence (AI), the deep learning technique has achieved great success and has been viewed as the most promising way to perform crack detection. Based on deep learning, this research solves four important issues in crack-like object detection. First, the noise problem caused by textured backgrounds is solved by using a deep classification network to remove non-crack regions before conducting crack detection. Second, computational efficiency is greatly improved. Third, crack localization accuracy is improved. Fourth, the proposed model is very stable and can be used to deal with a wide range of crack detection tasks. In addition, this research performs a preliminary study of a future AI system, providing a concept that has the potential to realize fully automatic crack detection without human intervention.
APA, Harvard, Vancouver, ISO, and other styles
39

Yellapantula, Sudha Ravali. "Synthesizing Realistic Data for Vision Based Drone-to-Drone Detection." Thesis, Virginia Tech, 2019. http://hdl.handle.net/10919/91460.

Full text
Abstract:
In this thesis, we aimed at building a robust UAV (drone) detection algorithm through which one drone could detect another drone in flight. Though this was a straightforward object detection problem, the biggest challenge we faced for drone detection was the limited number of drone images available for training. To address this issue, we used Generative Adversarial Networks, CycleGAN to be precise, to generate realistic-looking fake images that were indistinguishable from real data. CycleGAN is a classic example of an image-to-image translation technique, which we applied to our situation, where synthetic images from one domain were transformed into another domain containing real data. The model, once trained, was capable of generating realistic-looking images from synthetic data without the presence of real images. Following this, we employed a state-of-the-art object detection model, YOLO (You Only Look Once), to build a drone detection model trained on the generated images. Finally, the performance of this model was compared across different datasets in order to evaluate it.
Master of Science
In recent years, technologies like deep learning and machine learning have seen many rapid developments. Among their many applications, object detection is one of the most widely used and well-established problems. In our thesis, we deal with a scenario where we have a swarm of drones, and our aim is for one drone to recognize another drone in its field of vision. As no drone image dataset was readily available, we explored different ways of generating realistic data to address this issue. Finally, we proposed a solution for generating realistic images using deep learning techniques and trained an object detection model on them, evaluating how well it performed against other models.
APA, Harvard, Vancouver, ISO, and other styles
40

Matosevic, Antonio. "Batch Active Learning for Deep Object Detection in Videos." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-292035.

Full text
Abstract:
Relatively recent progress in object detection can mainly be attributed to the success of deep neural networks. However, training such models requires large amounts of annotated data. This poses a two-fold problem: obtaining labelled data is a time-consuming process, and training models on many instances is computationally costly. A common approach to this end is to employ active learning, which amounts to constructing a strategy for interactively querying far fewer data points while maximizing performance. In the context of deep object detection in videos, two new challenges arise. Firstly, common uncertainty-based query strategies depend on the quality of the uncertainty estimates, which often require special treatment for deep neural networks. Secondly, the nature of batch-based training calls for querying subsets of images, which, due to the inherent temporal similarity of video frames, may lack the informativeness needed to increase performance. In this work we attempt to remedy both issues by proposing strategies relying on improved uncertainty estimates and diversification methods. Experiments show that our proposed uncertainty-based strategies are comparable to a random baseline, while the diversity-based ones, conditioned on improved uncertainty estimates, yield significantly better performance than the baseline. In particular, our best strategy, using only 15% of the data, comes as close as 90.27% of the performance obtained when using all the available data to train the detector.
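A minimal sketch of an uncertainty-based query strategy is given below (illustrative; the thesis's strategies and its diversification step are more involved): each unlabeled frame is scored by the mean entropy of its detections' class posteriors, and the top-B frames are queried for annotation.

    import numpy as np

    def entropy(p, eps=1e-12):
        p = np.clip(p, eps, 1.0)
        return -(p * np.log(p)).sum(axis=-1)

    def select_batch(frame_class_probs, budget):
        """frame_class_probs: {frame_id: (num_dets, num_classes) probability array}."""
        scores = {f: entropy(p).mean() if len(p) else 0.0
                  for f, p in frame_class_probs.items()}
        return sorted(scores, key=scores.get, reverse=True)[:budget]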
APA, Harvard, Vancouver, ISO, and other styles
41

Ibrahim, Ahmed Sobhy Elnady. "End-To-End Text Detection Using Deep Learning." Diss., Virginia Tech, 2017. http://hdl.handle.net/10919/81277.

Full text
Abstract:
Text detection in the wild is the problem of locating text in images of everyday scenes. It is a challenging problem due to the complexity of everyday scenes. This problem is of great importance for many trending applications, such as self-driving cars. Previous research in text detection has been dominated by multi-stage sequential approaches which suffer from many limitations, including error propagation from one stage to the next. Another line of work is the use of deep learning techniques. Some of the deep methods used for text detection are box detection models and fully convolutional models. Box detection models suffer from the nature of the annotations, which may be too coarse to provide detailed supervision. Fully convolutional models learn to generate pixel-wise maps that represent the locations of text instances in the input image. These models suffer from the inability to create accurate word-level annotations without heavy post-processing. To overcome these problems, we propose a novel end-to-end system based on a combination of novel deep learning techniques. The proposed system consists of an attention model, based on a new deep architecture proposed in this dissertation, followed by a deep network based on Faster-RCNN. The attention model produces a high-resolution map that indicates likely locations of text instances. A novel aspect of the system is an early fusion step that merges the attention map directly with the input image prior to word-box prediction. This approach suppresses, but does not eliminate, contextual information from consideration. Progressively larger models were trained in 3 separate phases. The resulting system has demonstrated an ability to detect text under difficult conditions related to illumination, resolution, and legibility. The system has exceeded the state of the art on the ICDAR 2013 and COCO-Text benchmarks with F-measure values of 0.875 and 0.533, respectively.
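The early fusion step can be sketched as follows (our illustration; the actual merging scheme and the floor value are assumptions): the attention map reweights the input image before the Faster-RCNN stage, suppressing, but never fully zeroing out, non-text context.

    import torch

    def early_fusion(image, attention, floor=0.3):
        """image: (B, 3, H, W); attention: (B, 1, H, W) with values in [0, 1]."""
        weight = floor + (1.0 - floor) * attention  # context is damped, not eliminated
        return image * weight                       # fed to the word-box predictor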
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
42

Rahman, Quazi Marufur. "Performance monitoring of deep learning vision systems during deployment." Thesis, Queensland University of Technology, 2022. https://eprints.qut.edu.au/229733/1/Quazi%20Marufur_Rahman_Thesis.pdf.

Full text
Abstract:
This thesis investigates how to monitor the performance of deep learning vision systems in mobile robots. It conducts state-of-the-art research into validating the real-time performance of mobile robots such as self-driving cars. This research is significant for deploying visual-sensor-dependent autonomous vehicles in our daily lives. This knowledge will allow a mobile robot to be alerted to its own performance degradation so that it can take preventive measures to reduce the risk of hazardous consequences for the robot, its surroundings and any person involved.
APA, Harvard, Vancouver, ISO, and other styles
43

Capellier, Édouard. "Application of machine learning techniques for evidential 3D perception, in the context of autonomous driving." Thesis, Compiègne, 2020. http://www.theses.fr/2020COMP2534.

Full text
Abstract:
The perception task is paramount for self-driving vehicles. Being able to extract accurate and significant information from sensor inputs is mandatory in order to ensure safe operation. The recent progress of machine-learning techniques has revolutionized the way perception modules for autonomous driving are developed and evaluated, vastly surpassing previous state-of-the-art results in practically all perception-related tasks. Efficient and accurate ways to model the knowledge used by a self-driving vehicle are therefore mandatory. Indeed, self-awareness and appropriate modeling of doubt are desirable properties for such a system. In this work, we assumed that evidence theory was an efficient way to finely model the information extracted from deep neural networks. Based on these intuitions, we developed three perception modules that rely on machine learning and evidence theory. These modules were tested on real-life data. First, we proposed an asynchronous evidential occupancy grid mapping algorithm that fuses semantic segmentation results obtained from RGB images with LIDAR scans. Its asynchronous nature makes it particularly well suited to handling sensor failures. The semantic information is used to define decay rates at the cell level and to handle potentially moving objects. Then, we proposed an evidential classifier of LIDAR objects. This system is trained to distinguish between vehicles and vulnerable road users, which are detected via a clustering algorithm. The classifier can be reinterpreted as performing a fusion of simple evidential mass functions. Moreover, a simple statistical filtering scheme can be used to filter outputs of the classifier that are incoherent with regard to the training set, so as to allow the classifier to work in an open world and reject other types of objects. Finally, we investigated the possibility of performing road detection in LIDAR scans with deep neural networks. We proposed two architectures inspired by recent state-of-the-art LIDAR processing systems. A training dataset was acquired and labeled in a semi-automatic fashion from road maps. A set of fused neural networks reaches satisfactory results, which allowed us to use them in an evidential road mapping and object detection algorithm that manages to run at 10 Hz.
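The cell-wise fusion of evidential mass functions can be illustrated with Dempster's rule over the frame {free (F), occupied (O)} (a generic sketch of the combination rule, not the thesis's exact fusion scheme; the mass values are examples):

    def dempster(m1, m2):
        """m1, m2: masses over "F", "O" and "FO" (ignorance); returns their combination."""
        conflict = m1["F"] * m2["O"] + m1["O"] * m2["F"]
        k = 1.0 - conflict  # normalize by the non-conflicting mass
        return {
            "F": (m1["F"] * m2["F"] + m1["F"] * m2["FO"] + m1["FO"] * m2["F"]) / k,
            "O": (m1["O"] * m2["O"] + m1["O"] * m2["FO"] + m1["FO"] * m2["O"]) / k,
            "FO": (m1["FO"] * m2["FO"]) / k,
        }

    lidar = {"F": 0.1, "O": 0.7, "FO": 0.2}   # illustrative masses
    camera = {"F": 0.2, "O": 0.5, "FO": 0.3}
    print(dempster(lidar, camera))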
APA, Harvard, Vancouver, ISO, and other styles
44

Aytar, Yusuf. "Transfer learning for object category detection." Thesis, University of Oxford, 2014. http://ora.ox.ac.uk/objects/uuid:c9e18ff9-df43-4f67-b8ac-28c3fdfa584b.

Full text
Abstract:
Object category detection, the task of determining whether one or more instances of a category are present in an image, together with their locations, is one of the fundamental problems of computer vision. The task is very challenging because of the large variations in imaged object appearance, particularly due to changes in viewpoint and illumination, and to intra-class variance. Although successful solutions exist for learning object category detectors, they require massive amounts of training data. Transfer learning builds upon previously acquired knowledge and thus reduces training requirements. The objective of this work is to develop and apply novel transfer learning techniques specific to the object category detection problem. This thesis proposes methods which not only address the challenges of transfer learning for object category detection, such as finding relevant sources for transfer, handling aspect-ratio mismatches, and considering the geometric relations between features, but also enable large-scale object category detection by quickly learning from considerably fewer training samples and by immediate evaluation of models on web-scale data with the help of part-based indexing. Several novel transfer models are introduced: (a) rigid transfer, for transferring knowledge between similar classes; (b) deformable transfer, which tolerates small structural changes by deforming the source detector while performing the transfer; and (c) part-level transfer, particularly for cases where full template transfer is not possible due to aspect-ratio mismatches or the lack of adequately similar sources. Building upon the idea of part-level transfer, part-based indexing is proposed for efficient evaluation of templates instead of an exhaustive sliding-window search, enabling immediate detection results on large-scale image collections. Furthermore, easier and more robust optimization methods are developed with the help of feature maps defined between the proposed transfer learning formulations and the “classical” SVM formulation.
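The "rigid transfer" model can be read as a biased-regularisation SVM in which the target weight vector is pulled towards a (scaled) source detector instead of towards zero. The following is a simplified sketch of that idea; the subgradient update, the `gamma` scaling, and the toy data are assumptions made for illustration, not the thesis's exact optimisation.

```python
import numpy as np

def transfer_svm(X, y, w_src, C=1.0, gamma=1.0, epochs=200, lr=0.01):
    """Subgradient descent on 0.5*||w - gamma*w_src||^2 + C * sum(hinge losses)."""
    w = gamma * w_src.copy()          # warm-start at the transferred template
    for _ in range(epochs):
        margins = y * (X @ w)
        violators = margins < 1.0     # samples inside the margin
        # Subgradient: (w - gamma*w_src) - C * sum over violators of y_i * x_i
        grad = (w - gamma * w_src) - C * (y[violators, None] * X[violators]).sum(axis=0)
        w -= lr * grad
    return w

# Tiny illustrative example: 2-D points, a hypothetical source template.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) + np.array([1.0, 0.0])
y = np.where(X[:, 0] > 1.0, 1.0, -1.0)
w = transfer_svm(X, y, w_src=np.array([1.0, 0.0]))
print(w)
```

With few target samples, the solution stays close to `gamma * w_src`; with more data, the hinge term dominates and the detector adapts to the new class.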
APA, Harvard, Vancouver, ISO, and other styles
45

Moniruzzaman, Md. "Seagrass detection using deep learning." Thesis, Edith Cowan University, Research Online, Perth, Western Australia, 2019. https://ro.ecu.edu.au/theses/2261.

Full text
Abstract:
Seagrasses play an essential role in the marine ecosystem by providing food, nutrients, and habitat to marine life. They work as marine bioindicators by reflecting the health condition of aquatic environments. Seagrasses also act as a significant atmospheric carbon sink that mitigates global warming and rapid climate change. Given this importance, it is critical to monitor seagrasses across coastlines, which includes detection, mapping, percentage-cover calculation, and health estimation. Remote sensing-based aerial and spectral images, acoustic images, and underwater two-dimensional and three-dimensional digital images have so far been used to monitor seagrasses. For close monitoring, machine learning classifiers such as the support vector machine (SVM), the maximum likelihood classifier (MLC), the logistic model tree (LMT) and the multilayer perceptron (MP) have been used for seagrass classification from two-dimensional digital images. All of these approaches used handcrafted feature extraction methods, which are semi-automatic. In recent years, deep learning-based automatic object detection and image classification have achieved tremendous success, especially in the computer vision area. However, to the best of our knowledge, no attempts have been made to use deep learning for seagrass detection from underwater digital images. Possible reasons include the unavailability of sufficient image data to train a deep neural network. In this work, we have proposed a Faster R-CNN based deep learning detector that automatically detects Halophila ovalis (a common seagrass species) from underwater digital images. To train the object detector, we collected a total of 2,699 underwater images, both from real-life shorelines and from an experimental facility. The selected seagrass (Halophila ovalis) instances were labelled using the LabelImg software, commonly used by the research community, and a seagrass expert reviewed the extracted labels. We used VGG16, ResNet50, Inception V2, and NASNet in the Faster R-CNN object detection framework, all originally trained on the COCO dataset, and applied transfer learning to retrain them on our collected dataset to detect the seagrasses. The Inception V2 based Faster R-CNN achieved the highest mean average precision (mAP) of 0.261. The detection models proposed in this dissertation can be transfer-learned with labelled two-dimensional digital images of other seagrass species and used to detect them from underwater seabed images automatically.
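The transfer-learning recipe described above (a COCO-pretrained detector retrained on a small domain-specific dataset) can be sketched with torchvision's Faster R-CNN implementation. The thesis used its own framework and backbone set, so treat the following only as an illustration of the general procedure of swapping the box predictor for a new class set; the tensor shapes and box values are placeholders.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from a COCO-pretrained Faster R-CNN.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Two classes: background + seagrass (Halophila ovalis).
num_classes = 2
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# One illustrative training step; `images` and `targets` would come from
# a labelled underwater dataset (e.g. LabelImg annotations converted to
# torchvision's {boxes, labels} format).
model.train()
images = [torch.rand(3, 600, 800)]
targets = [{"boxes": torch.tensor([[100.0, 120.0, 300.0, 280.0]]),
            "labels": torch.tensor([1])}]
loss_dict = model(images, targets)
loss = sum(loss_dict.values())
loss.backward()
```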
APA, Harvard, Vancouver, ISO, and other styles
46

Runow, Björn. "Deep Learning for Point Detection in Images." Thesis, Linköpings universitet, Datorseende, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-166644.

Full text
Abstract:
The main result of this thesis is a deep learning model named BearNet, which can be trained to detect an arbitrary number of objects as a set of points. The model is trained using the Weighted Hausdorff distance as its loss function. BearNet has been applied and tested on two problems from industry: from an intensity image, detect the two pocket points of an EU-pallet, which an autonomous forklift could use when determining where to insert its forks; and from a depth image, detect the start, bend, and end points of a straw attached to a juice package, to help determine whether the straw has been attached correctly. In the development process of BearNet I took inspiration from the designs of U-Net, UNet++ and a high-resolution network named HRNet. Further, I used a dataset containing RGB images from a surveillance camera located inside a mall, on which the aim was to detect the head positions of all pedestrians. In an attempt to reproduce a result from another study, I found that the mall dataset suffers from training set contamination when a model is trained, validated, and tested on it with random sampling. Hence, I propose that the mall dataset be evaluated with a sequential data split strategy to limit the problem. I found that the BearNet architecture is well suited for both the EU-pallet and straw datasets, and that it can be successfully used on RGB, intensity, or depth images. On the EU-pallet and straw datasets, BearNet consistently produces point estimates within five and six pixels of ground truth, respectively. I also show that the straw dataset only constitutes a small subset of all the challenges that exist in the problem domain related to the attachment of a straw to a juice package, and that one therefore cannot train a robust deep learning model on it. As an example of this, models trained on the straw dataset cannot correctly handle samples in which there is no straw visible.
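The Weighted Hausdorff distance rewards a probability map whose activated pixels lie close to the ground-truth points and vice versa. The published formulation (Ribera et al.) uses a generalised mean in its second term; the sketch below replaces it with a plain minimum for readability, so the exact weighting is a simplification rather than BearNet's actual implementation.

```python
import torch

def weighted_hausdorff(prob_map, gt_points, eps=1e-6):
    """prob_map: (H, W) values in [0,1]; gt_points: (N, 2) pixel coords (row, col)."""
    h, w = prob_map.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys, xs], dim=-1).float().reshape(-1, 2)  # (H*W, 2)
    p = prob_map.reshape(-1)                                       # (H*W,)
    d = torch.cdist(coords, gt_points.float())                     # (H*W, N)

    # Term 1: every activated pixel should be near some ground-truth point.
    term1 = (p * d.min(dim=1).values).sum() / (p.sum() + eps)
    # Term 2: every ground-truth point should be near some activated pixel;
    # low-probability pixels are pushed far away before taking the minimum.
    d_weighted = d + (1.0 - p).unsqueeze(1) * d.max()
    term2 = d_weighted.min(dim=0).values.mean()
    return term1 + term2

prob = torch.rand(64, 64, requires_grad=True)   # placeholder network output
pts = torch.tensor([[10.0, 12.0], [40.0, 50.0]])
loss = weighted_hausdorff(prob, pts)
loss.backward()
```

Because both terms are differentiable in the probability map, the loss can supervise a fully convolutional network end to end without any box or count annotations.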
APA, Harvard, Vancouver, ISO, and other styles
47

Öhman, Wilhelm. "Data augmentation using military simulators in deep learning object detection applications." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-264917.

Full text
Abstract:
While deep learning solutions have made great progress in recent years, the requirement for large labeled datasets still limits their practical use in certain areas. This problem is especially acute in domains where even unlabeled data is a limited resource, such as the military domain. Synthetic, or artificially generated, data has recently attracted attention as a potential solution to this problem. This thesis explores the possibility of using synthetic data to improve the performance of a neural network aimed at detecting and localizing firearms in images. The military simulator VBS3 is used to generate the synthetic data. Using a Faster R-CNN architecture, multiple models were trained on a range of datasets consisting of varying amounts of real and synthetic data. The synthetic datasets were generated following two different philosophies: one strives for realism, while the other foregoes realism in favor of greater variation. It was shown that the synthetic dataset striving for variation increased object detection performance when used in conjunction with real data, while the dataset striving for realism gave mixed results.
Solutions that use deep learning have made great progress in recent years, but the requirement for large labeled datasets is a limiting factor. This is an even greater problem in domains where even unlabeled data is hard to obtain, such as the military domain. Synthetic data has recently attracted attention as a potential solution to this problem. This thesis explores the possibility of using synthetic data as a way to improve the performance of a deep learning solution, a neural network tasked with detecting and localizing firearms in images. The military simulator VBS3 is used to generate synthetic data, and the network uses a Faster R-CNN architecture. With this, several models were trained on varying amounts of real and synthetic data. Furthermore, the synthetic datasets were generated following two different philosophies: one dataset tries to mimic the real world, while the other discards realism in favor of variation. It is shown that the dataset striving for variation yielded increased performance on the weapon detection task, while the dataset striving for realism gave mixed results.
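Mixing real and simulator-generated data, as evaluated above, often amounts in practice to concatenating two datasets and sampling uniformly from the union. A minimal sketch of that setup follows; the stand-in dataset class, image sizes, and the real-to-synthetic ratio are illustrative assumptions, not the thesis's configuration.

```python
import torch
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class DetectionFolder(Dataset):
    """Hypothetical stand-in for a folder of labelled detection images."""
    def __init__(self, n_items):
        self.n_items = n_items
    def __len__(self):
        return self.n_items
    def __getitem__(self, idx):
        image = torch.rand(3, 416, 416)  # placeholder image tensor
        target = {"boxes": torch.tensor([[50.0, 60.0, 200.0, 220.0]]),
                  "labels": torch.tensor([1])}
        return image, target

real_ds = DetectionFolder(n_items=500)    # real photographs
synth_ds = DetectionFolder(n_items=2000)  # varied renders exported from VBS3

mixed = ConcatDataset([real_ds, synth_ds])
loader = DataLoader(mixed, batch_size=4, shuffle=True,
                    collate_fn=lambda batch: tuple(zip(*batch)))
images, targets = next(iter(loader))
```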
APA, Harvard, Vancouver, ISO, and other styles
48

Estgren, Martin. "Bone Fragment Segmentation Using Deep Interactive Object Selection." Thesis, Linköpings universitet, Datorseende, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-157668.

Full text
Abstract:
In recent years, semantic segmentation models utilizing Convolutional Neural Networks (CNN) have seen significant success for multiple different segmentation problems. Models such as U-Net have produced promising results within the medical field for both regular 2D and volumetric imaging, rivalling some of the best classical segmentation methods. In this thesis we examined the possibility of using a convolutional neural network-based model to perform segmentation of discrete bone fragments in CT volumes, with segmentation hints provided by a user. We additionally examined different classical segmentation methods used in a post-processing refinement stage and their effect on the segmentation quality. We compared the performance of our model to similar approaches and provided insight into how the interactive aspect of the model affected the quality of the result. We found that the combined approach of interactive segmentation and deep learning produced results on par with some of the best methods presented, provided there was an adequate amount of annotated training data. We additionally found that the number of segmentation hints provided to the model by the user significantly affected the quality of the result, with the result converging at around 8 provided hints.
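A common way to feed user hints to a segmentation CNN, popularised by the Deep Interactive Object Selection line of work, is to encode clicks as truncated distance maps concatenated to the input as extra channels. The sketch below illustrates that encoding; the truncation constant follows the original paper, while the channel layout and image sizes are assumptions about the general technique, not the thesis's exact pipeline.

```python
import torch

def clicks_to_channel(clicks, shape, truncate=255.0):
    """clicks: list of (row, col); returns an (H, W) truncated distance map."""
    h, w = shape
    if not clicks:
        return torch.full(shape, truncate)  # no hints: maximal distance everywhere
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys, xs], dim=-1).float().reshape(-1, 2)
    pts = torch.tensor(clicks, dtype=torch.float32)            # (N, 2)
    d = torch.cdist(coords, pts).min(dim=1).values             # distance to nearest click
    return d.reshape(h, w).clamp(max=truncate)

image = torch.rand(3, 128, 128)                    # e.g. one CT slice as 3 channels
pos = clicks_to_channel([(40, 52)], (128, 128))    # foreground hints
neg = clicks_to_channel([(10, 100)], (128, 128))   # background hints
net_input = torch.cat([image, pos[None], neg[None]], dim=0)  # 5-channel input
```

The network then learns to treat small values in the hint channels as strong local evidence, which is consistent with the observation above that result quality improves with the number of hints.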
APA, Harvard, Vancouver, ISO, and other styles
49

Case, Isaac. "Automatic object detection and tracking in video /." Online version of thesis, 2010. http://hdl.handle.net/1850/12332.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Cogswell, Michael Andrew. "Understanding Representations and Reducing their Redundancy in Deep Networks." Thesis, Virginia Tech, 2016. http://hdl.handle.net/10919/78167.

Full text
Abstract:
Neural networks in their modern deep learning incarnation have achieved state-of-the-art performance on a wide variety of tasks and domains. A core intuition behind these methods is that they learn layers of features which interpolate between two domains in a series of related parts. The first part of this thesis introduces the building blocks of neural networks for computer vision. It starts with linear models, then proceeds to deep multilayer perceptrons and convolutional neural networks, presenting the core details of each; the introduction also builds intuition by visualizing concrete examples of the parts of a modern network. The second part of this thesis investigates regularization of neural networks. Methods like dropout have been proposed to favor certain (empirically better) solutions over others, yet big deep neural networks still overfit very easily. This section proposes a new regularizer called DeCov, which leads to significantly reduced overfitting (the difference between train and validation performance) and greater generalization, sometimes better than dropout and other times not. The regularizer is based on the cross-covariance of hidden representations and takes advantage of the intuition that different features should try to represent different things, an intuition others have explored with similar losses. Experiments across a range of datasets and network architectures demonstrate reduced overfitting due to DeCov while almost always maintaining or increasing generalization performance, often improving over dropout.
Master of Science
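The DeCov penalty itself is compact: it is the Frobenius energy of the batch cross-covariance of a hidden layer with the diagonal removed, so features are pushed to decorrelate while keeping their individual variance. A minimal PyTorch sketch follows, with illustrative batch and feature sizes; where in the network and with what weight the penalty is applied would be training choices, not fixed by the formula.

```python
import torch

def decov_loss(h):
    """h: (batch, features) hidden activations; returns the DeCov penalty."""
    centered = h - h.mean(dim=0, keepdim=True)
    cov = centered.t() @ centered / h.shape[0]     # (features, features) covariance
    frob_sq = (cov ** 2).sum()
    diag_sq = (torch.diagonal(cov) ** 2).sum()
    return 0.5 * (frob_sq - diag_sq)               # off-diagonal energy only

hidden = torch.rand(32, 256, requires_grad=True)   # placeholder activations
reg = decov_loss(hidden)
reg.backward()
```

In training, the total objective would be the task loss plus a small multiple of this penalty, applied much like weight decay.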
APA, Harvard, Vancouver, ISO, and other styles