Dissertations / Theses on the topic 'Scene parsing'


Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the top 16 dissertations / theses for your research on the topic 'Scene parsing.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Zhao, Hang. "Visual and auditory scene parsing." Thesis (Ph.D.), Massachusetts Institute of Technology, 2019. https://hdl.handle.net/1721.1/122101.

Full text
Abstract:
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Thesis: Ph. D. in Mechanical Engineering and Computation, Massachusetts Institute of Technology, Department of Mechanical Engineering, 2019
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 121-132).
Scene parsing is a fundamental topic in computer vision and computational audition, where computational approaches are developed to match the human perceptual system's ability to understand scenes, e.g., grouping the visual regions of an image into objects and segregating sound components in a noisy environment. This thesis investigates fully-supervised and self-supervised machine learning approaches to parse visual and auditory signals, including images, videos, and audio. Visual scene parsing refers to the dense grouping and labeling of image regions into object concepts. First, I build the MIT Scene Parsing Benchmark on top of ADE20K, a large-scale, densely annotated dataset. This benchmark, together with the state-of-the-art models we open-source, offers a powerful tool for the research community to solve semantic and instance segmentation tasks. Then I investigate the challenge of parsing a large number of object categories in the wild, and propose an open-vocabulary scene parsing model that combines a convolutional neural network with a structured knowledge graph. Auditory scene parsing refers to recognizing and decomposing sound components in complex auditory environments. I propose a general audio-visual self-supervised learning framework that learns from a large amount of unlabeled internet videos. The learning process discovers the natural synchronization of vision and sound without human annotation. The learned model is able to localize sound sources in videos and separate them from the mixture. Furthermore, I demonstrate that motion cues in videos are tightly associated with sounds, which helps in solving sound localization and separation problems.
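To make the self-supervised objective concrete: the framework above learns by mixing the audio of two unlabeled videos and asking the model to recover each original track conditioned on its video's visual features. The sketch below is a rough PyTorch illustration of that "mix-and-separate" idea; `sound_net` and `mask_net` are hypothetical modules standing in for the thesis's actual architecture, and summing magnitude spectrograms is an approximation.

```python
import torch
import torch.nn.functional as F

def mix_and_separate_loss(audio_a, audio_b, visual_a, visual_b,
                          sound_net, mask_net):
    """audio_*: magnitude spectrograms (N, F, T); visual_*: frame features."""
    mixture = audio_a + audio_b                        # synthetic training mixture
    feats = sound_net(mixture)                         # shared audio analysis
    mask_a = torch.sigmoid(mask_net(feats, visual_a))  # visually conditioned masks
    mask_b = torch.sigmoid(mask_net(feats, visual_b))
    # Each mask applied to the mixture should reconstruct its own source.
    return (F.l1_loss(mask_a * mixture, audio_a) +
            F.l1_loss(mask_b * mixture, audio_b))
```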
by Hang Zhao.
Ph.D. in Mechanical Engineering and Computation, Massachusetts Institute of Technology, Department of Mechanical Engineering
APA, Harvard, Vancouver, ISO, and other styles
2

Lan, Cyril. "Urban scene parsing via low-rank texture patches." Thesis, Massachusetts Institute of Technology, 2012. http://hdl.handle.net/1721.1/77536.

Full text
Abstract:
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 52-55).
Automatic 3-D reconstruction of city scenes from ground, aerial, and satellite imagery is a difficult problem that has seen active research for nearly two decades. The problem is difficult because many algorithms require salient areas in the image to be identified and segmented, a task that is typically done by humans. We propose a pipeline that detects these salient areas using low-rank texture patches. Areas in images such as building facades contain low-rank textures, which are an intrinsic property of the scene and invariant to viewpoint. The pipeline uses these low-rank patches to automatically rectify images and detect and segment out the patches with an energy-minimizing graph cut. The output is then further parameterized to provide useful data to existing 3-D reconstruction methods. The pipeline was evaluated on challenging test images from Microsoft Bing Maps oblique aerial photography and produced an 80% recall and precision with superb empirical results.
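As a small illustration of the low-rank texture cue, one can score a rectified grayscale patch by how much of its spectral energy sits in its top few singular values; regular facade textures concentrate energy there, while clutter does not. The score definition and the choice of k below are illustrative assumptions, not the thesis's exact criterion.

```python
import numpy as np

def lowrank_score(patch_gray, k=3):
    """Fraction of spectral energy in the top-k singular values of a patch.

    Regular, repetitive textures (e.g., rectified building facades) put most
    of their energy in a few singular values, so scores near 1 suggest a
    low-rank texture; textureless or cluttered patches score lower.
    """
    s = np.linalg.svd(np.asarray(patch_gray, dtype=float), compute_uv=False)
    return s[:k].sum() / max(s.sum(), 1e-12)
```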
by Cyril Lan.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
3

Tung, Frederick. "Towards large-scale nonparametric scene parsing of images and video." Thesis, University of British Columbia, 2017. http://hdl.handle.net/2429/60790.

Full text
Abstract:
In computer vision, scene parsing is the problem of labelling every pixel in an image or video with its semantic category. Its goal is a complete and consistent semantic interpretation of the structure of the real world scene. Scene parsing forms a core component in many emerging technologies such as self-driving vehicles and prosthetic vision, and also informs complementary computer vision tasks such as depth estimation. This thesis presents a novel nonparametric scene parsing framework for images and video. In contrast to conventional practice, our scene parsing framework is built on nonparametric search-based label transfer instead of discriminative classification. We formulate exemplar-based scene parsing for both 2D (from images) and 3D (from video), and demonstrate accurate labelling on standard benchmarks. Since our framework is nonparametric, it is easily extensible to new categories and examples as the database grows. Nonparametric scene parsing is computationally demanding at test time, and requires methods for searching large collections of data that are time and memory efficient. This thesis also presents two novel binary encoding algorithms for large-scale approximate nearest neighbor search: the bank of random rotations is data independent and does not require training, while the supervised sparse projections algorithm targets efficient search of high-dimensional labelled data. We evaluate these algorithms on standard retrieval benchmarks, and then demonstrate their integration into our nonparametric scene parsing framework. Using 256-bit codes, binary encoding reduces search times by an order of magnitude and memory requirements by three orders of magnitude, while maintaining a mean per-class accuracy within 1% on the 3D scene parsing task.
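A minimal sketch of the data-independent binary encoding idea: rotate zero-centered descriptors by a random orthonormal matrix, binarize by sign, and search by Hamming distance. Details such as the centering step and the toy data are assumptions for illustration; the thesis's bank of rotations and its supervised variant are more elaborate.

```python
import numpy as np

def random_rotation(dim, rng):
    """Random orthonormal matrix via QR of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q

def encode(features, rotation, mean):
    """Rotate zero-centered features, binarize by sign, pack to bytes."""
    bits = (features - mean) @ rotation > 0
    return np.packbits(bits, axis=1)

def hamming_search(query_code, db_codes, k=5):
    """Indices of the k codes nearest to query_code in Hamming distance."""
    dists = np.unpackbits(query_code ^ db_codes, axis=1).sum(axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
db = rng.standard_normal((10000, 256)).astype(np.float32)  # toy descriptors
mean = db.mean(axis=0)
R = random_rotation(256, rng)                              # one 256-bit rotation
db_codes = encode(db, R, mean)
query = db[:1] + 0.01 * rng.standard_normal((1, 256))      # perturbed copy of item 0
print(hamming_search(encode(query, R, mean)[0], db_codes)) # item 0 should rank first
```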
Science, Faculty of
Computer Science, Department of
Graduate
APA, Harvard, Vancouver, ISO, and other styles
4

Shu, Allen. "Use of shot/scene parsing in generating and browsing video databases." Thesis, Massachusetts Institute of Technology, 1995. http://hdl.handle.net/1721.1/36985.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Pan, Hong. "Superparsing with Improved Segmentation Boundaries through Nonparametric Context." Thesis, Université d'Ottawa / University of Ottawa, 2015. http://hdl.handle.net/10393/32329.

Full text
Abstract:
Scene parsing, or segmenting all the objects in an image and identifying their categories, is one of the core problems of computer vision. In order to achieve object-level semantic segmentation, we build upon the recent superparsing approach by Tighe and Lazebnik, which is a nonparametric solution to the image labeling problem. Superparsing consists of four steps. For a new query image, the most similar images from the training dataset of labeled images are retrieved based on global features. In the second step, the query image is segmented into superpixels and 20 different local features are computed for each superpixel. We propose to use the SLICO segmentation method, which allows control of the size, shape and compactness of the superpixels, because SLICO is able to produce accurate boundaries. After all superpixel features have been extracted, feature-based matching of superpixels is performed to find the nearest-neighbour superpixels in the retrieval set for each query superpixel. Based on the neighbouring superpixels, a likelihood score for each class is calculated. Finally, we formulate a Conditional Random Field (CRF) using the likelihoods and a pairwise cost, both computed from nonparametric estimation, to optimize the labeling of the image. Specifically, we define a novel pairwise cost that provides stronger semantic contextual constraints by incorporating the similarity of adjacent superpixels depending on local features. The optimized labeling obtained with the CRF groups superpixels with the same labels together to generate segmentation results that also identify the categories of objects in the image. We evaluate our improvements to the superparsing approach using segmentation evaluation measures as well as the per-pixel rate and average per-class rate in a labeling evaluation. We demonstrate the success of our modified approach on the SIFT Flow dataset, and compare our results with the basic superparsing method proposed by Tighe and Lazebnik.
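A condensed sketch of the pipeline described above, using scikit-image's SLICO mode (`slic_zero=True`) and plain nearest-neighbour label transfer; the mean-colour features and k-NN classifier stand in for the 20 local features and the likelihood/CRF stages, which are omitted.

```python
import numpy as np
from skimage.segmentation import slic
from sklearn.neighbors import KNeighborsClassifier

def superpixel_features(image, labels):
    """Mean RGB colour of each superpixel (a stand-in for the 20 features)."""
    ids = np.unique(labels)
    return np.array([image[labels == i].mean(axis=0) for i in ids]), ids

def parse(query_image, train_feats, train_labels, n_segments=200):
    # SLICO (slic_zero=True) adapts compactness per superpixel, which helps
    # produce the accurate boundaries the thesis relies on.
    sp = slic(query_image, n_segments=n_segments, slic_zero=True, start_label=0)
    feats, ids = superpixel_features(query_image, sp)
    # Nearest-neighbour label transfer from the retrieval set's superpixels.
    knn = KNeighborsClassifier(n_neighbors=5).fit(train_feats, train_labels)
    pred = knn.predict(feats)
    out = np.zeros(sp.shape, dtype=int)
    for i, cls in zip(ids, pred):
        out[sp == i] = cls
    return out
```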
APA, Harvard, Vancouver, ISO, and other styles
6

Munoz, Daniel. "Inference Machines: Parsing Scenes via Iterated Predictions." Research Showcase @ CMU, 2013. http://repository.cmu.edu/dissertations/305.

Full text
Abstract:
Extracting a rich representation of an environment from visual sensor readings can benefit many tasks in robotics, e.g., path planning, mapping, and object manipulation. While important progress has been made, it remains a difficult problem to effectively parse entire scenes, i.e., to recognize semantic objects, man-made structures, and landforms. This process requires not only recognizing individual entities but also understanding the contextual relations among them. The prevalent approach to encode such relationships is to use a joint probabilistic or energy-based model which enables one to naturally write down these interactions. Unfortunately, performing exact inference over these expressive models is often intractable and instead we can only approximate the solutions. While there exists a set of sophisticated approximate inference techniques to choose from, the combination of learning and approximate inference for these expressive models is still poorly understood in theory and limited in practice. Furthermore, using approximate inference on any learned model often leads to suboptimal predictions due to the inherent approximations. As we ultimately care about predicting the correct labeling of a scene, and not necessarily learning a joint model of the data, this work proposes to instead view the approximate inference process as a modular procedure that is directly trained in order to produce a correct labeling of the scene. Inspired by early hierarchical models in the computer vision literature for scene parsing, the proposed inference procedure is structured to incorporate both feature descriptors and contextual cues computed at multiple resolutions within the scene. We demonstrate that this inference machine framework for parsing scenes via iterated predictions offers the best of both worlds: state-of-the-art classification accuracy and computational efficiency when processing images and/or unorganized 3-D point clouds. Additionally, we address critical problems that arise in practice when parsing scenes on board real-world systems: integrating data from multiple sensor modalities and efficiently processing data that is continuously streaming from the sensors.
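The following toy sketch illustrates the inference machine idea of iterated predictions: a sequence of classifiers, each consuming local features plus contextual predictions from the previous round. Real inference machines pass messages over hierarchical regions and use held-out (stacked) predictions during training; both refinements are omitted here, and every element is assumed to have at least one neighbour.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_inference_machine(X, y, neighbors, rounds=3):
    """X: (n, d) local features; y: (n,) labels; neighbors[i]: indices of i's neighbours."""
    n_classes = len(np.unique(y))
    stages, context = [], np.zeros((len(X), n_classes))
    for _ in range(rounds):
        # Each stage sees the raw features plus the contextual beliefs so far.
        clf = LogisticRegression(max_iter=1000).fit(np.hstack([X, context]), y)
        probs = clf.predict_proba(np.hstack([X, context]))
        # New context: mean class posterior over each element's neighbours.
        context = np.array([probs[nb].mean(axis=0) for nb in neighbors])
        stages.append(clf)
    return stages
```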
APA, Harvard, Vancouver, ISO, and other styles
7

Taghavi, Namin Sarah. "Scene Parsing using Multiple Modalities." Ph.D. thesis, 2016. http://hdl.handle.net/1885/116781.

Full text
Abstract:
Scene parsing is the task of assigning a semantic class label to the elements of a scene. It has many applications in autonomous systems when we need to understand the visual data captured from our environment. Different sensing modalities, such as RGB cameras, multi-spectral cameras and Lidar sensors, can be beneficial when pursuing this goal. Scene analysis using multiple modalities aims at leveraging the complementary information captured by multiple sensing modalities. When multiple modalities are used together, the strength of each modality can combat the weaknesses of the others. Therefore, working with multiple modalities enables us to use powerful tools for scene analysis. However, the possible gains of using multiple modalities come with new challenges, such as dealing with misalignments between different modalities. In this thesis, our aim is to take advantage of multiple modalities to improve outdoor scene parsing and address the associated challenges. We initially investigate the potential of multi-spectral imaging for outdoor scene analysis. Our approach is to combine the discriminative strength of the multi-spectral signature in each pixel and the corresponding nature of the surrounding texture. Many materials that appear similar when viewed by a common RGB camera show discriminating properties when viewed by a camera capturing a greater number of separated wavelengths. When using imagery data for scene parsing, a number of challenges stem from, e.g., color saturation, shadow and occlusion. To address such challenges, we focus on scene parsing using multiple modalities, panoramic RGB images and 3D Lidar data in particular, and propose a multi-view approach to select the best 2D view that describes each element in the 3D point cloud data. Keeping our focus on using multiple modalities, we then introduce a multi-modal graphical model to address the problems of scene parsing using 2D-3D data exhibiting extensive many-to-one correspondences. Existing methods often impose a hard correspondence between the 2D and 3D data, where corresponding 2D and 3D regions are forced to receive identical labels. This results in performance degradation due to misalignments, 3D-2D projection errors and occlusions. We address this issue by defining a graph over the entire set of data that models soft correspondences between the two modalities. This graph encourages each region in a modality to leverage the information from its corresponding regions in the other modality to better estimate its class label. Finally, we introduce latent nodes to explicitly model inconsistencies between the modalities. The latent nodes allow us not only to leverage information from various domains in order to improve the labeling of the modalities, but also to cut the edges between inconsistent regions. To eliminate the need for hand-tuning the parameters of our model, we propose to learn the potential functions from training data. In addition, to demonstrate the benefits of the proposed approaches on publicly available multi-modality datasets, we introduce a new multi-modal dataset of panoramic images and 3D point cloud data captured from outdoor scenes (NICTA/2D3D Dataset).
APA, Harvard, Vancouver, ISO, and other styles
8

Wang, Ren, and 王任. "Transferring Weakly-Supervised Convolutional Networks for Scene Parsing." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/29046824010257775924.

Full text
Abstract:
Master's thesis
National Tsing Hua University
Department of Computer Science
Academic year 103 (2014-2015)
Deep neural networks have become more and more popular in computer vision because of their powerful ability to extract distinctive image features. In deep neural networks, transfer learning plays an important role in avoiding overfitting. In this thesis, we present a clustering-based method to combine fully-labeled data with weakly-labeled data for convolutional networks. Through transfer learning, these convolutional networks can be viewed as pre-trained models for another target task. Next, we design a framework of convolutional networks for scene parsing to demonstrate our idea. Preliminary experimental results show that using these pre-trained convolutional networks for transfer learning is helpful.
APA, Harvard, Vancouver, ISO, and other styles
9

Yu, Jie-Kuan, and 余界寬. "A Scene Parsing and Classification Method for Baseball Videos." Thesis, 2006. http://ndltd.ncl.edu.tw/handle/wrt3k3.

Full text
Abstract:
Master's thesis
National Taipei University of Technology
Department of Computer Science and Information Engineering
Academic year 94 (2005-2006)
This thesis proposes a scene parsing and classification system for baseball videos. The system automatically parses baseball video and extracts important scenes through image content analysis. First, the system selects several candidate important scenes using the field/clothing color ratio and scene change detection. Second, the system analyzes each candidate scene using image features such as object motion detection, field and clothing color detection, camera motion parameters, key-frame analysis, and motion-map comparison. Finally, the system classifies scenes according to the above features and predefined rules, and establishes indexes of the scenes corresponding to the rules in a baseball video database.
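As a rough illustration of the first filtering step, the snippet below computes a field-colour ratio by HSV thresholding and flags shot changes by comparing hue histograms of consecutive frames. The colour ranges and the Bhattacharyya threshold are illustrative guesses, not values from the thesis.

```python
import cv2

def field_ratio(frame_bgr):
    """Fraction of pixels falling in a (guessed) green field colour range."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    grass = cv2.inRange(hsv, (35, 40, 40), (85, 255, 255))
    return grass.mean() / 255.0

def shot_change(prev_bgr, cur_bgr, thresh=0.5):
    """Compare normalized hue histograms of consecutive frames."""
    hists = []
    for f in (prev_bgr, cur_bgr):
        hsv = cv2.cvtColor(f, cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0], None, [50], [0, 180])
        hists.append(cv2.normalize(h, None))
    # Bhattacharyya distance near 1 means very different hue content.
    return cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_BHATTACHARYYA) > thresh
```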
APA, Harvard, Vancouver, ISO, and other styles
10

He, Tong. "Efficient Scene Parsing with Imagery and Point Cloud Data." Thesis, 2020. http://hdl.handle.net/2440/129534.

Full text
Abstract:
Scene parsing, which aims to provide a comprehensive understanding of the scene, is a fundamental task in the field of computer vision and remains a challenging problem for unconstrained environments and open scenes. The results of scene parsing provide semantic labels, location distributions, and instance shape information for each element, which has shown great potential in applications such as autonomous driving and video surveillance. Moreover, the efficiency of a method determines whether it can be used at scale. With the easy availability of various sensors, more and more solutions resort to different data modalities according to the requirements of the application. Imagery and point clouds are two representative data sources. How to design efficient frameworks in each separate domain remains an open problem and, more importantly, lays a solid foundation for multi-modal fusion. In this thesis, we study the task of scene parsing under different data modalities, i.e., imagery and point cloud data, using deep neural networks. The first part of this thesis addresses the task of efficient semantic segmentation on 2D image data. The aim is to improve the accuracy of small models while maintaining their fast inference speed without introducing extra computation overhead. To achieve this, we propose a knowledge-distillation-based method tailored for semantic segmentation that improves the performance of a small Fully Convolutional Network (FCN) by injecting compact feature representations and long-tail dependencies from a large, complex FCN (incorporated in Chapter 3). The second part of this thesis addresses the task of semantic and instance segmentation on point cloud data. Compared to rasterized image data, point cloud data often suffer from two problems: (1) how to efficiently extract and aggregate context information, and (2) how to solve the forgetting issue (Lin et al., 2017c) caused by extreme data imbalance. For the first problem, we study the influence of instance-aware knowledge by proposing an Instance-Aware Module that learns discriminative instance embedding features via metric learning (incorporated in Chapter 4). We address the second problem by proposing a memory-augmented network that learns and memorizes representative prototypes covering diverse samples universally (incorporated in Chapter 5).
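The Chapter 3 method builds on pixel-wise knowledge distillation, of which this is a minimal sketch: the small FCN is trained to match the large FCN's softened per-pixel class distribution alongside the usual cross-entropy. The structural terms the thesis adds (compact feature representations, long-tail dependencies) are not shown, and the temperature and weighting values are conventional defaults rather than the thesis's settings.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, T=2.0, alpha=0.5):
    """student/teacher logits: (N, C, H, W); target: (N, H, W) class indices."""
    ce = F.cross_entropy(student_logits, target)        # supervised term
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),       # softened student
        F.softmax(teacher_logits / T, dim=1),           # softened teacher
        reduction="batchmean",
    ) * T * T                                           # standard temperature scaling
    return alpha * ce + (1 - alpha) * kd
```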
Thesis (Ph.D.) -- University of Adelaide, School of Computer Science, 2020
APA, Harvard, Vancouver, ISO, and other styles
11

Ma, Chih Hao, and 馬智豪. "Nonparametric Scene Parsing with Deep Convolutional Features and Dense Alignment." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/65701079918564835437.

Full text
Abstract:
Master's thesis
National Tsing Hua University
Department of Computer Science
Academic year 103 (2014-2015)
This thesis addresses two key issues which concern the performance of nonparametric scene parsing: (1) the semantic quality of image retrieval; and (2) the accuracy in label transfer. First, because nonparametric methods annotate a query image through transferring labels from retrieved images, the task of image retrieval should find a set of “semantically similar” images to the query. Second, with the retrieval set, a good strategy should be developed to transfer semantic labels in pixel-level accuracy. In this thesis, we focus on improving scene parsing accuracy in these two issues. We propose using the state-of-the-art deep convolutional features as visual descriptors to improve the semantic quality of retrieved images. In addition, we include dense alignment into the Markov Random Field (MRF) inference framework to transfer labels at pixel-level accuracy. Next, we utilize the derived semantic labels as queries to expand the retrieval set and then conduct the second-round label transfer. Finally, we combine label transferring cues of two rounds into the MRF model to improve the labeling results. Our experiments on the SIFT Flow dataset and LMSun dataset show the improvement of the proposed approach over other nonparametric methods.
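A minimal sketch of the retrieval step: deep convolutional features serve as global descriptors and cosine similarity ranks the training images. A modern torchvision backbone (and its recent `weights=` API) stands in for the descriptors used in the thesis; the dense-alignment and MRF label-transfer stages are omitted.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()          # keep the 512-d pooled feature
backbone.eval()

@torch.no_grad()
def describe(batch):
    """batch: (N, 3, 224, 224), ImageNet-normalized -> unit-norm descriptors."""
    return F.normalize(backbone(batch), dim=1)

def retrieve(query_feat, db_feats, k=10):
    """Indices of the k most similar training images by cosine similarity."""
    sims = db_feats @ query_feat           # db_feats: (M, 512); query_feat: (512,)
    return torch.topk(sims, k).indices
```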
APA, Harvard, Vancouver, ISO, and other styles
12

Najafi, Mohammad. "On the Role of Context at Different Scales in Scene Parsing." Ph.D. thesis, 2017. http://hdl.handle.net/1885/116302.

Full text
Abstract:
Scene parsing can be formulated as a labeling problem where each visual data element, e.g., each pixel of an image or each 3D point in a point cloud, is assigned a semantic class label. One can approach this problem by training a classifier and predicting a class label for the data elements purely based on their local properties. This approach, however, does not take into account any kind of contextual information between different elements in the image or point cloud. For example, in an application where we are interested in labeling roadside objects, the fact that most of the utility poles are connected to some power wires can be very helpful in disambiguating them from other similar looking classes. Recurrence of certain class combinations can be also considered as a good contextual hint since they are very likely to co-occur again. These forms of high-level contextual information are often formulated using pairwise and higher-order Conditional Random Fields (CRFs). A CRF is a probabilistic graphical model that encodes the contextual relationships between the data elements in a scene. In this thesis, we study the potential of contextual information at different scales (ranges) in scene parsing problems. First, we propose a model that utilizes the local context of the scene via a pairwise CRF. Our model acquires contextual interactions between different classes by assessing their misclassification rates using only the local properties of data. In other words, no extra training is required for obtaining the class interaction information. Next, we expand the context field of view from a local range to a longer range, and make use of higher-order models to encode more complex contextual cues. More specifically, we introduce a new model to employ geometric higher-order terms in a CRF for semantic labeling of 3D point cloud data. Despite the potential of the above models at capturing the contextual cues in the scene, there are higher-level context cues that cannot be encoded via pairwise and higher-order CRFs. For instance, a vehicle is very unlikely to appear in a sea scene, or buildings are frequently observed in a street scene. Such information can be described using scene context and are modeled using global image descriptors. In particular, through an image retrieval procedure, we find images whose content is similar to that of the query image, and use them for scene parsing. Another problem of the above methods is that they rely on a computationally expensive training process for the classification using the local properties of data elements, which needs to be repeated every time the training data is modified. We address this issue by proposing a fast and efficient approach that exempts us from the cumbersome training task, by transferring the ground-truth information directly from the training data to the test data.
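The first model's trick of reading class interactions off the local classifier's error statistics can be sketched as follows: a confusion matrix yields a pairwise penalty (classes that are frequently confused co-occur cheaply on adjacent elements), and a simple ICM sweep stands in for proper CRF inference. Both the penalty definition and the inference are illustrative simplifications.

```python
import numpy as np

def pairwise_from_confusion(confusion):
    """Turn classifier confusion counts into a (C, C) pairwise penalty.

    Classes often confused with each other get a low cost for appearing on
    neighbouring elements; rarely confused pairs get a high cost.
    """
    p = confusion / (confusion.sum(axis=1, keepdims=True) + 1e-12)
    affinity = (p + p.T) / 2
    return -np.log(affinity + 1e-6)

def icm(unary, pairwise, neighbors, sweeps=5):
    """unary: (n, C) negative log-likelihoods; neighbors[i]: adjacency list."""
    labels = unary.argmin(axis=1)
    for _ in range(sweeps):
        for i in range(len(unary)):
            costs = unary[i] + sum(pairwise[:, labels[j]] for j in neighbors[i])
            labels[i] = costs.argmin()
    return labels
```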
APA, Harvard, Vancouver, ISO, and other styles
13

Wang, Yi-Ru, and 王怡儒. "Incremental object detection and scene parsing from a moving vehicle via exemplar cut." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/y9fr2u.

Full text
Abstract:
Master's thesis
National Tsing Hua University
Department of Computer Science
Academic year 105 (2016-2017)
This thesis presents a nonparametric scene parsing system based on superpixel matching and exemplar cut. Foreground classes are often neglected by other algorithms since they occupy only a small portion of the pixels in an image. To solve this problem, we utilize the concept of an "exemplar" to improve their recognition rate. Our experimental images are distinctive in that they are photographed continuously from a moving vehicle, so the characteristics of successive frames can be exploited to raise labeling accuracy. By adding the previous parsing result to the retrieval set, we enhance the resemblance between the query image and the images in the retrieval set. We also remove images whose class proportions differ greatly from those of the previous frame, which prevents unlikely classes from appearing in the query image, and we add exemplars from the previous image to the candidate exemplars of the query image. This idea can hopefully be applied to autonomous driving in the near future. Our experimental dataset contains 4 foreground labels and 4 background labels. The system achieves state-of-the-art recognition rates in both per-pixel accuracy and per-class accuracy.
APA, Harvard, Vancouver, ISO, and other styles
14

Shih, Yi-Hsuan, and 施亦宣. "Vehicles Detection at Urban Intersections via Adaptive Neighbor Sets of Nonparametric Scene Parsing." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/2rpj85.

Full text
Abstract:
Master's thesis
National Tsing Hua University
Department of Computer Science
Academic year 105 (2016-2017)
Vehicle detection at urban intersections typically faces two challenges: a bounding box must be manually drawn around the target object in the first frame, and a target lost during tracking may lead to tracking errors. Hence, the specific objective of this thesis is to explore solutions to these problems. We apply nonparametric scene parsing to vehicle detection at urban intersections to automatically find car and motorcycle objects in the first frame without a manually given bounding box. Moreover, the annotation results of scene parsing can help recover lost objects. Nonparametric scene parsing has been widely studied; it annotates a query image by transferring labels from a training data set. Following the method of [5], our proposed approach first segments the images into superpixels. By computing features, we extract a set of similar images from the training data as the retrieval set. In addition, we learn weights for each image in the training data set to minimize classification error using a leave-one-out strategy. To boost the classification of rare classes, we compute the semantic context of segments in the training data set and add the nearest rare-class examples to the retrieval set. Finally, we minimize an energy function in a Markov Random Field (MRF) to label the query image. Since the scene of urban intersections is our main testing data set, we use background subtraction to extract foregrounds so as to reduce classification error. Our experimental results show that combining nonparametric scene parsing with background subtraction can effectively solve the problems of vehicle detection at urban intersections.
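A minimal sketch of the foreground-extraction step using OpenCV's MOG2 background subtractor; the threshold that drops shadow pixels and the morphological cleanup are illustrative choices, and the MRF labeling stage that follows in the thesis is not shown.

```python
import cv2

def foreground_masks(video_path):
    """Yield (frame, binary foreground mask) pairs via MOG2 subtraction."""
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=True)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)
        # MOG2 marks shadows as 127; thresholding at 200 keeps only foreground.
        mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)[1]
        mask = cv2.morphologyEx(
            mask, cv2.MORPH_OPEN,
            cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5)))
        yield frame, mask
    cap.release()
```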
APA, Harvard, Vancouver, ISO, and other styles
15

Liu, Keng-Chi, and 劉庚錡. "Low Discrepancy Adaptation with Weak Domain-specific Annotations for Efficient Indoor Scene Parsing." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/56zc4y.

Full text
Abstract:
Master's thesis
National Taiwan University
Graduate Institute of Electronics Engineering
Academic year 107 (2018-2019)
Developing autonomous mobile agents that can behave like humans based on their visual perception is a goal in the field of artificial intelligence, and pixel-wise visual cues such as scene parsing are beneficial to such high-level applications. Significant improvements on these tasks have been made in recent years due to the evolution of deep learning. Nevertheless, in addition to accuracy, efficiency remains a major issue. The term "efficiency" here refers to both data collection and computational complexity. Remarkable scene parsing results achieved by supervised methods rely on numerous pixel-level annotations, which are time-consuming and expensive to obtain. Hence, alleviating this cumbersome manual effort becomes a crucial issue in the training procedure. Synthetic rendered data and weakly-supervised methods have been explored to overcome this challenge; unfortunately, the former suffers from severe domain shift and the latter from imprecise information. Moreover, the majority of existing research on weak supervision is only capable of handling foreground salient "things". To address this issue, we employ an auxiliary teacher-student learning framework that trains such an untransferable task through pseudo-ground truths, constructed by adapting auxiliary cues with lower domain discrepancy (e.g., depth) and leveraging domain-specific information (e.g., real appearance) in weak form. This imperfect information is then integrated effectively by a two-stage voting mechanism. From the inference-phase perspective, complexity has always been the main issue for edge computing. A typical network requires large run-time memory and 32-bit floating-point computation. Furthermore, unlike general classification networks with only a few category outputs, an hourglass network's output has the same size and dimension as its input, which costs more resources. However, most previous research has focused on classification networks. In this thesis, considering the practicality and necessity of real-world applications, our goal is to develop an "efficient" scene parsing algorithm with a focus on three objectives: labeling, complexity, and performance. First, we show that depth diminishes more of the domain discrepancy for indoor scenes when min-max normalization is introduced into the loss function. Additionally, we argue that the generator for real-to-sim reconstruction is capable of performing unsupervised sensor depth map restoration. Second, we propose a scene parsing framework that performs auxiliary teacher-student learning with depth adaptation and domain-specific weak supervision. We train a network with a loss function that penalizes predictions disagreeing with the highly confident pseudo-ground truths provided by a two-stage integration mechanism, so as to produce more accurate segmentations. The proposed method outperforms the state-of-the-art adaptation method by 14.63% in terms of mean Intersection over Union (mIoU). Lastly, we extend an existing method to quantize the target lightweight scene parsing network to ternary weights and low bit-width activations (3-4 bits), which reduces the model size by 21.9x and the activation size by 8.2x with only a 1.8% mIoU loss.
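The last objective mentions quantizing the network to ternary weights; below is a sketch of a standard ternary quantizer in that spirit. The `0.7 * mean(|W|)` threshold follows a common heuristic from the ternary-weight-network literature and is an assumption, not necessarily the thesis's exact rule.

```python
import torch

def ternarize(weight):
    """Quantize a float tensor to scale * {-1, 0, +1} (illustrative rule)."""
    delta = 0.7 * weight.abs().mean()              # threshold between 0 and +/-1
    mask = (weight.abs() > delta).float()          # which weights stay nonzero
    scale = (weight.abs() * mask).sum() / mask.sum().clamp(min=1)
    return scale * torch.sign(weight) * mask
```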
APA, Harvard, Vancouver, ISO, and other styles
16

Liu, Buyu. "Efficient multi-level scene understanding in videos." Ph.D. thesis, 2016. http://hdl.handle.net/1885/110787.

Full text
Abstract:
Automatic video parsing is a key step towards human-level dynamic scene understanding, and a fundamental problem in computer vision. A core issue in video understanding is to infer multiple scene properties of a video in an efficient and consistent manner. This thesis addresses the problem of holistic scene understanding from monocular videos, which jointly reason about semantic and geometric scene properties from multiple levels, including pixelwise annotation of video frames, object instance segmentation in spatio-temporal domain, and/or scene-level description in terms of scene categories and layouts. We focus on four main issues in the holistic video understanding: 1) what is the representation for consistent semantic and geometric parsing of videos? 2) how do we integrate high-level reasoning (e.g., objects) with pixel-wise video parsing? 3) how can we do efficient inference for multi-level video understanding? and 4) what is the representation learning strategy for efficient/cost-aware scene parsing? We discuss three multi-level video scene segmentation scenarios based on different aspects of scene properties and efficiency requirements. The first case addresses the problem of consistent geometric and semantic video segmentation for outdoor scenes. We propose a geometric scene layout representation, or a stage scene model, to efficiently capture the dependency between the semantic and geometric labels. We build a unified conditional random field for joint modeling of the semantic class, geometric label and the stage representation, and design an alternating inference algorithm to minimize the resulting energy function. The second case focuses on the problem of simultaneous pixel-level and object-level segmentation in videos. We propose to incorporate foreground object information into pixel labeling by jointly reasoning semantic labels of supervoxels, object instance tracks and geometric relations between objects. In order to model objects, we take an exemplar approach based on a small set of object annotations to generate a set of object proposals. We then design a conditional random field framework that jointly models the supervoxel labels and object instance segments. To scale up our method, we develop an active inference strategy to improve the efficiency of multi-level video parsing, which adaptively selects an informative subset of object proposals and performs inference on the resulting compact model. The last case explores the problem of learning a flexible representation for efficient scene labeling. We propose a dynamic hierarchical model that allows us to achieve flexible trade-offs between efficiency and accuracy. Our approach incorporates the cost of feature computation and model inference, and optimizes the model performance for any given test-time budget. We evaluate all our methods on several publicly available video and image semantic segmentation datasets, and demonstrate superior performance in efficiency and accuracy. Keywords: Semantic video segmentation, Multi-level scene understanding, Efficient inference, Cost-aware scene parsing
APA, Harvard, Vancouver, ISO, and other styles
