Dissertations / Theses on the topic 'Learning with noisy labels'

Author: Grafiati

Published: 6 September 2023

Consult the top 50 dissertations / theses for your research on the topic 'Learning with noisy labels.'

1

Yu, Xiyu. "Learning with Biased and Noisy Labels." Thesis, The University of Sydney, 2019. http://hdl.handle.net/2123/20125.

Abstract:

Recent advances in Artificial Intelligence (AI) have been built on large-scale datasets. These advances often come with increasing demands on labeling, which is expensive and time-consuming. Therefore, AI tends to develop higher-level, human-like intelligence that captures knowledge from cheap but weak supervision such as mislabeled data. However, current AI suffers from severely degraded performance on noisily labeled data. Thus, there is a compelling demand for novel algorithms that enable AI to learn from noisy labels. Label noise methods such as robust loss functions assume that a fraction of the data is correctly labeled to ensure effective learning. When all labels are incorrect, they often fail due to severe bias and noise. Here, we consider one kind of incorrect label, the complementary label, which specifies a class that an example does not belong to. We propose a general method to modify loss functions such that the classifier learned from biased complementary labels can be identical to the optimal one learned from true labels. Another challenge in label noise is the shift between the distributions of training (source) and test (target) data. Existing methods often ignore these changes and cannot learn transferable knowledge across domains. Therefore, we propose a novel Denoising Conditional Invariant Component framework which provably ensures identification of invariant representations and of the label distribution of the target data, given examples with noisy labels in the source domain and unlabeled examples in the target domain. Finally, we study how to estimate the noise rates in label noise. Previous methods deliver promising results but rely on strong assumptions. We show that noise rate estimation is essentially a mixture proportion estimation problem. We also prove that noise rates can be uniquely identified and efficiently obtained under a weaker linear independence assumption.

2

Caye, Daudt Rodrigo. "Convolutional neural networks for change analysis in earth observation images with noisy labels and domain shifts." Electronic Thesis or Diss., Institut polytechnique de Paris, 2020. http://www.theses.fr/2020IPPAT033.

Abstract:

The analysis of satellite and aerial Earth observation images allows us to obtain precise information over large areas. A multitemporal analysis of such images is necessary to understand the evolution of these areas. In this thesis, convolutional neural networks are used to detect and understand changes using remote sensing images from various sources in supervised and weakly supervised settings. Siamese architectures are used to compare coregistered image pairs and to identify changed pixels. The proposed method is then extended into a multitask network architecture that is used to detect changes and perform land cover mapping simultaneously, which permits a semantic understanding of the detected changes. Then, classification filtering and a novel guided anisotropic diffusion algorithm are used to reduce the effect of biased label noise, which is a recurring concern for automatically generated large-scale datasets. Weakly supervised learning is also used to perform pixel-level change detection with only image-level supervision, through the use of class activation maps and a novel spatial attention layer. Finally, a domain adaptation method based on adversarial training is proposed, which succeeds in projecting images from different domains into a common latent space where a given task can be performed. This method is tested not only for domain adaptation for change detection, but also for image classification and semantic segmentation, which demonstrates its versatility.

3

Fang, Tongtong. "Learning from noisy labels by importance reweighting: a deep learning approach." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-264125.

Abstract:

Noisy labels can severely degrade classification performance. Deep neural networks in particular can memorize noisy labels, which leads to poor generalization. Recently, label-noise-robust deep learning has outperformed traditional shallow learning approaches in handling complex input data without prior knowledge of how the label noise is generated. Learning from noisy labels by importance reweighting is well studied. Existing deep learning work in this line failed to provide a reasonable importance reweighting criterion and thus obtained undesirable experimental performance. Targeting this knowledge gap and inspired by domain adaptation, we propose a novel label-noise-robust deep learning approach based on importance reweighting. Noisy labeled training examples are weighted by minimizing the maximum mean discrepancy between the loss distributions of noisy labeled and clean labeled data. In experiments, the proposed approach outperforms other baselines. The results show considerable research potential in applying domain adaptation to the label noise problem by bridging the two areas. Moreover, the proposed approach potentially motivates other interesting problems in domain adaptation by enabling importance reweighting to be used in deep learning.
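
The reweighting criterion in this abstract can be illustrated with a small sketch. It is not the thesis's implementation: the Gaussian kernel, the bandwidth, the function names, and the toy loss values are all assumptions made for illustration. It computes a (biased) estimate of the squared maximum mean discrepancy between weighted losses of noisy labeled examples and losses of clean labeled examples, and shows that moving weight onto noisy examples whose losses resemble the clean ones lowers the discrepancy.

```python
import math

def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalar loss values."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))

def weighted_mmd2(noisy_losses, clean_losses, weights=None, bandwidth=1.0):
    """Biased estimate of squared MMD between weighted noisy-label losses
    and clean-label losses. Weights are normalised to sum to 1."""
    n, m = len(noisy_losses), len(clean_losses)
    if weights is None:
        weights = [1.0] * n  # uniform weighting by default
    total = sum(weights)
    w = [wi / total for wi in weights]
    # Kernel mean within the weighted noisy sample.
    xx = sum(w[i] * w[j] * gaussian_kernel(noisy_losses[i], noisy_losses[j], bandwidth)
             for i in range(n) for j in range(n))
    # Kernel mean within the clean sample.
    yy = sum(gaussian_kernel(a, b, bandwidth)
             for a in clean_losses for b in clean_losses) / (m * m)
    # Cross term between the two samples.
    xy = sum(w[i] * gaussian_kernel(noisy_losses[i], b, bandwidth)
             for i in range(n) for b in clean_losses) / m
    return xx + yy - 2.0 * xy

# Hypothetical per-example losses: two noisy examples look clean, two do not.
noisy = [0.1, 0.2, 2.0, 2.5]
clean = [0.1, 0.15, 0.2]
uniform = weighted_mmd2(noisy, clean)
reweighted = weighted_mmd2(noisy, clean, weights=[1.0, 1.0, 0.0, 0.0])
print(uniform, reweighted)
```

In a training loop the weights would be optimised rather than set by hand; the hand-picked weights here only show the direction in which the criterion pushes.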

4

Ainapure, Abhijeet Narhar. "Application and Performance Enhancement of Intelligent Cross-Domain Fault Diagnosis in Rotating Machinery." University of Cincinnati / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1623164772153736.

5

Chan, Jeffrey (Jeffrey D.). "On boosting and noisy labels." Thesis, Massachusetts Institute of Technology, 2015. http://hdl.handle.net/1721.1/100297.

Abstract:

Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2015.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 53-56).
Boosting is a machine learning technique widely used across many disciplines. Boosting enables one to learn from labeled data in order to predict the labels of unlabeled data. A central property of boosting, instrumental to its popularity, is its resistance to overfitting. Previous experiments provide a margin-based explanation for this resistance to overfitting. In this thesis, the main finding is that boosting's resistance to overfitting can be understood in terms of how it handles noisy (mislabeled) points. Confirming evidence emerged from experiments using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, which is commonly used in machine learning experiments. A majority vote ensemble filter identified, on average, 2.5% of the points in the dataset as noisy. The experiments chiefly investigated boosting's treatment of noisy points from a volume-based perspective. While the cell volume surrounding noisy points did not show a significant difference from other points, the decision volume surrounding noisy points was two to three times smaller than that of non-noisy points. Additional findings showed that decision volume not only provides insight into boosting's resistance to overfitting in the context of noisy points, but also serves as a suitable metric for identifying which points in a dataset are likely to be mislabeled.
by Jeffrey Chan.
M. Eng.
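
A majority vote ensemble filter of the kind mentioned in this abstract can be sketched as follows. The abstract does not say which classifiers made up the ensemble, so this sketch assumes three leave-one-out k-nearest-neighbour classifiers on one-dimensional toy data; the function names and the data are hypothetical. A point is flagged as possibly mislabeled when more than half of the ensemble members predict a label different from the one it was given.

```python
from collections import Counter

def knn_predict_loo(xs, labels, index, k):
    """Predict the label of xs[index] from its k nearest other points
    (leave-one-out: the point itself never votes)."""
    others = [j for j in range(len(xs)) if j != index]
    others.sort(key=lambda j: abs(xs[j] - xs[index]))
    votes = Counter(labels[j] for j in others[:k])
    return votes.most_common(1)[0][0]

def majority_vote_filter(xs, labels, ks=(1, 3, 5)):
    """Return indices whose given label is contradicted by a majority
    of the ensemble members (one k-NN classifier per value in ks)."""
    flagged = []
    for i in range(len(xs)):
        disagreements = sum(
            knn_predict_loo(xs, labels, i, k) != labels[i] for k in ks
        )
        if disagreements > len(ks) / 2:
            flagged.append(i)
    return flagged

# Toy data: two well-separated groups, with the point at x=2 given the wrong label.
xs = [0, 1, 2, 3, 4, 10, 11, 12, 13, 14]
labels = [0, 0, 1, 0, 0, 1, 1, 1, 1, 1]
suspects = majority_vote_filter(xs, labels)
print(suspects)
```

Requiring a majority rather than a single dissenting classifier keeps correctly labeled neighbours of a noisy point from being flagged along with it.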

6

Almansour, Amal. "Credibility assessment for Arabic micro-blogs using noisy labels." Thesis, King's College London (University of London), 2016. https://kclpure.kcl.ac.uk/portal/en/theses/credibility-assessment-for-arabic-microblogs-using-noisy-labels(6baf983a-940d-4c2c-8821-e992348b4097).html.

Abstract:

Due to their openness and low publishing barrier, User-Generated Content (UGC) platforms facilitate the creation of huge amounts of data, which contain a substantial quantity of inaccurate content. The presence of misleading, questionable and inaccurate content may have detrimental effects on people's beliefs and decision-making and may create a public disturbance. Consequently, there is a significant need to evaluate information coming from UGC platforms in order to differentiate credible information from misinformation and rumours. In this thesis, we present the need for research on the credibility of online Arabic information and argue that extending the existing automated credibility assessment approaches with an extra step that evaluates the labellers will lead to a more robust dataset for building the credibility classification model. This research focuses on modelling the credibility of Arabic information when judges disagree on credibility scores and the ground truth of credibility information is not absolute. First, in order to achieve the stated goal, this study employs crowdsourcing, whereby users can explicitly express their opinions about the credibility of a set of tweet messages. This information, coupled with data about the tweets' features, enables us to identify the message features that are used most often in determining information credibility levels. Then experiments based on both statistical analysis of the features' distributions and machine learning methods are performed to predict and classify messages' credibility levels. A novel credibility assessment model which integrates the labellers' reliability weights is proposed for deriving the credibility labels of the messages in the training and testing datasets. This credibility model primarily uses similarity and accuracy rating measurements to determine the weighting of labellers.
In order to evaluate the proposed model, we compare the labelling obtained from the expert labellers with that from the weighted crowd labellers. Empirical evidence suggests that, measured against the experts' ratings, the credibility model is superior to the commonly used majority voting baseline. The observed experimental results exhibit a reduction of the effect of unreliable labellers' credibility judgments and a moderate enhancement of the credibility classification results.
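
The difference between majority voting and reliability-weighted voting can be shown with a minimal sketch. The labeller names, the labels, and the reliability weights below are made up for illustration; in the thesis the weights are derived from similarity and accuracy rating measurements, which this sketch does not attempt to reproduce.

```python
from collections import Counter, defaultdict

def majority_vote(votes):
    """Pick the label chosen by the largest number of labellers."""
    return Counter(votes.values()).most_common(1)[0][0]

def weighted_vote(votes, reliability):
    """Pick the label with the largest total reliability weight."""
    scores = defaultdict(float)
    for labeller, label in votes.items():
        scores[label] += reliability[labeller]
    return max(scores, key=scores.get)

# Hypothetical judgments for one tweet and hypothetical reliability weights.
votes = {
    "labeller_a": "credible",
    "labeller_b": "not credible",
    "labeller_c": "not credible",
}
reliability = {"labeller_a": 0.9, "labeller_b": 0.3, "labeller_c": 0.4}

print(majority_vote(votes))               # two of three labellers say "not credible"
print(weighted_vote(votes, reliability))  # 0.9 for "credible" vs 0.7 for "not credible"
```

With these numbers the two rules disagree: one reliable labeller outweighs two unreliable ones, which is the effect the weighted model is meant to capture.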

7

Northcutt, Curtis George. "Classification with noisy labels : "Multiple Account" cheating detection in Open Online Courses." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/111870.

Abstract:

Thesis: S.M., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 113-122).
Massive Open Online Courses (MOOCs) have the potential to enhance socioeconomic mobility through education. Yet, the viability of this outcome largely depends on the reputation of MOOC certificates as a credible academic credential. I describe a cheating strategy that threatens this reputation and holds the potential to render the MOOC certificate valueless. The strategy, Copying Answers using Multiple Existences Online (CAMEO), involves a user who gathers solutions to assessment questions using one or more harvester accounts and then submits correct answers using one or more separate master accounts. To estimate a lower bound for CAMEO prevalence among 1.9 million course participants in 115 HarvardX and MITx courses, I introduce a filter-based CAMEO detection algorithm and use a small-scale experiment to verify CAMEO use with certainty. I identify preventive strategies that can decrease CAMEO rates and show evidence of their effectiveness in science courses. Because the CAMEO algorithm functions as a lower bound estimate, it fails to detect many CAMEO cheaters. As a novelty of this thesis, instead of improving the shortcomings of the CAMEO algorithm directly, I recognize that we can think of the CAMEO algorithm as a method for producing noisy predicted cheating labels. Then a solution to the more general problem of binary classification with noisy labels (P̃Ñ learning) is a solution to CAMEO cheating detection. P̃Ñ learning is the problem of binary classification when training examples may be mislabeled (flipped) uniformly with noise rate ρ1 for positive examples and ρ0 for negative examples. I propose Rank Pruning to solve P̃Ñ learning and the open problem of estimating the noise rates. Unlike prior solutions, Rank Pruning is efficient and general, requiring O(T) time for any unrestricted choice of probabilistic classifier with fitting time T.
I prove that Rank Pruning achieves consistent noise estimation and the same expected risk as learning with uncorrupted labels in ideal conditions, and derive closed-form solutions when conditions are non-ideal. Rank Pruning achieves state-of-the-art noise rate estimation and F1, error, and AUC-PR on the MNIST and CIFAR datasets, regardless of noise rates. To highlight, Rank Pruning with a CNN classifier can predict whether an MNIST digit is a one or not a one with only 0.25% error, and 0.46% error across all digits, even when 50% of positive examples are mislabeled and 50% of observed positive labels are mislabeled negative examples. Rank Pruning achieves similarly impressive results when as many as 50% of training examples are actually just noise drawn from a third distribution. Together, the CAMEO and Rank Pruning algorithms allow for a robust, general, and time-efficient solution to the CAMEO cheating detection problem. By ensuring the validity of MOOC credentials, we enable MOOCs to achieve both openness and value, and thus take one step closer to the greater goal of democratization of education.
by Curtis George Northcutt.
S.M.
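
The noise model behind binary classification with noisy labels can be made concrete with a short simulation. This is only a sketch of the label-flipping process described in the abstract, not of Rank Pruning itself. The names `rho1` and `rho0` are assumed here to denote the probability that a positive example is observed as negative and that a negative example is observed as positive, respectively; the function name and the seed parameter are hypothetical.

```python
import random

def flip_labels(labels, rho1, rho0, seed=0):
    """Return noisy binary labels: each positive (1) is flipped with
    probability rho1 and each negative (0) with probability rho0."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    noisy = []
    for y in labels:
        flip_prob = rho1 if y == 1 else rho0
        noisy.append(1 - y if rng.random() < flip_prob else y)
    return noisy

true_labels = [1] * 1000 + [0] * 1000
observed = flip_labels(true_labels, rho1=0.5, rho0=0.1)

# Empirical flip rates for each class, which should be close to rho1 and rho0.
flipped_pos = sum(1 for y, s in zip(true_labels, observed) if y == 1 and s == 0)
flipped_neg = sum(1 for y, s in zip(true_labels, observed) if y == 0 and s == 1)
print(flipped_pos / 1000, flipped_neg / 1000)
```

A learner in this setting sees only `observed`, never `true_labels`, and must estimate the two rates before it can correct for them.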

8

Ekambaram, Rajmadhan. "Active Cleaning of Label Noise Using Support Vector Machines." Scholar Commons, 2017. http://scholarcommons.usf.edu/etd/6830.

Abstract:

Large scale datasets collected using non-expert labelers are prone to labeling errors. Errors in the given labels or label noise affect the classifier performance, classifier complexity, class proportions, etc. It may be that a relatively small, but important class needs to have all its examples identified. Typical solutions to the label noise problem involve creating classifiers that are robust or tolerant to errors in the labels, or removing the suspected examples using machine learning algorithms. Finding the label noise examples through a manual review process is largely unexplored due to the cost and time factors involved. Nevertheless, we believe it is the only way to create a label noise free dataset. This dissertation proposes a solution exploiting the characteristics of the Support Vector Machine (SVM) classifier and the sparsity of its solution representation to identify uniform random label noise examples in a dataset. Application of this method is illustrated with problems involving two real-world large scale datasets. This dissertation also presents results for datasets that contain adversarial label noise. A simple extension of this method to a semi-supervised learning approach is also presented. The results show that most mislabels are quickly and effectively identified by the approaches developed in this dissertation.

9

Balasubramanian, Krishnakumar. "Learning without labels and nonnegative tensor factorization." Thesis, Georgia Institute of Technology, 2010. http://hdl.handle.net/1853/33926.

Abstract:

Supervised learning tasks, such as building a classifier or estimating the error rate of predictors, are typically performed with labeled data. In most cases, obtaining labeled data is costly as it requires manual labeling. On the other hand, unlabeled data is available in abundance. In this thesis, we discuss methods to perform supervised learning tasks with no labeled data. We prove consistency of the proposed methods and demonstrate their applicability with synthetic and real-world experiments. In some cases, small quantities of labeled data may be easily available and can be supplemented with large quantities of unlabeled data (semi-supervised learning). We derive the asymptotic efficiency of generative models for semi-supervised learning and quantify the effect of labeled and unlabeled data on the quality of the estimate. Another independent track of the thesis is efficient computational methods for nonnegative tensor factorization (NTF). NTF provides the user with rich modeling capabilities but comes with an added computational cost. We provide a fast algorithm for performing NTF using a modified active set method called the block principal pivoting method and demonstrate its applicability to social network analysis and text mining.

10

Nugyen, Duc Tam [Verfasser], and Thomas [Akademischer Betreuer] Brox. "Robust deep learning for computer vision to counteract data scarcity and label noise." Freiburg : Universität, 2020. http://d-nb.info/1226657060/34.

11

Fonseca, Eduardo. "Training sound event classifiers using different types of supervision." Doctoral thesis, Universitat Pompeu Fabra, 2021. http://hdl.handle.net/10803/673067.

Abstract:

The automatic recognition of sound events has gained attention in the past few years, motivated by emerging applications in fields such as healthcare, smart homes, or urban planning. When the work for this thesis started, research on sound event classification was mainly focused on supervised learning using small datasets, often carefully annotated with vocabularies limited to specific domains (e.g., urban or domestic). However, such small datasets do not support training classifiers able to recognize hundreds of sound events occurring in our everyday environment, such as kettle whistles, bird tweets, cars passing by, or different types of alarms. At the same time, large amounts of environmental sound data are hosted in websites such as Freesound or YouTube, which can be convenient for training large-vocabulary classifiers, particularly using data-hungry deep learning approaches. To advance the state-of-the-art in sound event classification, this thesis investigates several strands of dataset creation as well as supervised and unsupervised learning to train large-vocabulary sound event classifiers, using different types of supervision in novel and alternative ways. Specifically, we focus on supervised learning using clean and noisy labels, as well as self-supervised representation learning from unlabeled data. The first part of this thesis focuses on the creation of FSD50K, a large-vocabulary dataset with over 100h of audio manually labeled using 200 classes of sound events. We provide a detailed description of the creation process and a comprehensive characterization of the dataset. In addition, we explore architectural modifications to increase shift invariance in CNNs, improving robustness to time/frequency shifts in input spectrograms. In the second part, we focus on training sound event classifiers using noisy labels. First, we propose a dataset that supports the investigation of real label noise. 
Then, we explore network-agnostic approaches to mitigate the effect of label noise during training, including regularization techniques, noise-robust loss functions, and strategies to reject noisy labeled examples. Further, we develop a teacher-student framework to address the problem of missing labels in sound event datasets. In the third part, we propose algorithms to learn audio representations from unlabeled data. In particular, we develop self-supervised contrastive learning frameworks, where representations are learned by comparing pairs of examples computed via data augmentation and automatic sound separation methods. Finally, we report on the organization of two DCASE Challenge Tasks on automatic audio tagging with noisy labels. By providing data resources as well as state-of-the-art approaches and audio representations, this thesis contributes to the advancement of open sound event research, and to the transition from traditional supervised learning using clean labels to other learning strategies less dependent on costly annotation efforts.

12

Louche, Ugo. "From confusion noise to active learning : playing on label availability in linear classification problems." Thesis, Aix-Marseille, 2016. http://www.theses.fr/2016AIXM4025/document.

Abstract:

The work presented in this thesis falls within the general framework of linear classification, that is, the problem of categorizing data into two or more classes based on a training set of previously labelled examples. In practice, though, acquiring labelled examples can prove challenging and/or costly, as data are inherently easier to obtain than to label. This disparity between the availability of data and our ability to build a labelled training set has been one of the central problems of machine learning, and this work discusses two settings usually considered to work around it: learning in the presence of noisy data and active learning.

13

Akavia, Adi. "Learning noisy characters, multiplication codes, and cryptographic hardcore predicates." Thesis, Massachusetts Institute of Technology, 2008. http://hdl.handle.net/1721.1/43032.

Abstract:

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008.
Includes bibliographical references (p. 181-187).
We present results in cryptography, coding theory and sublinear algorithms. In cryptography, we introduce a unifying framework for proving that a Boolean predicate is hardcore for a one-way function and apply it to a broad family of functions and predicates, showing new hardcore predicates for well known one-way function candidates such as RSA and discrete-log as well as reproving old results in an entirely different way. Our proof framework extends the list-decoding method of Goldreich and Levin [38] for showing hardcore predicates, by introducing a new class of error correcting codes and new list-decoding algorithm we develop for these codes. In coding theory, we introduce a novel class of error correcting codes that we name: Multiplication codes (MPC). We develop decoding algorithms for MPC codes, showing they achieve desirable combinatorial and algorithmic properties, including: (1) binary MPC of constant distance and exponential encoding length for which we provide efficient local list decoding and local self correcting algorithms; (2) binary MPC of constant distance and polynomial encoding length for which we provide efficient decoding algorithm in random noise model; (3) binary MPC of constant rate and distance. MPC codes are unique in particular in achieving properties as above while having a large group as their underlying algebraic structure. In sublinear algorithms, we present the SFT algorithm for finding the sparse Fourier approximation of complex multi-dimensional signals in time logarithmic in the signal length. We also present additional algorithms for related settings, differing in the model by which the input signal is given, in the considered approximation measure, and in the class of addressed signals. The sublinear algorithms we present are central components in achieving our results in cryptography and coding theory.
Reaching beyond theoretical computer science, we suggest employing our algorithms as tools for performance enhancement in data-intensive applications; in particular, we suggest replacing the O(N log N)-time FFT algorithm with our Θ̃(log N)-time SFT algorithm in settings where a sparse approximation suffices.
by Adi Akavia.
Ph.D.

14

Zhao, Yan. "Deep learning methods for reverberant and noisy speech enhancement." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1593462119759348.

15

CAPPOZZO, ANDREA. "Robust model-based classification and clustering: advances in learning from contaminated datasets." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2020. http://hdl.handle.net/10281/262919.

Abstract:

At the time of writing, an ever-increasing amount of data is collected every day, with its volume estimated to be doubling every two years. Thanks to technological advancements, datasets are becoming massive in size and substantially more complex in nature. Nevertheless, this abundance of "raw information" comes at a price: wrong measurements, data-entry errors, breakdowns of automatic collection systems and several other causes may ultimately undermine the overall data quality. Robust methods therefore have a central role in properly converting contaminated "raw information" into trustworthy knowledge: a primary goal of any statistical analysis. This thesis presents novel methodologies for performing reliable inference, within the model-based classification and clustering framework, in the presence of contaminated data. First, we propose a robust modification to a family of semi-supervised patterned models, for accomplishing classification when dealing with both class and attribute noise. Second, we develop a discriminant analysis method for anomaly and novelty detection, with the final aim of discovering label noise, outliers and unobserved classes in an unlabelled dataset. Third, we introduce two robust variable selection methods that effectively perform high-dimensional discrimination within an adulterated scenario.

16

Kim, Seungyeon. "Novel document representations based on labels and sequential information." Diss., Georgia Institute of Technology, 2015. http://hdl.handle.net/1853/53946.

Abstract:

A wide variety of text analysis applications are based on statistical machine learning techniques. The success of those applications is critically affected by how we represent a document. Learning an efficient document representation poses two major challenges: sparsity and sequentiality. Sparsity often causes high estimation error, and the sequential nature of text, that is, the interdependency between words, complicates matters further. This thesis presents novel document representations to overcome the two challenges. First, I employ label characteristics to estimate a compact document representation. Because label attributes implicitly describe the geometry of a dense subspace that has substantial impact, I can effectively resolve the sparsity issue while focusing only on the compact subspace. Second, by modeling a document as a joint or conditional distribution between words and their sequential information, I can efficiently reflect the sequential nature of text in my document representations. Lastly, the thesis concludes with a document representation that employs both labels and sequential information in a unified formulation. The following four criteria are used to evaluate the quality of representations: how close a representation is to its original data, how strongly representations can be distinguished from each other, how easily a human can interpret a representation, and how much computational effort a representation requires. In pursuing these criteria, I obtained document representations that are closer to the original data, stronger in discrimination, and easier to understand than traditional document representations. Efficient computation algorithms make the proposed approaches largely scalable. This thesis examines emotion prediction, temporal emotion analysis, modeling documents with edit histories, locally coherent topic modeling, and text categorization tasks as possible applications.

17

He, Jin. "Robust Mote-Scale Classification of Noisy Data via Machine Learning." The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1440413201.

18

Moon, Taesup. "Learning from noisy data with applications to filtering and denoising /." May be available electronically:, 2008. http://proquest.umi.com/login?COPT=REJTPTU1MTUmSU5UPTAmVkVSPTI=&clientId=12498.

19

Du, Yuxuan. "The Power of Quantum Neural Networks in The Noisy Intermediate-Scale Quantum Era." Thesis, The University of Sydney, 2021. https://hdl.handle.net/2123/24976.

Abstract:

Machine learning (ML) has revolutionized the world in recent years. Despite the success, the huge computational overhead required by ML models makes them approach the limits of Moore’s law. Quantum machine learning (QML) is a promising way to overcome this issue, empowered by Google's demonstration of quantum computational supremacy. Meanwhile, another cornerstone in QML is validating that quantum neural networks (QNNs) implemented on noisy intermediate-scale quantum (NISQ) chips can accomplish classification and image generation tasks. Despite the experimental progress, little is known about the theoretical advances of QNNs. In this thesis, we explore the power of QNNs to fill this knowledge gap. First, we consider the potential advantages of QNNs in generative learning. We demonstrate that QNNs possess a stronger expressive power than classical neural networks as measured by computational complexity and entanglement entropy. Moreover, we employ QNNs to tackle synthetic generation tasks with state-of-the-art performance. Next, we propose a Grover-search based quantum classifier, which can tackle specific classification tasks with quadratic runtime speedups. Furthermore, we show that the proposed scheme allows batch gradient descent optimization, unlike previous studies. This property is crucial for training on large-scale datasets. Then, we study the capabilities and limitations of QNNs from the viewpoint of optimization theory and learning theory. The achieved results imply that large system noise can destroy the trainability of QNNs. Meanwhile, we show that QNNs can tackle parity learning and juntas learning with provable advantages. Last, we devise a quantum auto-ML scheme to enhance the trainability of QNNs under the NISQ setting. The achieved results indicate that our proposal effectively mitigates system noise and alleviates barren plateaus for both conventional machine learning and quantum chemistry tasks.

20

Vafaie, Parsa. "Learning in the Presence of Skew and Missing Labels Through Online Ensembles and Meta-reinforcement Learning." Thesis, Université d'Ottawa / University of Ottawa, 2021. http://hdl.handle.net/10393/42636.

Abstract:

Data streams are large sequences of data, possibly endless and temporally ordered, that are commonplace in Internet of Things (IoT) applications such as intrusion detection in computer networking, fraud detection in financial institutions, real-time tumor tracking in radiotherapy and social media analysis. Algorithms learning from such streams need to be able to construct near real-time models that continuously adapt to potential changes in patterns, in order to retain high performance throughout the stream. It follows that there are numerous challenges involved in supervised learning (or so-called classification) in such environments. One of the challenges in learning from streams is multi-class imbalance, in which the rates of instances in the different class labels differ substantially. Notably, classification algorithms may become biased towards the classes with more frequent instances, sacrificing the performance of the less frequent or so-called minority classes. Further, minority instances often arrive infrequently and in bursts, making accurate model construction problematic. For example, network intrusion detection systems must be able to distinguish between normal traffic and multiple minority classes corresponding to a variety of different types of attacks. Further, having labels for all instances is often infeasible, since labels might be missing or arrive late. For instance, when learning from a stream for the task of detecting network intrusions, the true label for all instances might not be available, or it might take time until the label is made available, especially for new types of attacks. In this thesis, we contribute to the advancement of online learning from evolving streams by focusing on the above-mentioned areas of multi-class imbalance and missing labels. First, we introduce a multi-class online ensemble algorithm designed to maintain a balanced performance over all classes. Specifically, our approach samples instances with replacement while dynamically increasing the weights of under-represented classes, in order to produce models that benefit all classes. Our experimental results show that our online ensemble method performs well against multi-class imbalanced data in various datasets. We further continue our study by introducing an approach to dealing with missing labels that utilizes both labelled and unlabelled data to increase a model’s performance. That is, our method utilizes labelled data for pseudo-labelling unlabelled instances, allowing the model to perform better in environments where labels are scarce. More specifically, our approach features a meta-reinforcement learning agent, trained on multiple-source streams, that can effectively select the prediction of a K nearest neighbours (K-NN) classifier as the label for unlabelled instances. Extensive experiments on benchmark datasets demonstrate the value and effectiveness of our approach and confirm that our method outperforms the state of the art.
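The sampling step described in this abstract can be illustrated with a small sketch. It is not the thesis's algorithm: it assumes the familiar online-bagging device of presenting each instance to each ensemble member a Poisson-distributed number of times, and it assumes a hypothetical weighting rule (`class_rate`) in which the Poisson rate grows as a class becomes rarer. The names `poisson`, `class_rate` and `replication_counts` are invented for illustration.

```python
import math
import random
from collections import Counter


def poisson(lam, rng):
    """Draw from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def class_rate(counts, label):
    """Hypothetical weighting: largest class count divided by this class's count."""
    return max(counts.values()) / counts[label]


def replication_counts(stream, n_members, seed=0):
    """For each labelled instance, decide how many times each member trains on it."""
    rng = random.Random(seed)
    counts = Counter()
    plan = []
    for x, y in stream:
        counts[y] += 1                      # class frequencies seen so far
        lam = class_rate(counts, y)         # rarer class -> larger Poisson rate
        plan.append((x, y, [poisson(lam, rng) for _ in range(n_members)]))
    return plan
```

Calling `replication_counts(stream, n_members=10)` on an iterable of `(features, label)` pairs returns, per instance, one repetition count per ensemble member; a member would then update its model that many times on the instance. Minority-class instances receive larger counts on average, which is one way to read "dynamically increasing the weights of under-represented classes".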

21

Brodin, Johan. "Working with emotions : Recommending subjective labels to music tracks using machine learning." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-199278.

Abstract:

Curated music collection is a growing field as a result of the freedom and supply that streaming music services like Spotify provide us with. To be able to categorize music tracks based on subjective core values in a scalable manner, this thesis has explored whether recommending such labels is possible through machine learning. In an analysis of 2464 tracks, each tagged with one or more of 22 different core values, a profile was built for each track from features in three different categories: editorial, cultural and acoustic. When classifying the tracks into core values, different methods of multi-label classification were explored. By combining five different transformation approaches with three base classifiers and using two algorithm adaptations, a total of 17 different configurations were constructed. The different configurations were evaluated with multiple measurements including (but not limited to) Hamming Loss, Ranking Loss, One error, F1 score, exact match and both training and testing time. The results showed that the problem transformation algorithm Label Powerset together with Sequential minimal optimization outperformed the other configurations. We also found promising results for neural networks, something that should be investigated further in the future.
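Two of the terms in this abstract, the Label Powerset transformation and Hamming Loss, can be shown in a few lines. The sketch below is a generic illustration and not the thesis's code; the label names in the usage note are hypothetical. Label Powerset turns a multi-label problem into a single-label one by treating every distinct combination of labels as its own class, and Hamming Loss is the fraction of label slots predicted incorrectly.

```python
def label_powerset(label_sets):
    """Map each distinct label combination to one class id (first seen = 0)."""
    class_of = {}
    encoded = []
    for labels in label_sets:
        key = frozenset(labels)            # order of labels does not matter
        if key not in class_of:
            class_of[key] = len(class_of)
        encoded.append(class_of[key])
    return encoded, class_of


def hamming_loss(true_sets, predicted_sets, n_labels):
    """Fraction of label slots that differ, averaged over all tracks."""
    wrong = sum(len(set(t) ^ set(p)) for t, p in zip(true_sets, predicted_sets))
    return wrong / (len(true_sets) * n_labels)
```

For example, `label_powerset([{"calm", "warm"}, {"bold"}, {"warm", "calm"}])` assigns the first and third tracks the same class, after which any ordinary single-label classifier can be trained on the encoded targets. With 22 core values, `n_labels` would be 22.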

22

Schaeffer, Laura M. "Interaction of instructional material order and subgoal labels on learning in programming." Thesis, Georgia Institute of Technology, 2015. http://hdl.handle.net/1853/54459.

Abstract:

Expository instructions, worked examples, and subgoal labels have all been shown to positively impact student learning and performance in computer science education. This study examined whether learning and problem solving performance differed based on the sequence of the instructional materials (expository and worked examples) and the presence of subgoal labels within the instructional materials. Participants were 138 undergraduate college students, aged 17-25, who watched two instructional videos on creating an application in the App Inventor programming language before completing several learning assessments. A significant interaction showed that when learners were presented with the worked example followed by the expository instructions containing subgoal labels, they were better at outlining the procedure for creating an application. These manipulations did not affect cognitive load, novel problem solving performance, explanations of solutions, or the amount of time spent on instructions and completing the assessments. These results suggest that the order in which instructional materials are presented has little impact on problem solving, although some benefit can be gained from presenting the worked example before the expository instructions when subgoal labels are included. This suggests the order in which the instructions are presented to learners does not impact learning. Previous studies demonstrating an effect of subgoal labels used text instructions as opposed to the video instructions used in the present study. Future research should investigate how these manipulations differ for text instructions and video instructions.

23

Hsu, Wei-Ning. "Speech processing with less supervision: learning from weak labels and multiple modalities." Thesis, Massachusetts Institute of Technology, 2020. https://hdl.handle.net/1721.1/127021.

Abstract:

Thesis: Ph. D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, May, 2020
Cataloged from the official PDF of thesis.
Includes bibliographical references (pages 191-217).
In recent years, supervised learning has achieved great success in speech processing with powerful neural network models and vast quantities of in-domain labeled data. However, collecting a labeled dataset covering all domains can be either expensive due to the diversity of speech or almost impossible for some tasks such as speech-to-speech translation. Such a paradigm limits the applicability of speech technologies to high-resource settings. In sharp contrast, humans are good at reading the training signals from indirect supervision, such as from a small number of explicit labels and from different modalities. This capability enables humans to learn from a wider variety of resources, including better domain coverage. In light of this observation, this thesis focuses on learning algorithms for speech processing that can utilize weak and indirect supervision to overcome the restrictions imposed by the supervised paradigm and make the most out of the data at hand for learning.
In the first part of the thesis, we devise a self-training algorithm for speech recognition that distills knowledge from a trained language model, a compact form of external non-speech prior knowledge. The algorithm is inspired by how humans use contextual and prior information to bias speech recognition and produce confident predictions. To distill knowledge within the language model, we implement a beam-search based objective to align the prediction probability with the likelihood of the language model among candidate hypotheses. Experimental results demonstrate state-of-the-art performance that recovers word error rates by up to 90% relative to using the same data with ground truth transcripts. Moreover, we show that the proposed algorithm can scale to 60,000 hours of unlabeled speech and yield further reduction in word error rates.
In the second part of the thesis, we present several text-to-speech synthesis models that enable fine-grained control of unlabeled non-textual attributes, including voice, prosody, acoustic environment properties and microphone channel effects. We achieve controllability of unlabeled attributes by formulating a text-to-speech system as a generative model with structured latent variables, and learn this generative process along with an efficient approximate inference model by adopting the variational autoencoder framework. We demonstrate that those latent variables can then be used to control the unlabeled variations in speech, making it possible to build a high-quality speech synthesis model using weakly-labeled mixed-quality speech data as the model learns to control the hidden factors.
In the last part of the thesis, we extend a cross-modal semantic embedding learning framework proposed in Harwath et al. (2019) to learn hierarchical discrete linguistic units from visually grounded speech, a form of multimodal sensory data. By utilizing a discriminative, multimodal grounding objective, the proposed framework forces the learned units to be useful for semantic image retrieval. In contrast, most previous work on linguistic unit discovery does not use multimodal data; it relies on a reconstruction objective that encourages the learned units to be useful for reconstructing the speech, and hence those units may also encode non-linguistic factors. Experimental results show that the proposed framework outperforms state-of-the-art phonetic unit discovery frameworks by almost 50% on the ZeroSpeech 2019 ABX phone discriminative task, and learns word detectors that discover over 270 words with an F1 score greater than 0.5. In addition, the learned units from the proposed framework are also more robust to nuisance variation compared to frameworks that learn from only speech.
by Wei-Ning Hsu.

24

Alt, Jonathan K. "Learning from Noisy and Delayed Rewards: The Value of Reinforcement Learning to Defense Modeling and Simulation." Thesis, Monterey, California. Naval Postgraduate School, 2012. http://hdl.handle.net/10945/17313.

Abstract:

Approved for public release; distribution is unlimited
Modeling and simulation of military operations requires human behavior models capable of learning from experience in complex environments where feedback on action quality is noisy and delayed. This research examines the potential of reinforcement learning, a class of AI learning algorithms, to address this need. A novel reinforcement learning algorithm that uses the exponentially weighted average reward as an action-value estimator is described. Empirical results indicate that this relatively straightforward approach improves learning speed in both benchmark environments and in challenging applied settings. Applications of reinforcement learning in the verification of the reward structure of a training simulation, the improvement in the performance of a discrete event simulation scheduling tool, and in enabling adaptive decision-making in combat simulation are presented. To place reinforcement learning within the context of broader models of human information processing, a practical cognitive architecture is developed and applied to the representation of a population within a conflict area. These varied applications and domains demonstrate that the potential for the use of reinforcement learning within modeling and simulation is great.
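The abstract does not give the update rule, so the following is only a textbook-style sketch of what "exponentially weighted average reward as an action-value estimator" usually means: each new reward moves the estimate a fixed fraction `alpha` of the way toward itself, so older rewards are discounted geometrically. The epsilon-greedy agent around it, and the names `ewa_update` and `estimate_action_values`, are assumptions made for illustration.

```python
import random


def ewa_update(q, reward, alpha=0.1):
    """Move the estimate a fraction alpha of the way toward the new reward."""
    return q + alpha * (reward - q)


def estimate_action_values(pull, n_actions, steps, alpha=0.1, epsilon=0.1, seed=0):
    """Epsilon-greedy agent whose action values are exponentially weighted averages."""
    rng = random.Random(seed)
    q = [0.0] * n_actions
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)                  # explore
        else:
            action = max(range(n_actions), key=lambda a: q[a])  # exploit
        q[action] = ewa_update(q[action], pull(action, rng), alpha)
    return q
```

A noisy environment can be simulated by passing, for instance, `pull=lambda a, rng: rng.gauss([0.2, 0.8][a], 1.0)`. Because the step size is constant, a reward received k updates ago carries weight `alpha * (1 - alpha) ** k`, which lets the estimate track rewards whose distribution drifts over time.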

25

Tabassum, Binte Jafar Jeniya. "Information Extraction From User Generated Noisy Texts." The Ohio State University, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=osu1606315356821532.

26

Balda, Cañizares Emilio Rafael [Verfasser], Rudolf [Akademischer Betreuer] Mathar, and Bastian [Akademischer Betreuer] Leibe. "Robustness analysis of deep neural networks in the presence of adversarial perturbations and noisy labels / Emilio Rafael Balda Canizares ; Rudolf Mathar, Bastian Leibe." Aachen : Universitätsbibliothek der RWTH Aachen, 2019. http://d-nb.info/1216040931/34.

27

Holland, Hans Mullinnix. "Treatment of Instance-Based Classifiers Containing Ambiguous Attributes and Class Labels." Scholarly Repository, 2007. http://scholarlyrepository.miami.edu/oa_theses/84.

Abstract:

The importance of attribute vector ambiguity has been largely overlooked by the machine learning community. A pattern recognition problem can be solved in many ways within the scope of machine learning. Neural networks, decision tree algorithms such as C4.5, Bayesian classifiers, and instance-based learning are the main algorithms. All of these fail to address ambiguity in the attribute vector. The research reported here shows that ignoring this ambiguity leads to problems of classifier scalability and issues with instance collection and aggregation. The algorithm presented accounts for ambiguity in both the attribute vector and the class label, thus solving both the scalability and the instance collection issues. The research also shows that when applied to sanitized data sets suitable for traditional instance-based learning, the presented algorithm performs equally well.

28

Khasgiwala, Anuj. "Word Recognition in Nutrition Labels with Convolutional Neural Network." DigitalCommons@USU, 2018. https://digitalcommons.usu.edu/etd/7101.

Abstract:

Nowadays, people are very busy trying to maintain a balance between work and family life, as working hours keep increasing. In such a hectic life, people either ignore a healthy diet or do not give it enough attention. An essential part of a healthy eating routine is awareness and tracking of nutritional data and an understanding of how different foods and nutritional constituents affect our bodies. In the USA and many other countries, nutritional information is primarily conveyed to consumers through nutrition labels (NLs), which can be found on all packaged food products in the form of a nutrition table. However, it can be challenging to use the information in these NLs, even for health-conscious consumers, as they may not be familiar with nutritional terms and may find it hard to integrate nutritional information into their daily activities because of a lack of time, motivation, or training. It is therefore essential to automate this information gathering and interpretation procedure by incorporating a machine-learning-based algorithm to extract nutritional information from NLs, because doing so enhances the consumer’s capacity to engage in continuous nutritional information gathering and analysis.

29

Jimenez, Blazquez Lara. "Mathematical Methods for Maritime Signal Curation in Noisy Environments." Thesis, Mälardalens högskola, Akademin för utbildning, kultur och kommunikation, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-43653.

Abstract:

QTAGG has designed a real-time autonomous system that continuously calculates an optimum propulsion plan controlling the engines and propellers of a vessel. The precision of the signals used is therefore very important, as any small error in a signal can produce incorrect control effects and cause critical damage to the equipment or passengers. This thesis describes the mathematics and implementation of a system to detect and correct disturbances in the data signals of a vessel. The system applies signal curation based on mathematical modelling and statistics, yielding clean data for use in QTAGG’s control system.

30

Qin, Zengchang. "Learning with fuzzy labels : a random set approach towards intelligent data mining systems." Thesis, University of Bristol, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.422575.

31

Mediani, Mohammed [Verfasser], and A. [Akademischer Betreuer] Waibel. "Learning from Noisy Data in Statistical Machine Translation / Mohammed Mediani ; Betreuer: A. Waibel." Karlsruhe : KIT-Bibliothek, 2017. http://d-nb.info/1137946598/34.

32

Williamson, Donald S. "DEEP LEARNING METHODS FOR IMPROVING THE PERCEPTUAL QUALITY OF NOISY AND REVERBERANT SPEECH." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1461018277.

33

Mariello, Andrea. "Learning from noisy data through robust feature selection, ensembles and simulation-based optimization." Doctoral thesis, Università degli studi di Trento, 2019. https://hdl.handle.net/11572/367772.

Abstract:

The presence of noise and uncertainty in real scenarios makes machine learning a challenging task. Acquisition errors or missing values can lead to models that do not generalize well on new data. Under-fitting and over-fitting can occur because of feature redundancy in high-dimensional problems as well as data scarcity. In these contexts the learning task can show difficulties in extracting relevant and stable information from noisy features or from a limited set of samples with high variance. In some extreme cases, the presence of only aggregated data instead of individual samples prevents the use of instance-based learning. In these contexts, parametric models can be learned through simulations to take into account the inherent stochastic nature of the processes involved. This dissertation includes contributions to different learning problems characterized by noise and uncertainty. In particular, we propose i) a novel approach for robust feature selection based on the neighborhood entropy, ii) an approach based on ensembles for robust salary prediction in the IT job market, and iii) a parametric simulation-based approach for dynamic pricing and what-if analyses in hotel revenue management when only aggregated data are available.

34

Mariello, Andrea. "Learning from noisy data through robust feature selection, ensembles and simulation-based optimization." Doctoral thesis, University of Trento, 2019. http://eprints-phd.biblio.unitn.it/3545/1/tesi_mariello.pdf.

Abstract:

The presence of noise and uncertainty in real scenarios makes machine learning a challenging task. Acquisition errors or missing values can lead to models that do not generalize well on new data. Under-fitting and over-fitting can occur because of feature redundancy in high-dimensional problems as well as data scarcity. In these contexts the learning task can show difficulties in extracting relevant and stable information from noisy features or from a limited set of samples with high variance. In some extreme cases, the presence of only aggregated data instead of individual samples prevents the use of instance-based learning. In these contexts, parametric models can be learned through simulations to take into account the inherent stochastic nature of the processes involved. This dissertation includes contributions to different learning problems characterized by noise and uncertainty. In particular, we propose i) a novel approach for robust feature selection based on the neighborhood entropy, ii) an approach based on ensembles for robust salary prediction in the IT job market, and iii) a parametric simulation-based approach for dynamic pricing and what-if analyses in hotel revenue management when only aggregated data are available.

35

Jayal, Ambikesh. "Framework to manage labels for e-assessment of diagrams." Thesis, Brunel University, 2010. http://bura.brunel.ac.uk/handle/2438/4496.

Abstract:

Automatic marking of coursework has many advantages in terms of resource benefits and consistency. Diagrams are quite common in many domains, including computer science, but marking them automatically is a challenging task. There has been previous research to accomplish this, but results to date have been limited. Much of the meaning of a diagram is contained in its labels, and in order to mark diagrams automatically the labels need to be understood. However, the choice of labels used by students in a diagram is largely unrestricted, and the diversity of labels can be a problem during matching. This thesis has measured the extent of the diagram label matching problem and proposed and evaluated a configurable, extensible framework to solve it. A new hybrid syntax matching algorithm has also been proposed and evaluated. This hybrid approach is based on multiple existing syntax algorithms. Experiments were conducted on a corpus of coursework which was large-scale, realistic and representative of UK HEI students. The results show that diagram label matching is a substantial problem and cannot be easily avoided in the e-assessment of diagrams. The results also show that the hybrid approach performed better than the three existing syntax algorithms, and that the framework has been effective but only to a limited extent and needs to be further refined for the semantic stage. The framework proposed in this thesis is configurable and extensible: it can be extended to include other algorithms and sets of parameters. The framework uses configuration XML, dynamic loading of classes and two design patterns, namely the strategy design pattern and the facade design pattern. A software prototype implementation of the framework has been developed in order to evaluate it. Finally, this thesis also contributes the corpus of coursework and an open-source software implementation of the proposed framework. Since the framework is configurable and extensible, its software implementation can be extended and used by the research community.
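
The strategy and facade patterns named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the class names, the two similarity measures (an edit-based ratio from `difflib` and token overlap), the averaging rule used for the hybrid and the 0.8 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


class EditRatioMatcher:
    """Strategy 1: character-level similarity (hypothetical choice)."""

    def score(self, a: str, b: str) -> float:
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()


class TokenOverlapMatcher:
    """Strategy 2: Jaccard overlap of whitespace-separated tokens."""

    def score(self, a: str, b: str) -> float:
        ta, tb = set(a.lower().split()), set(b.lower().split())
        if not ta and not tb:
            return 1.0
        return len(ta & tb) / len(ta | tb)


class HybridMatcher:
    """Strategy 3: averages the scores of several other strategies."""

    def __init__(self, strategies):
        self.strategies = list(strategies)

    def score(self, a: str, b: str) -> float:
        return sum(s.score(a, b) for s in self.strategies) / len(self.strategies)


class LabelMatchingFacade:
    """Facade: one entry point that hides which strategy is configured."""

    def __init__(self, strategy, threshold: float = 0.8):
        self.strategy = strategy
        self.threshold = threshold

    def matches(self, student_label: str, model_label: str) -> bool:
        return self.strategy.score(student_label, model_label) >= self.threshold


facade = LabelMatchingFacade(HybridMatcher([EditRatioMatcher(), TokenOverlapMatcher()]))
```

Because every matcher exposes the same `score` method, a different algorithm can be swapped in by changing only the object passed to the facade, which is the kind of extensibility the abstract describes.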

APA, Harvard, Vancouver, ISO, and other styles

36

Bista, Shachi. "Extracting Adverse Drug Reactions from Product Labels using Deep Learning and Natural Language Processing." Thesis, KTH, Skolan för kemi, bioteknologi och hälsa (CBH), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-277815.

Full text

Abstract:

Pharmacovigilance relates to activities involving drug safety monitoring in the post-marketing phase of the drug development life-cycle. Despite the rigorous trials and experiments that drugs undergo before they are available on the market, they can still cause previously unobserved side-effects (also known as adverse events) due to drug–drug interactions or genetic, physiological or demographic reasons. The Uppsala Monitoring Centre (UMC) is the custodian of the global reporting system, VigiBase, for adverse drug reactions in collaboration with the World Health Organization (WHO). VigiBase houses over 20 million case reports of suspected adverse drug reactions from all around the world. However, not all case reports that the UMC receives pertain to adverse reactions that are novel in the safety profile of the drugs. In fact, many of the reported reactions found in the database are known adverse events for the reported drugs. With more than 3 million potential associations between all possible drugs and all possible adverse events present in the database, identifying associations that are likely to represent previously unknown safety concerns requires powerful statistical methods and knowledge of the known safety profiles of the drugs. Therefore, there is a need for a knowledge base with mappings of drugs to their known adverse reactions. To date, such a knowledge base does not exist. The purpose of this thesis is to develop a deep-learning model that learns to extract adverse reactions from product labels (regulatory documents providing the current state of knowledge of the safety profile of a given product) and map them to a standardized terminology with high precision.
To achieve this, I propose a two-phase algorithm, with a first scanning phase aimed at finding regions of the text representing adverse reactions, and a second mapping phase aimed at normalizing the detected text fragments into Medical Dictionary for Regulatory Activities (MedDRA) terms, the terminology used at the UMC to represent adverse reactions. A previous dictionary-based algorithm developed at the UMC achieved a scanning F1 of 0.42 (0.31 precision, 0.66 recall) and a mapping macro-averaged F1 of 0.43 (0.39 macro-averaged precision, 0.64 macro-averaged recall). State-of-the-art methods achieve F1 above 0.8 and above 0.7 for the scanning and mapping problems respectively. To develop algorithms for adverse reaction extraction, I use the 2019 ADE Evaluation Challenge data, a dataset made by the FDA with 100 product labels annotated for adverse events and their mappings to MedDRA. This thesis explores three architectures for the scanning problem: 1) a Bidirectional Long Short-Term Memory (BiLSTM) encoder followed by a softmax classifier, 2) a BiLSTM encoder with a Conditional Random Field (CRF) classifier and, finally, 3) a BiLSTM encoder with a CRF classifier and Embeddings from Language Model (ELMo) embeddings. For the mapping problem, I explore Information Retrieval techniques using the search engines whoosh and Solr, as well as a Learning to Rank algorithm. The BiLSTM encoder with CRF gave the highest performance on finding the adverse events in the texts, with an F1 of 0.67 (0.75 precision, 0.61 recall), representing a 0.06 absolute increase in F1 over the simpler BiLSTM encoder with softmax. Using the ELMo embeddings proved detrimental and lowered the F1 to 0.62. Error analysis revealed the adopted Inside, Beginning, Outside (IOB2) labelling scheme to be poorly adapted for denoting discontinuous and compound spans, while introducing ambiguity in the training data.
Based on the gold standard annotated mappings, I also evaluated the whoosh and Solr search engines, with and without Learning to Rank. The best performing search engine on this data was Solr, with a macro-averaged F1 of 0.49 compared to the macro-averaged F1 of 0.47 for the whoosh search engine. Adding a Learning to Rank algorithm on top of each engine did not improve mapping performance, as the macro-averaged F1 of both engines dropped by over 0.1 when using the re-ranking approach. Finally, the best performing scanning and mapping algorithms beat the aforementioned dictionary-based baseline F1 by 0.25 in the scanning phase and 0.06 in the mapping phase. A large source of error for the Solr search engine came from tokenisation issues, which had a detrimental impact on the performance of the entire pipeline. In conclusion, modern Natural Language Processing (NLP) techniques can significantly improve the performance of adverse event detection from free-form text compared to dictionary-based approaches, especially in cases where context is important.
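
Two ideas in this abstract lend themselves to a short worked example: the F1 score as the harmonic mean of precision and recall, and the IOB2 tagging scheme used for the scanning phase. The sketch below is illustrative only and is not taken from the thesis; the function names, the `ADR` tag name and the example sentence are invented.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def to_iob2(tokens, spans):
    """Tag tokens with IOB2 given contiguous (start, end) token spans, end exclusive."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B-ADR"
        for i in range(start + 1, end):
            tags[i] = "I-ADR"
    return tags


# Scanning scores reported in the abstract: dictionary baseline and BiLSTM + CRF.
baseline_f1 = round(f1(0.31, 0.66), 2)
bilstm_crf_f1 = round(f1(0.75, 0.61), 2)

# Hypothetical sentence with two contiguous mentions.
tokens = ["Patients", "reported", "severe", "headache", "and", "nausea"]
tags = to_iob2(tokens, [(2, 4), (5, 6)])
```

The two rounded values agree with the scanning F1 scores quoted in the abstract (0.42 and 0.67). Note that `to_iob2` can only encode contiguous spans: a mention whose tokens are separated by unrelated words has no single (start, end) representation, which is the kind of limitation the error analysis attributes to IOB2.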

APA, Harvard, Vancouver, ISO, and other styles

37

Sävhammar, Simon. "Uniform interval normalization : Data representation of sparse and noisy data sets for machine learning." Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-19194.

Full text

Abstract:

The uniform interval normalization technique is proposed as an approach to handle sparse data and noise in the data. The technique is evaluated by transforming and normalizing the MoodMapper and Safebase data sets, and its predictive capabilities are compared by forecasting the data sets with an LSTM model. The results are compared to both the commonly used MinMax normalization technique and MinMax normalization with a time2vec layer. Uniform interval normalization was found to perform better on both the sparse MoodMapper data set and the denser Safebase data set. Future work consists of studying the performance of uniform interval normalization on other data sets and with other machine learning models.
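
For reference, the MinMax baseline mentioned in this abstract rescales each value linearly into the range [0, 1]. The sketch below shows only that baseline; the abstract does not describe how uniform interval normalization works, so it is not reproduced here. The function name and the sample values are illustrative.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no range information; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


readings = [3.0, 7.0, 5.0, 11.0]
scaled = min_max_normalize(readings)
```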

APA, Harvard, Vancouver, ISO, and other styles

38

Mirylenka, Katsiaryna. "Mining and Learning in Sequential Data Streams: Interesting Correlations and Classification in Noisy Settings." Doctoral thesis, Università degli studi di Trento, 2015. https://hdl.handle.net/11572/368620.

Full text

Abstract:

Sequential data streams describe a variety of real life processes: from sensor readings of natural phenomena to robotics, moving trajectories and network monitoring scenarios. An item in a sequential data stream often depends on its previous values, subsequent items being strongly correlated. In this thesis we address the problem of extracting the most significant sequential patterns from a data stream, with applications to real-time data summarization and classification and estimating generative models of the data. The first contribution of this thesis is the notion of Conditional Heavy Hitters, which describes the items that are frequent conditionally – that is, within the context of their parent item. Conditional Heavy Hitters are useful in a variety of applications in sensor monitoring, analysis, Markov chain modeling, and more. We develop algorithms for efficient detection of Conditional Heavy Hitters depending on the characteristics of the data, and provide analytical quality guarantees for their performance. We also study the behavior of the proposed algorithms for different types of data and demonstrate the efficacy of our methods by experimental evaluation on several synthetic and real-world datasets. The second contribution of the thesis is the extension of Conditional Heavy Hitters to patterns of variable order, which we formalize in the notion of Variable Order Conditional Heavy Hitters. The significance of the variable order patterns is measured in terms of high conditional and joint probability and their difference from the independent case in terms of statistical significance. The approximate online solution in the variable order case exploits lossless compression approaches. Facing the tradeoff between memory usage and accuracy of the pattern extraction, we introduce several online space pruning strategies and study their quality guarantees. 
The strategies can be chosen depending on the estimation objectives, such as maximizing the precision or recall of extracted significant patterns. The efficiency of our approach is experimentally evaluated on three real datasets. The last contribution of the thesis is related to the prediction quality of the classical and sequential classification algorithms under varying levels of label noise. We present the "Sigmoid Rule" Framework, which allows choosing the most appropriate learning algorithm depending on the properties of the data. The framework uses an existing model of the expected performance of learning algorithms as a sigmoid function of the signal-to-noise ratio in the training instances. Based on the sigmoid parameters we define a set of intuitive criteria that are useful for comparing the behavior of learning algorithms in the presence of noise. Furthermore, we show that there is a connection between these parameters and the characteristics of the underlying dataset, hinting at how the inherent properties of a dataset affect learning. The framework is applicable to concept drift scenarios, including modeling user behavior over time, and mining of noisy time series of evolving nature.
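
The idea of an item being frequent within the context of its parent can be sketched with exact counting over consecutive pairs. This is a simplification for illustration: the thesis develops streaming algorithms that work under memory constraints, whereas this version stores every count, and the thresholds `min_prob` and `min_count` are assumed parameters, not ones taken from the thesis.

```python
from collections import Counter


def conditional_heavy_hitters(stream, min_prob=0.5, min_count=2):
    """Return (parent, child, P(child | parent)) for frequent transitions."""
    pair_counts = Counter(zip(stream, stream[1:]))
    parent_counts = Counter(stream[:-1])
    result = []
    for (parent, child), n in pair_counts.items():
        prob = n / parent_counts[parent]
        if n >= min_count and prob >= min_prob:
            result.append((parent, child, prob))
    # Highest conditional probability first, ties broken by item order.
    return sorted(result, key=lambda t: (-t[2], t[0], t[1]))


stream = list("abababacab")
hitters = conditional_heavy_hitters(stream)
```

In this toy stream, "a" follows "b" every time and "b" follows "a" four times out of five, so both transitions are reported, while the transitions involving "c" are seen only once and are filtered out by `min_count`.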

APA, Harvard, Vancouver, ISO, and other styles

39

Mirylenka, Katsiaryna. "Mining and Learning in Sequential Data Streams: Interesting Correlations and Classification in Noisy Settings." Doctoral thesis, University of Trento, 2015. http://eprints-phd.biblio.unitn.it/1398/1/mirylenka.pdf.

Full text

APA, Harvard, Vancouver, ISO, and other styles

40

Roy, Sujan K. "Kalman Filtering with Machine Learning Methods for Speech Enhancement." Thesis, Griffith University, 2021. http://hdl.handle.net/10072/404456.

Full text

Abstract:

Speech corrupted by background noise (noisy speech) can reduce the efficiency of human-to-human and human-to-machine communication. A speech enhancement algorithm (SEA) can be used to suppress the embedded background noise and increase the quality and intelligibility of noisy speech. Many applications, such as speech communication systems, hearing aid devices, and speech recognition systems, typically rely upon speech enhancement algorithms for robustness. This dissertation focuses on single-channel speech enhancement using Kalman filtering with machine learning methods. In Kalman filter (KF)-based speech enhancement, each clean speech frame is represented by an auto-regressive (AR) process, whose parameters comprise the linear prediction coefficients (LPCs) and the prediction error variance. The LPC parameters and the additive noise variance are used to form the recursive equations of the KF. In the augmented KF (AKF), both the clean speech and additive noise LPC parameters are incorporated into an augmented matrix to construct the recursive equations of the AKF. Given a frame of noisy speech samples, the KF and AKF give a linear MMSE estimate of the clean speech samples using the recursive equations. Usually, inaccurate estimates of the parameters introduce bias in the KF and AKF gain, leading to a degradation in speech enhancement performance. The research contributions in this dissertation can be grouped into three focus areas. In the first work, we propose an iterative KF (IT-KF) to offset the bias in the KF gain for speech enhancement by utilizing the parameters in real-life noise conditions. In the second work, we jointly incorporate the robustness and sensitivity metrics to offset the bias in the KF and AKF gain, which addresses speech enhancement in real-life noise conditions. The third focus area consists of the deep neural network (DNN) and whitening filter assisted KF and AKF for speech enhancement.
Specifically, DNN and whitening filter-based approaches utilize the parameter estimates for the KF and AKF for speech enhancement. However, the whitening filter still produces biased speech LPC estimates for the KF and AKF, resulting in degraded speech. To address this, we propose a DeepLPC framework constructed with the state-of-the-art residual network and temporal convolutional network (ResNet-TCN) to jointly estimate the speech and noise LPC parameters from the noisy speech for the AKF. Recently, the multi-head self-attention network (MHANet) has demonstrated the ability to model the long-term dependencies of noisy speech more efficiently than ResNet-TCN. Therefore, we employ the MHANet within DeepLPC, termed DeepLPC-MHANet, to further improve the speech and noise LPC parameter estimates for the AKF. Finally, we perform a comprehensive study of four different training targets for LPC estimation using ResNet-TCN and MHANet. This study aims to determine which training target and which DNN method produce accurate speech and noise LPC parameters for practical AKF-based speech enhancement. Objective and subjective scores demonstrate that the methods proposed in this dissertation produce enhanced speech with higher quality and intelligibility than the competing methods in various noise conditions for a wide range of signal-to-noise ratio (SNR) levels.
Thesis (PhD Doctorate)
Doctor of Philosophy (PhD)
School of Eng & Built Env
Science, Environment, Engineering and Technology
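
The Kalman recursion for an auto-regressive signal observed in additive noise, as outlined in the abstract, can be sketched for the simplest case: a first-order AR model with known parameters. This is a toy illustration rather than the dissertation's method; real speech enhancement uses higher-order LPCs estimated per frame, and the coefficient `a`, the variances `q` and `r`, and the synthetic signal are all assumptions.

```python
import random


def kalman_ar1(noisy, a, q, r):
    """Estimate x[t] = a*x[t-1] + w[t] from y[t] = x[t] + v[t].

    q is the variance of w (prediction error), r the variance of v (noise).
    """
    x_est, p = 0.0, q / (1 - a * a)  # start from the stationary prior
    estimates = []
    for y in noisy:
        # Predict the next sample from the AR model.
        x_pred = a * x_est
        p_pred = a * a * p + q
        # Update with the noisy observation using the Kalman gain.
        gain = p_pred / (p_pred + r)
        x_est = x_pred + gain * (y - x_pred)
        p = (1 - gain) * p_pred
        estimates.append(x_est)
    return estimates


def mse(xs, ys):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)
a, q, r = 0.9, 1.0, 1.0
clean, x = [], 0.0
for _ in range(2000):
    x = a * x + rng.gauss(0.0, q ** 0.5)
    clean.append(x)
noisy = [c + rng.gauss(0.0, r ** 0.5) for c in clean]
enhanced = kalman_ar1(noisy, a, q, r)
```

Comparing `mse(enhanced, clean)` with `mse(noisy, clean)` shows the filter reducing the error when `a`, `q` and `r` are known exactly. The abstract's point is that in practice these parameters must be estimated from noisy speech, and poor estimates bias the gain.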

APA, Harvard, Vancouver, ISO, and other styles

41

Nguyen, Thanh Tan. "Selected non-convex optimization problems in machine learning." Thesis, Queensland University of Technology, 2020. https://eprints.qut.edu.au/200748/1/Thanh_Nguyen_Thesis.pdf.

Full text

Abstract:

Non-convex optimization is an important and rapidly growing research area. It is tied to the recent success of deep learning, reinforcement learning, matrix factorization, and more. As a contribution to this area, this thesis provides analyses and algorithms for three important problems. The first is the optimization of noisy functions defined on a large graph, which is useful for A/B testing and digital marketing. The second is learning a convex ensemble of basis models, with applications in regression and classification. The last is the optimization of ResNet with restricted residual modules, which leads to better performance than standard ResNet.

APA, Harvard, Vancouver, ISO, and other styles

42

Young, William Albert II. "LEARNING RATES WITH CONFIDENCE LIMITS FOR JET ENGINE MANUFACTURING PROCESSES AND PART FAMILIES FROM NOISY DATA." Ohio University / OhioLINK, 2005. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1131637106.

Full text

APA, Harvard, Vancouver, ISO, and other styles

43

Tandon, Prateek. "Bayesian Aggregation of Evidence for Detection and Characterization of Patterns in Multiple Noisy Observations." Research Showcase @ CMU, 2015. http://repository.cmu.edu/dissertations/658.

Full text

Abstract:

Effective use of Machine Learning to support extracting maximal information from limited sensor data is one of the important research challenges in robotic sensing. This thesis develops techniques for detecting and characterizing patterns in noisy sensor data. Our Bayesian Aggregation (BA) algorithmic framework can leverage data fusion from multiple low Signal-To-Noise Ratio (SNR) sensor observations to boost the capability to detect and characterize the properties of a signal-generating source or process of interest. We illustrate our research with application to the nuclear threat detection domain. The developed algorithms are applied to the problem of processing the large amounts of gamma ray spectroscopy data that can be produced in real time by mobile radiation sensors. The thesis experimentally shows BA’s capability to boost sensor performance in detecting radiation sources of interest, even if the source is faint, partially occluded, or enveloped in the noisy and variable radiation background characteristic of urban scenes. In addition, BA provides simultaneous inference of source parameters such as the source intensity or source type while detecting it. The thesis demonstrates this capability and also develops techniques to efficiently optimize these parameters over large possible setting spaces. Methods developed in this thesis are demonstrated both in simulation and in a radiation-sensing backpack that applies robotic localization techniques to enable indoor surveillance of radiation sources. The thesis further improves the BA algorithm’s capability to be robust under various detection scenarios. First, we augment BA with appropriate statistical models to improve estimation of signal components in low photon count detection, where the sensor may receive limited photon counts from the source, the background, or both.
Second, we develop methods for online sensor reliability monitoring to create algorithms that are resilient to possible sensor faults in a data pipeline containing one or multiple sensors. Finally, we develop Retrospective BA, a variant of BA that allows reinterpretation of past sensor data in light of new information about percepts. These Retrospective capabilities include the use of Hidden Markov Models in BA to allow automatic correction of a sensor pipeline when sensor malfunction may occur, an Anomaly-Match search strategy to efficiently optimize source hypotheses, and prototyping of a Multi-Modal Augmented PCA to more flexibly model background and nuisance source fluctuations in a dynamic environment.
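
The benefit of pooling several weak observations can be illustrated with a generic sum of log-likelihood ratios under a Poisson counting model. This is only a sketch of the general principle, not the thesis's Bayesian Aggregation algorithm; the background and source rates and the count values are invented for the example, and observations are assumed independent.

```python
import math


def poisson_log_pmf(k: int, rate: float) -> float:
    """Log-probability of observing k counts under a Poisson(rate) model."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def aggregated_llr(counts, background, source):
    """Sum of per-observation log-likelihood ratios: source present vs absent.

    Assumes observations are independent given the hypothesis.
    """
    return sum(
        poisson_log_pmf(k, background + source) - poisson_log_pmf(k, background)
        for k in counts
    )


background_rate, source_rate = 100.0, 5.0  # weak source: 5% above background
single = aggregated_llr([106], background_rate, source_rate)
pooled = aggregated_llr([106, 104, 107, 105, 103, 108], background_rate, source_rate)
```

Each individual observation gives only slight evidence for the source, but the evidence accumulates: `pooled` is several times larger than `single`, while counts that sit at or below the background rate push the sum negative.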

APA, Harvard, Vancouver, ISO, and other styles

44

Jones, Nelda Morreau Lanny E. Lian Ming-Gon John. "Relationship between special education diagnostic labels and placement characteristics of children in foster care." Normal, Ill. Illinois State University, 1996. http://wwwlib.umi.com/cr/ilstu/fullcit?p9633420.

Full text

Abstract:

Thesis (Ed. D.)--Illinois State University, 1996.
Title from title page screen, viewed May 23, 2006. Dissertation Committee: Lanny E. Morreau, Ming-Gon J. Lian (co-chairs), Keith E. Stearns, Kenneth H. Strand, Jeanne A. Howard. Includes bibliographical references (leaves 140-165) and abstract. Also available in print.

APA, Harvard, Vancouver, ISO, and other styles

45

Leoni, Cristian. "Interpretation of Dimensionality Reduction with Supervised Proxies of User-defined Labels." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-105622.

Full text

Abstract:

Research on machine learning (ML) explainability has received a lot of attention in recent times. The interest, however, has mostly focused on supervised models, while other ML fields have not received the same level of attention. Despite its usefulness in a variety of fields, unsupervised learning explainability is still an open issue. In this paper, we present a Visual Analytics framework based on eXplainable AI (XAI) methods to support the interpretation of dimensionality reduction (DR) methods. The framework provides the user with an interactive and iterative process to investigate and explain user-perceived patterns for a variety of DR methods by using XAI methods to explain a supervised method trained on the selected data. To evaluate the effectiveness of the proposed solution, we focus on two main aspects: the quality of the visualization and the quality of the explanation. This challenge is tackled using both quantitative and qualitative methods, and due to the lack of pre-existing test data, a new benchmark has been created. The quality of the visualization is established using a well-known survey-based methodology, while the quality of the explanation is evaluated using both case studies and a controlled experiment, in which the accuracy of the generated explanations is evaluated on the proposed benchmark. The results show a strong capacity of our framework to generate accurate explanations, with an accuracy of 89% in the controlled experiment. The explanations generated for the two case studies yielded results very similar to the ground truths in pre-existing, well-known literature. Finally, the user experiment generated high overall quality scores for all assessed aspects of the visualization.
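
The core idea, training a supervised model on user-defined labels and then explaining that model, can be sketched with the simplest possible proxy: a one-feature threshold rule (a decision stump). This is a hypothetical stand-in; the abstract does not state which supervised models or XAI methods the framework uses, and the feature names and data below are invented.

```python
def best_stump(rows, selected, feature_names):
    """Find the single-feature threshold rule that best reproduces a selection.

    rows: list of feature vectors; selected: list of bools marking the points
    the user picked in the projection. Returns (feature, threshold, direction,
    accuracy), where direction is ">" or "<=".
    """
    best = None
    for j, name in enumerate(feature_names):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds lie midway between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            for direction in (">", "<="):
                predicted = [
                    (row[j] > threshold) if direction == ">" else (row[j] <= threshold)
                    for row in rows
                ]
                accuracy = sum(p == s for p, s in zip(predicted, selected)) / len(rows)
                if best is None or accuracy > best[3]:
                    best = (name, threshold, direction, accuracy)
    return best


rows = [[1.0, 5.2], [1.2, 0.3], [0.9, 4.8], [3.1, 5.0], [3.3, 0.1], [2.9, 4.9]]
selected = [False, False, False, True, True, True]  # cluster picked by the user
rule = best_stump(rows, selected, ["width", "height"])
```

Here the returned rule states that the selected points are those with a large `width`, which serves as a human-readable explanation of a pattern the user marked in the projection.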

APA, Harvard, Vancouver, ISO, and other styles

46

Reich, Christian [Verfasser], and Laerhoven Kristof [Gutachter] Van. "Learning machine monitoring models from sparse and noisy sensor data annotations / Christian Reich ; Gutachter: Kristof Van Laerhoven." Siegen : Universitätsbibliothek der Universität Siegen, 2020. http://d-nb.info/122050615X/34.

Full text

APA, Harvard, Vancouver, ISO, and other styles

47

Reich, Christian [Verfasser], and Kristof Van [Gutachter] Laerhoven. "Learning machine monitoring models from sparse and noisy sensor data annotations / Christian Reich ; Gutachter: Kristof Van Laerhoven." Siegen : Universitätsbibliothek der Universität Siegen, 2020. http://nbn-resolving.de/urn:nbn:de:hbz:467-17183.

Full text

APA, Harvard, Vancouver, ISO, and other styles

48

Zlicar, Blaz. "Algorithms for noisy and nonstationary data : advances in financial time series forecasting and pattern detection with machine learning." Thesis, University College London (University of London), 2018. http://discovery.ucl.ac.uk/10043123/.

Full text

APA, Harvard, Vancouver, ISO, and other styles

49

Kraus, Vivien. "Apprentissage semi-supervisé pour la régression multi-labels : application à l’annotation automatique de pneumatiques." Thesis, Lyon, 2021. https://tel.archives-ouvertes.fr/tel-03789608.

Full text

Abstract:

With the advent and rapid growth of digital technologies, data has become both a precious asset and a plentiful one. However, such abundance raises questions about data quality and labelling. Because the volume of available data keeps growing while labelling by human experts remains very costly, it is increasingly necessary to strengthen semi-supervised learning by exploiting unlabeled data. This problem is all the more noticeable in the multi-label learning framework, and in particular for regression, where each statistical unit is guided by several different targets that take the form of numerical scores. This thesis focuses on this fundamental framework. First, we propose a method for semi-supervised regression, which we put to the test through a detailed experimental study. Building on this new method, we present a second contribution that is better suited to the multi-label framework. We also show its effectiveness with a comparative study on data sets from the literature. Furthermore, the dimensionality of the problem remains a persistent difficulty in machine learning, and reducing it attracts the interest of many researchers. Feature selection is one of the major tasks addressing this problem, and we propose to study it here in a complex framework: semi-supervised, multi-label regression. Finally, an experimental validation is proposed on a real problem, the automatic annotation of tires, to meet the needs expressed by the industrial partner of this thesis.

APA, Harvard, Vancouver, ISO, and other styles

50

Manservigi, Lucrezia. "Detection and classification of faults and anomalies in gas turbine sensors by means of statistical filters and machine learning models." Doctoral thesis, Università degli studi di Ferrara, 2021. http://hdl.handle.net/11392/2478821.


Abstract:

Monitoring and diagnostics of gas turbines are essential, and can be carried out effectively only if the installed sensors provide a reliable measurement of the machine's operation. Sensor reliability is therefore a prerequisite for assessing the actual health state of the machine: a faulty sensor may provide misleading information, leading to production interruptions and higher maintenance costs. For this reason, this thesis develops, tunes and validates methodologies for detecting and classifying faults and anomalies affecting gas turbine sensors. This goal is pursued through two research activities. First, the Improved Detection, Classification and Integrated Diagnostics of Gas Turbine Sensors (I-DCIDS) tool is developed. I-DCIDS comprises two kernels, the Fault Detection Tool and the Sensor Overall Health State Analysis (SOHSA). The former detects and classifies the most frequent fault classes; the latter evaluates the overall health state of the sensor. The tool can assess the health state of both single sensors and redundant/correlated sensors, using basic mathematical laws that require some user-defined configuration parameters. A sensitivity analysis is therefore carried out on four heterogeneous and challenging field datasets referring to correlated sensors to derive the optimal setting of these parameters. I-DCIDS is then validated on an additional field dataset, which confirms its detection capability. I-DCIDS is also used to evaluate the health state of several single sensors by analyzing a large amount of field data covering six different physical quantities. These analyses provide rules of thumb for field operation, with the final aim of identifying the magnitude of a sensor fault and the time at which it occurs. The results demonstrate the diagnostic capability of I-DCIDS in a real-world scenario. Moreover, the methodology proves suitable for all types of datasets and physical quantities and, thanks to its optimal tuning, is also capable of identifying the actual time point of fault onset. The second study in this thesis concerns the reliability of the acquired raw data, which may be compromised by process anomalies. Such anomalies, rarely investigated in the literature, may introduce errors whereby the unit of measure of a sensor is wrongly assigned. In this thesis, this situation is named Unit of Measure Inconsistency (UMI). The second objective of the thesis is therefore to identify the approach best able to detect UMI and to assign the correct unit of measure to data lacking this information. To this end, three supervised Machine Learning classifiers are investigated, namely Support Vector Machine, Naïve Bayes and K-Nearest Neighbors. In addition, a novel methodology, named Improved Nearest Neighbor, is proposed and investigated. The capability of each classifier is assessed through several analyses that examine how the reliability of the training data and the number of classes affect performance. Among the tested approaches, the Naïve Bayes classifier and the novel Improved Nearest Neighbor prove the most promising in terms of effectiveness, robustness and general validity in the majority of the cases considered. With the selected classifiers, the correct unit of measure of raw data can be assigned and subsequent sensor diagnosis can be performed reliably. Finally, all analyses reported in this thesis use field data acquired from sensors installed on Siemens gas turbines.

