Click this link to see other types of publications on this topic: PREDICTION DATASET.

Doctoral dissertations on the topic "PREDICTION DATASET"

Create a correct reference in APA, MLA, Chicago, Harvard, and many other styles

Browse the top 50 doctoral dissertations on the topic "PREDICTION DATASET".

An "Add to bibliography" button is available next to each work in the bibliography. Use it, and we will automatically create a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the scholarly publication as a .pdf and read its abstract online, if the relevant parameters are available in the metadata.

Browse doctoral dissertations from a wide range of disciplines and compile a suitable bibliography.

1

Klus, Petr, 1985. "'The Clever machine' - a computational tool for dataset exploration and prediction". Doctoral thesis, Universitat Pompeu Fabra, 2016. http://hdl.handle.net/10803/482051.

Full text of the source
Abstract:
The purpose of my doctoral studies was to develop an algorithm for large-scale analysis of protein sets. This thesis outlines the methodology and technical work performed as well as relevant biological cases involved in creation of the core algorithm, the cleverMachine (CM), and its extensions multiCleverMachine (mCM) and cleverGO. The CM and mCM provide characterisation and classification of protein groups based on physico-chemical features, along with protein abundance and Gene Ontology annotation information, to perform an accurate data exploration. My method provides both computational and experimental scientists with a comprehensive, easy to use interface for high-throughput protein sequence screening and classification.
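The characterisation step described above scores protein groups on physico-chemical properties. As a rough, self-contained illustration of that idea (not CM's actual feature set or scoring), the sketch below computes one classic property, Kyte-Doolittle hydropathy (GRAVY), for two toy groups of sequences; the sequences and group names are made up.

```python
# Kyte-Doolittle hydropathy scale (standard published values).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def gravy(seq: str) -> float:
    """Grand average of hydropathy for one protein sequence."""
    return sum(KD[aa] for aa in seq) / len(seq)

# Two toy "groups" of sequences standing in for uploaded protein sets.
membrane_like = ["MLLAVLLLVIAG", "IVLAVFWWLLGA"]
soluble_like = ["MKDEESNRKQQE", "DSTQENKRDDES"]

for name, group in [("membrane-like", membrane_like), ("soluble-like", soluble_like)]:
    scores = [gravy(s) for s in group]
    print(f"{name:14s} mean GRAVY = {sum(scores) / len(scores):+.2f}")
```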
APA, Harvard, Vancouver, ISO, and other styles
2

Clayberg, Lauren (Lauren W. ). "Web element role prediction from visual information using a novel dataset". Thesis, Massachusetts Institute of Technology, 2020. https://hdl.handle.net/1721.1/132734.

Full text of the source
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, May, 2020
Cataloged from the official PDF of thesis.
Includes bibliographical references (pages 89-90).
Machine learning has enhanced many existing tech industries, including end-to-end test automation for web applications. One of the many goals that mabl and other companies have in this new tech initiative is to automatically gain insight into how web applications work. The task of web element role prediction is vital for the advancement of this newly emerging product category. I applied supervised visual machine learning techniques to the task. In addition, I created a novel dataset and present detailed attribute distribution and bias information. The dataset is used to provide updated baselines for performance using current day web applications, and a novel metric is provided to better quantify the performance of these models. The top performing model achieves an F1-score of 0.45 on ten web element classes. Additional findings include color distributions for different web element roles, and how some color spaces are more intuitive to humans than others.
by Lauren Clayberg.
M. Eng.
M.Eng. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
APA, Harvard, Vancouver, ISO, and other styles
3

Oppon, Ekow CruickShank. "Synergistic use of promoter prediction algorithms: a choice of small training dataset?" Thesis, University of the Western Cape, 2000. http://etd.uwc.ac.za/index.php?module=etd&action=viewtitle&id=gen8Srv25Nme4_8222_1185436339.

Full text of the source
Abstract:

Promoter detection, especially in prokaryotes, has always been an uphill task and may remain so, because of the many varieties of sigma factors employed by various organisms in transcription. The situation is made more complex by the fact that any seemingly unimportant sequence segment may be turned into a promoter sequence by an activator or repressor (if the actual promoter sequence is made unavailable). Nevertheless, a computational approach to promoter detection has to be pursued for a number of reasons. The obvious one that comes to mind is the long and tedious process involved in elucidating promoters in the 'wet' laboratories, not to mention the financial aspect of such endeavors. Promoter detection/prediction in an organism with few characterized promoters (M. tuberculosis), as envisaged at the beginning of this work, was never going to be easy. Even for the few known Mycobacterial promoters, most of the respective sigma factors associated with their transcription were not known. If that information (promoter-sigma) were available, the research would have been focused on categorizing the promoters according to sigma factors and training the methods on the respective categories. That is assuming that there would be enough training data for the respective categories. Most promoter detection/prediction studies have been carried out on E. coli because of the availability of a number of experimentally characterized promoters (approximately 310). Even then, no researcher to date has extended the research to the entire E. coli genome.

APA, Harvard, Vancouver, ISO, and other styles
4

Vandehei, Bailey R. "Leveraging Defects Life-Cycle for Labeling Defective Classes". DigitalCommons@CalPoly, 2019. https://digitalcommons.calpoly.edu/theses/2111.

Full text of the source
Abstract:
Data from software repositories are a very useful asset for building different kinds of models and recommender systems aimed at supporting software developers. Specifically, the identification of likely defect-prone files (i.e., classes in Object-Oriented systems) helps in prioritizing, testing, and analysis activities. This work focuses on automated methods for labeling a class in a version as defective or not. The most used methods for automated class labeling belong to the SZZ family and fail in various circumstances. Thus, recent studies suggest the use of the affected version (AV) as provided by developers and available in issue trackers such as JIRA. However, in many circumstances, the AV might not be used because it is unavailable or inconsistent. The aim of this study is twofold: 1) to measure the AV availability and consistency in open-source projects, and 2) to propose, evaluate, and compare to SZZ a new method for labeling defective classes, based on the idea that defects have a stable life-cycle in terms of the proportion of versions needed to discover the defect and to fix the defect. Results related to 212 open-source projects from the Apache ecosystem, featuring a total of about 125,000 defects, show that the AV cannot be used in the majority (51%) of defects. Therefore, it is important to investigate automated methods for labeling defective classes. Results related to 76 open-source projects from the Apache ecosystem, featuring a total of about 6,250,000 classes that are affected by 60,000 defects and spread over 4,000 versions and 760,000 commits, show that the proposed method for labeling defective classes is, on average among projects and defects, more accurate in terms of Precision, Kappa, F1 and MCC than all previously proposed SZZ methods. Moreover, the improvement in accuracy from combining SZZ with defects' life-cycle information is statistically significant but practically irrelevant; overall and on average, labeling via the defects' life-cycle alone is more accurate than any SZZ method.
APA, Harvard, Vancouver, ISO, and other styles
5

Sousa, Massáine Bandeira e. "Improving accuracy of genomic prediction in maize single-crosses through different kernels and reducing the marker dataset". Universidade de São Paulo, 2017. http://www.teses.usp.br/teses/disponiveis/11/11137/tde-07032018-163203/.

Full text of the source
Abstract:
In plant breeding, genomic prediction (GP) may be an efficient tool to increase the accuracy of selecting genotypes, mainly under multi-environment trials. This approach has the advantage of increasing genetic gains for complex traits and reducing costs. However, strategies are needed to increase the accuracy and reduce the bias of genomic estimated breeding values. In this context, the objectives were: i) to compare two strategies for obtaining marker subsets based on marker effects, regarding their impact on the prediction accuracy of genomic selection; and ii) to compare the accuracy of four GP methods including genotype × environment interaction and two kernels (GBLUP and Gaussian). We used a rice diversity panel (RICE) and two maize datasets (HEL and USP), evaluated for grain yield and plant height. Overall, the prediction accuracy and relative efficiency of genomic selection were increased using marker subsets, which has potential for building fixed arrays and reducing genotyping costs. Furthermore, using the Gaussian kernel and including the G×E effect increases the accuracy of the genomic prediction models.
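As a toy illustration of the two kernels compared in the abstract, the sketch below builds a GBLUP-type linear kernel and a Gaussian kernel from a simulated marker matrix and uses each in kernel ridge regression; the marker matrix, phenotype, bandwidth and regularisation are illustrative assumptions, not the thesis pipeline or its datasets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy marker matrix: n genotypes x p SNPs coded 0/1/2, and a toy phenotype.
n, p = 100, 500
M = rng.integers(0, 3, size=(n, p)).astype(float)
y = M[:, :10].sum(axis=1) + rng.normal(0, 2.0, n)

# GBLUP-type linear kernel from centred markers, and a Gaussian kernel
# with a median-heuristic bandwidth.
Z = M - M.mean(axis=0)
G = Z @ Z.T / Z.shape[1]
D2 = np.square(Z[:, None, :] - Z[None, :, :]).sum(axis=2)
K_gauss = np.exp(-D2 / np.median(D2))

def kernel_ridge_predict(K, y, train, test, lam=1.0):
    """Kernel ridge regression: alpha = (K_tt + lam*I)^-1 y_t."""
    K_tt = K[np.ix_(train, train)]
    alpha = np.linalg.solve(K_tt + lam * np.eye(len(train)), y[train])
    return K[np.ix_(test, train)] @ alpha

idx = rng.permutation(n)
train, test = idx[:80], idx[80:]
for name, K in [("GBLUP", G), ("Gaussian", K_gauss)]:
    pred = kernel_ridge_predict(K, y, train, test)
    acc = np.corrcoef(pred, y[test])[0, 1]   # predictive accuracy as correlation
    print(f"{name:8s} kernel, predictive correlation: {acc:.3f}")
```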
APA, Harvard, Vancouver, ISO, and other styles
6

Johansson, David. "Price Prediction of Vinyl Records Using Machine Learning Algorithms". Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-96464.

Full text of the source
Abstract:
Machine learning algorithms have been used for price prediction within several application areas. Examples include real estate, the stock market, tourist accommodation, electricity, art, cryptocurrencies, and fine wine. Common approaches in studies are to evaluate the accuracy of predictions and compare different algorithms, such as Linear Regression or Neural Networks. There is a thriving global second-hand market for vinyl records, but the research of price prediction within the area is very limited. The purpose of this project was to expand on existing knowledge within price prediction in general to evaluate some aspects of price prediction of vinyl records. That included investigating the possible level of accuracy and comparing the efficiency of algorithms. A dataset of 37000 samples of vinyl records was created with data from the Discogs website, and multiple machine learning algorithms were utilized in a controlled experiment. Among the conclusions drawn from the results was that the Random Forest algorithm generally generated the strongest results, that results can vary substantially between different artists or genres, and that a large part of the predictions had a good accuracy level, but that a relatively small amount of large errors had a considerable effect on the general results.
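A minimal sketch of the kind of controlled comparison described above, assuming a hypothetical vinyl_records.csv with made-up column names (the actual Discogs-derived dataset is not reproduced here):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical file and columns standing in for the scraped Discogs data.
df = pd.read_csv("vinyl_records.csv")
X = pd.get_dummies(df[["artist", "genre", "year", "condition", "want_have_ratio"]])
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=300, random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {mae:.2f}")
```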
APA, Harvard, Vancouver, ISO, and other styles
7

Baveye, Yoann. "Automatic prediction of emotions induced by movies". Thesis, Ecully, Ecole centrale de Lyon, 2015. http://www.theses.fr/2015ECDL0035/document.

Full text of the source
Abstract:
Never before have movies been as easily accessible to viewers, who can enjoy anywhere the almost unlimited potential of movies for inducing emotions. Thus, knowing in advance the emotions that a movie is likely to elicit in its viewers could help to improve the accuracy of content delivery, video indexing or even summarization. However, transferring this expertise to computers is a complex task, due in part to the subjective nature of emotions. The present thesis work is dedicated to the automatic prediction of emotions induced by movies, based on the intrinsic properties of the audiovisual signal. To computationally deal with this problem, a video dataset annotated along the emotions induced in viewers is needed. However, existing datasets are not public due to copyright issues or are of a very limited size and content diversity. To answer this specific need, this thesis addresses the development of the LIRIS-ACCEDE dataset. The advantages of this dataset are threefold: (1) it is based on movies under Creative Commons licenses and thus can be shared without infringing copyright, (2) it is composed of 9,800 good quality video excerpts with a large content diversity extracted from 160 feature films and short films, and (3) the 9,800 excerpts have been ranked through a pair-wise video comparison protocol along the induced valence and arousal axes using crowdsourcing. The high inter-annotator agreement reflects that annotations are fully consistent, despite the large diversity of raters' cultural backgrounds. Three other experiments are also introduced in this thesis. First, affective ratings were collected for a subset of the LIRIS-ACCEDE dataset in order to cross-validate the crowdsourced annotations. The affective ratings also made possible the learning of Gaussian Processes for Regression, modeling the noisiness of the measurements, to map the whole ranked LIRIS-ACCEDE dataset into the 2D valence-arousal affective space. Second, continuous ratings for 30 movies were collected in order to develop temporally relevant computational models. Finally, a last experiment was performed in order to collect continuous physiological measurements for the 30 movies used in the second experiment. The correlation between both modalities strengthens the validity of the results of the experiments. Armed with a dataset, this thesis presents a computational model to infer the emotions induced by movies. The framework builds on the recent advances in deep learning and takes into account the relationship between consecutive scenes. It is composed of two fine-tuned Convolutional Neural Networks. One is dedicated to the visual modality and uses as input crops of key frames extracted from video segments, while the second one is dedicated to the audio modality through the use of audio spectrograms. The activations of the last fully connected layer of both networks are concatenated to feed a Long Short-Term Memory Recurrent Neural Network that learns the dependencies between consecutive video segments. The performance obtained by the model is compared to the performance of a baseline similar to previous work and shows very promising results, but reflects the complexity of such tasks. Indeed, the automatic prediction of emotions induced by movies is still a very challenging task which is far from being solved.
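The architecture described above (two fine-tuned CNNs whose per-segment features feed an LSTM that outputs valence and arousal) can be summarised with a small PyTorch sketch. This is only a structural outline under assumed feature dimensions; it is not the thesis implementation and omits the CNN fine-tuning entirely.

```python
import torch
from torch import nn

class AffectPredictor(nn.Module):
    """Per-segment visual and audio feature vectors (assumed to come from two
    pretrained CNNs) are concatenated and fed to an LSTM that models the
    dependencies across consecutive segments; the head regresses valence and
    arousal for each segment."""
    def __init__(self, vis_dim=4096, aud_dim=4096, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(vis_dim + aud_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)          # valence, arousal

    def forward(self, vis_feats, aud_feats):
        x = torch.cat([vis_feats, aud_feats], dim=-1)   # (batch, segments, features)
        out, _ = self.lstm(x)
        return self.head(out)

# Toy forward pass: a batch of 2 movies, 16 consecutive segments each.
vis = torch.randn(2, 16, 4096)
aud = torch.randn(2, 16, 4096)
print(AffectPredictor()(vis, aud).shape)    # torch.Size([2, 16, 2])
```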
APA, Harvard, Vancouver, ISO, and other styles
8

Lamichhane, Niraj. "Prediction of Travel Time and Development of Flood Inundation Maps for Flood Warning System Including Ice Jam Scenario. A Case Study of the Grand River, Ohio". Youngstown State University / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=ysu1463789508.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
9

Rai, Manisha. "Topographic Effects in Strong Ground Motion". Diss., Virginia Tech, 2015. http://hdl.handle.net/10919/56593.

Full text of the source
Abstract:
Ground motions from earthquakes are known to be affected by the earth's surface topography. Topographic effects are a result of several physical phenomena, such as the focusing or defocusing of seismic waves reflected from a topographic feature and the interference between direct and diffracted seismic waves. This typically causes an amplification of ground motion on convex features such as hills and ridges and a de-amplification on concave features such as valleys and canyons. Topographic effects are known to be frequency dependent, and the spectral accelerations can sometimes reach high values, causing significant damage to the structures located on the feature. Topographically correlated damage patterns have been observed in several earthquakes, and topographic amplifications have also been observed in several recorded ground motions. This phenomenon has also been extensively studied through numerical analyses. Even though different studies agree on the nature of topographic effects, quantifying these effects has been challenging. The current literature has no consensus on how to predict topographic effects at a site. With population centers growing around regions of high seismicity and prominent topographic relief, such as California and Japan, the quantitative estimation of these effects has become very important. In this dissertation, we address this shortcoming by developing empirical models that predict topographic effects at a site. These models are developed through an extensive empirical study of recorded ground motions from two large strong-motion datasets, namely the California small-to-medium-magnitude earthquake dataset and the global NGA-West2 dataset, and propose topographic modification factors that quantify the expected amplification or deamplification at a site. To develop these models, we required a parameterization of topography. We developed two types of topographic parameters at each recording station. The first type of parameter is developed using the elevation data around the stations, and comprises parameters such as smoothed slope, smoothed curvature, and relative elevation. The second type of parameter is developed using a series of simplistic 2D numerical analyses. These numerical analyses compute an estimate of the expected 2D topographic amplification of a simple wave at a site in several different directions. These 2D amplifications are used to develop a family of parameters at each site. We study the trends in the ground motion model residuals with respect to these topographic parameters to determine if the parameters can capture topographic effects in the recorded data. We use statistical tests to determine if the trends are significant, and perform mixed effects regression on the residuals to develop functional forms that can be used to predict topographic effects at a site. Finally, we compare the two types of parameters and their topographic predictive power.
Ph. D.
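The first family of topographic parameters mentioned above (smoothed slope, curvature and relative elevation derived from elevation data) can be illustrated with a short numpy sketch on a toy digital elevation model; the window sizes, the toy DEM and the exact definitions here are assumptions for illustration, not the dissertation's parameterization.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

# Toy digital elevation model: a single smooth hill on a 30 m grid, plus noise.
x = np.linspace(0, 3000, 101)
X, Y = np.meshgrid(x, x)
dem = 200 * np.exp(-((X - 1500) ** 2 + (Y - 1500) ** 2) / (2 * 500 ** 2))
dem += rng.normal(0, 1.0, dem.shape)
spacing = 30.0

# Smooth elevation over a ~300 m window, then compute slope, curvature and
# relative elevation (height above the mean of a ~1 km neighbourhood).
smoothed = ndimage.uniform_filter(dem, size=11)
gy, gx = np.gradient(smoothed, spacing)
slope = np.degrees(np.arctan(np.hypot(gx, gy)))
gxx = np.gradient(gx, spacing, axis=1)
gyy = np.gradient(gy, spacing, axis=0)
curvature = gxx + gyy                                    # Laplacian-style curvature
rel_elev = dem - ndimage.uniform_filter(dem, size=34)

# Values at a "recording station" near the hill top vs. one on the flank.
for name, (i, j) in [("hill top", (50, 50)), ("flank", (50, 70))]:
    print(f"{name:9s} slope={slope[i, j]:5.2f} deg  curvature={curvature[i, j]:+.5f}  "
          f"rel. elev.={rel_elev[i, j]:+7.1f} m")
```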
APA, Harvard, Vancouver, ISO, and other styles
10

Cooper, Heather. "Comparison of Classification Algorithms and Undersampling Methods on Employee Churn Prediction: A Case Study of a Tech Company". DigitalCommons@CalPoly, 2020. https://digitalcommons.calpoly.edu/theses/2260.

Full text of the source
Abstract:
Churn prediction is a common data mining problem that many companies face across industries. Most commonly, customer churn has been studied extensively within the telecommunications industry, where there is low customer retention due to high market competition. Similar to customer churn, employee churn is very costly to a company, and without proper risk mitigation strategies, profits cannot be maximized and valuable employees may leave the company. The cost of replacing an employee is exponentially higher than the cost of retaining one, so it is in any company's best interest to prioritize employee retention. This research combines machine learning techniques with undersampling in hopes of identifying employees at risk of churn so retention strategies can be implemented before it is too late. Four different classification algorithms are tested on a variety of undersampled datasets in order to find the most effective undersampling and classification method for predicting employee churn. Statistical analysis is conducted on the appropriate evaluation metrics to find the most significant methods. The results of this study can be used by the company to target individuals at risk of churn so that risk mitigation strategies can be effective in retaining valuable employees. Methods and results can be tested and applied across different industries and companies.
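A minimal sketch of the undersampling-plus-classification workflow described above, with hand-rolled random undersampling and two common classifiers on synthetic imbalanced data; the real employee features, the four algorithms compared in the study and its evaluation protocol are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for an employee table: 12 numeric features, ~7% churners.
X = rng.normal(size=(5000, 12))
p = 1.0 / (1.0 + np.exp(-(X[:, 0] + 0.8 * X[:, 1] - 3.0)))
y = (rng.random(5000) < p).astype(int)

def random_undersample(X, y, rng):
    """Drop majority-class rows until both classes have the same size."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, keep])
    return X[idx], y[idx]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
X_bal, y_bal = random_undersample(X_tr, y_tr, rng)   # balance only the training split

for clf in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    clf.fit(X_bal, y_bal)
    print(type(clf).__name__, "F1 on the untouched test set:",
          round(f1_score(y_te, clf.predict(X_te)), 3))
```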
APA, Harvard, Vancouver, ISO, and other styles
11

Šalanda, Ondřej. "Strojové učení v úloze predikce vlivu nukleotidového polymorfismu". Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2015. http://www.nusl.cz/ntk/nusl-234918.

Full text of the source
Abstract:
This thesis brings a new approach to the prediction of the effect of nucleotide polymorphisms on the human genome. The main goal is to create a new meta-classifier which combines the predictions of several already implemented software classifiers. The novelty of the developed tool lies in using machine learning methods to find a consensus over those tools that enhances the accuracy and versatility of prediction. Final experiments show that, compared to the best integrated tool, the meta-classifier increases the area under the ROC curve by 3.4 on average, and normalized accuracy is improved by up to 7 %. The new classifying service is available at http://ll06.sci.muni.cz:6232/snpeffect/.
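A toy sketch of the consensus idea: treat each existing tool's score as a feature and train a meta-classifier on top, then compare its AUC with the best single tool. The three simulated "tool scores" below are stand-ins; the thesis's actual integrated predictors and feature set are not reproduced.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)

# Hypothetical per-variant scores from three existing tools, plus ground-truth labels.
n = 2000
truth = rng.integers(0, 2, n)
tool_scores = np.column_stack([
    np.clip(truth + rng.normal(0, s, n), 0, None) for s in (0.8, 1.0, 1.3)
])

# Meta-classifier: learn a consensus over the individual tool outputs,
# evaluated with out-of-fold predictions.
meta = LogisticRegression(max_iter=1000)
meta_score = cross_val_predict(meta, tool_scores, truth, cv=5, method="predict_proba")[:, 1]

best_single = max(roc_auc_score(truth, tool_scores[:, j]) for j in range(tool_scores.shape[1]))
print("best single tool AUC:", round(best_single, 3))
print("meta-classifier AUC :", round(roc_auc_score(truth, meta_score), 3))
```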
APA, Harvard, Vancouver, ISO, and other styles
12

Phanse, Shruti. "Study on the performance of ontology based approaches to link prediction in social networks as the number of users increases". Thesis, Kansas State University, 2010. http://hdl.handle.net/2097/6914.

Full text of the source
Abstract:
Master of Science
Department of Computing and Information Sciences
Doina Caragea
Recent advances in social network applications have resulted in millions of users joining such networks in the last few years. User data collected from social networks can be used for various data mining problems such as interest recommendations, friendship recommendations and many more. Social networks, in general, can be seen as a huge directed network graph representing users of the network (together with their information, e.g., user interests) and their interactions (also known as friendship links). Previous work [Hsu et al., 2007] on friendship link prediction has shown that graph features contain important predictive information. Furthermore, it has been shown that user interests can be used to improve link predictions if they are organized into an explicit or implicit ontology [Haridas, 2009; Parimi, 2010]. However, the above mentioned previous studies have been performed using a small set of users in the social network LiveJournal. The goal of this work is to study the performance of the ontology based approach proposed in [Haridas, 2009] when the number of users in the dataset is increased. More precisely, we study the performance of the approach for data sets consisting of 1000, 2000, 3000 and 4000 users. Our results show that the performance generally increases with the number of users. However, the problem quickly becomes intractable from a computation time point of view. As part of our study, we also compare our results obtained using the ontology-based approach [Haridas, 2009] with results obtained with the LDA based approach in [Parimi, 2010], when such results are available.
APA, Harvard, Vancouver, ISO, and other styles
13

Andruccioli, Matteo. "Previsione del Successo di Prodotti di Moda Prima della Commercializzazione: un Nuovo Dataset e Modello di Vision-Language Transformer". Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021. http://amslaurea.unibo.it/24956/.

Full text of the source
Abstract:
Unlike traditional retail, in online commerce the customer cannot touch or try the product. The purchase decision is made on the basis of the data made available by the seller through the title, descriptions and images, and on the reviews of previous customers. It is therefore possible to predict how well a product will sell from this information. Most of the solutions currently found in the literature make predictions based on reviews, or analyze the language used in descriptions to understand how it influences sales. Reviews, however, are not known to sellers before a product goes on sale; moreover, using only textual data neglects the influence of images. The goal of this thesis is to use machine learning models to predict the sales success of a product from the information available to the seller before commercialization. This is done by introducing a cross-modal model based on a Vision-Language Transformer that performs classification. A model of this kind can help sellers maximize the sales success of their products. Because the literature lacks datasets containing information on products sold online together with an indication of their sales success, the work also includes the construction of a dataset suitable for testing the developed solution. The dataset contains a list of 78,300 fashion products sold on Amazon; for each of them, the main information made available by the seller is reported together with a measure of market success. This measure is derived from the ratings expressed by buyers and from the product's position in a ranking based on the number of units sold.
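The cross-modal classifier described above is a Vision-Language Transformer; as a much simpler structural stand-in, the PyTorch sketch below shows a late-fusion baseline that takes precomputed image and text embeddings and predicts a success class. The dimensions, names and number of classes are assumptions.

```python
import torch
from torch import nn

class CrossModalClassifier(nn.Module):
    """Minimal late-fusion baseline: project image and text embeddings into a
    shared space, concatenate, and classify into success buckets."""
    def __init__(self, img_dim=2048, txt_dim=768, hidden=256, n_classes=3):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, img_emb, txt_emb):
        z = torch.cat([self.img_proj(img_emb), self.txt_proj(txt_emb)], dim=-1)
        return self.head(z)

# Toy forward pass with random embeddings standing in for CNN / language-model features.
model = CrossModalClassifier()
img = torch.randn(4, 2048)
txt = torch.randn(4, 768)
logits = model(img, txt)
print(logits.shape)        # torch.Size([4, 3])
```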
APA, Harvard, Vancouver, ISO, and other styles
14

Karamichalis, Nikolaos. "Using Machine Learning techniques to understand glucose fluctuation in response to breathing signals". Thesis, Luleå tekniska universitet, Institutionen för system- och rymdteknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-87348.

Full text of the source
Abstract:
Blood glucose (BG) prediction and classification play a big role in diabetic patients' daily lives. According to the International Diabetes Federation (IDF), in 2019 463 million people were diabetic globally, and the projection for 2045 is that the number will rise to 700 million people. Continuous glucose monitoring (CGM) systems assist diabetic patients daily by alerting them continuously about fluctuations in their BG levels. The history of CGM systems started in 1999, when the Food and Drug Administration (FDA) approved the first CGM system, and continues to the present day, with the systems' reading accuracy and reporting delay continuously improving. CGM systems are key elements in closed-loop systems, which use BG monitoring in order to calculate and, under the patient's supervision, automatically deliver the needed insulin to the patient. Data quality and feature variation are essential for CGM systems; therefore many studies are being conducted in order to support the development and improvement of CGM systems and diabetics' daily lives. This thesis aims to show that physiological signals retrieved from various sensors can assist the classification and prediction of BG levels and, more specifically, that breathing rate can enhance the accuracy of CGM systems for diabetic patients as well as healthy individuals. The results showed that physiological data can improve the accuracy of prediction and classification of BG levels and improve the performance of CGM systems in classification and prediction tasks. Finally, future improvements could include the use of a predictive horizon (PH) for the data and also the selection and use of different models.
APA, Harvard, Vancouver, ISO, and other styles
15

Al, Tobi Amjad Mohamed. "Anomaly-based network intrusion detection enhancement by prediction threshold adaptation of binary classification models". Thesis, University of St Andrews, 2018. http://hdl.handle.net/10023/17050.

Full text of the source
Abstract:
Network traffic exhibits a high level of variability over short periods of time. This variability impacts negatively on the performance (accuracy) of anomaly-based network Intrusion Detection Systems (IDS) that are built using predictive models in a batch-learning setup. This thesis investigates how adapting the discriminating threshold of model predictions, specifically to the evaluated traffic, improves the detection rates of these Intrusion Detection models. Specifically, this thesis studied the adaptability features of three well known Machine Learning algorithms: C5.0, Random Forest, and Support Vector Machine. The ability of these algorithms to adapt their prediction thresholds was assessed and analysed under different scenarios that simulated real world settings using the prospective sampling approach. A new dataset (STA2018) was generated for this thesis and used for the analysis. This thesis has demonstrated empirically the importance of threshold adaptation in improving the accuracy of detection models when training and evaluation (test) traffic have different statistical properties. Further investigation was undertaken to analyse the effects of feature selection and data balancing processes on a model's accuracy when evaluation traffic with different significant features were used. The effects of threshold adaptation on reducing the accuracy degradation of these models was statistically analysed. The results showed that, of the three compared algorithms, Random Forest was the most adaptable and had the highest detection rates. This thesis then extended the analysis to apply threshold adaptation on sampled traffic subsets, by using different sample sizes, sampling strategies and label error rates. This investigation showed the robustness of the Random Forest algorithm in identifying the best threshold. The Random Forest algorithm only needed a sample that was 0.05% of the original evaluation traffic to identify a discriminating threshold with an overall accuracy rate of nearly 90% of the optimal threshold.
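The core idea of threshold adaptation, re-selecting the discriminating threshold on a small labelled sample of the evaluation traffic instead of keeping the training-time default, can be sketched as follows; the synthetic "drifted traffic", the sample size and the threshold grid are illustrative assumptions, not the thesis's STA2018 setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_traffic(n, shift):
    """Two-class data whose feature distribution drifts with `shift`."""
    X = rng.normal(size=(n, 5)) + shift
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, n) > shift).astype(int)
    return X, y

X_train, y_train = make_traffic(5000, shift=0.0)
X_eval,  y_eval  = make_traffic(5000, shift=0.7)     # statistically different traffic

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
proba = clf.predict_proba(X_eval)[:, 1]

# Default threshold vs. one adapted on a small labelled sample of the evaluation traffic.
sample = rng.choice(len(X_eval), size=50, replace=False)
thresholds = np.linspace(0.05, 0.95, 19)
adapted = max(thresholds,
              key=lambda t: accuracy_score(y_eval[sample], proba[sample] >= t))

print("accuracy @ 0.50   :", round(accuracy_score(y_eval, proba >= 0.5), 3))
print(f"accuracy @ {adapted:.2f} :", round(accuracy_score(y_eval, proba >= adapted), 3))
```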
APA, Harvard, Vancouver, ISO, and other styles
16

Ward, Neil M. "Tropical North African rainfall and worldwide monthly to multi-decadal climate variations : directed towards the development of a corrected ship wind dataset, and improved diagnosis, understanding and prediction of North African rainfall". Thesis, University of Reading, 1994. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.385252.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
17

Sengupta, Aritra. "Empirical Hierarchical Modeling and Predictive Inference for Big, Spatial, Discrete, and Continuous Data". The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1350660056.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
18

Chen, Yang. "Robust Prediction of Large Spatio-Temporal Datasets". Thesis, Virginia Tech, 2013. http://hdl.handle.net/10919/23098.

Full text of the source
Abstract:
This thesis describes a robust and efficient design of Student-t based Robust Spatio-Temporal Prediction, namely, St-RSTP, to provide estimation based on observations over spatio-temporal neighbors. It is crucial to many applications in geographical information systems, medical imaging, urban planning, economy study, and climate forecasting. The proposed St-RSTP is more resilient to outliers or other small departures from model assumptions than its ancestor, the Spatio-Temporal Random Effects (STRE) model. STRE is a statistical model with linear order complexity for processing large scale spatiotemporal data.

However, STRE has been shown to be sensitive to outliers or anomalous observations. In our design, the St-RSTP model assumes that the measurement error follows Student's t-distribution, instead of a traditional Gaussian distribution. To handle the analytically intractable inference of the Student's t model, we propose an approximate inference algorithm in the framework of Expectation Propagation (EP). Extensive experimental evaluations, based on both simulation and real-life data sets, demonstrated the robustness and the efficiency of our Student-t prediction model compared with the STRE model.
Master of Science
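St-RSTP replaces the Gaussian measurement error of STRE with a Student-t error and performs approximate inference with EP. The one-dimensional sketch below only illustrates why a Student-t likelihood is more outlier-resistant, using direct optimisation rather than EP; the data, the degrees of freedom and the search interval are assumptions.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)

# Observations around a true level of 10, with a few gross outliers.
y = np.concatenate([rng.normal(10.0, 1.0, 95),
                    np.array([60.0, 75.0, 80.0, 90.0, 100.0])])

def neg_loglik_gauss(mu):
    return -stats.norm.logpdf(y, loc=mu, scale=1.0).sum()

def neg_loglik_student(mu, df=3.0):
    return -stats.t.logpdf(y, df, loc=mu, scale=1.0).sum()

# Search interval (0, 30) is an assumption that comfortably contains both optima.
mu_gauss = optimize.minimize_scalar(neg_loglik_gauss, bounds=(0.0, 30.0), method="bounded").x
mu_t = optimize.minimize_scalar(neg_loglik_student, bounds=(0.0, 30.0), method="bounded").x

print("Gaussian-likelihood estimate :", round(mu_gauss, 2))   # pulled toward the outliers
print("Student-t likelihood estimate:", round(mu_t, 2))       # stays near 10
```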
APA, Harvard, Vancouver, ISO, and other styles
19

Schöner, Holger. "Working with real world datasets: preprocessing and prediction with large incomplete and heterogeneous datasets". [S.l.] : [s.n.], 2005. http://deposit.ddb.de/cgi-bin/dokserv?idn=973424672.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
20

Vagh, Yunous. "Mining climate data for shire level wheat yield predictions in Western Australia". Thesis, Edith Cowan University, Research Online, Perth, Western Australia, 2013. https://ro.ecu.edu.au/theses/695.

Full text of the source
Abstract:
Climate change and the reduction of available agricultural land are two of the most important factors that affect global food production especially in terms of wheat stores. An ever increasing world population places a huge demand on these resources. Consequently, there is a dire need to optimise food production. Estimations of crop yield for the South West agricultural region of Western Australia have usually been based on statistical analyses by the Department of Agriculture and Food in Western Australia. Their estimations involve a system of crop planting recommendations and yield prediction tools based on crop variety trials. However, many crop failures arise from adherence to these crop recommendations by farmers that were contrary to the reported estimations. Consequently, the Department has sought to investigate new avenues for analyses that improve their estimations and recommendations. This thesis explores a new approach in the way analyses are carried out. This is done through the introduction of new methods of analyses such as data mining and online analytical processing in the strategy. Additionally, this research attempts to provide a better understanding of the effects of both gradual variation parameters such as soil type, and continuous variation parameters such as rainfall and temperature, on the wheat yields. The ultimate aim of the research is to enhance the prediction efficiency of wheat yields. The task was formidable due to the complex and dichotomous mixture of gradual and continuous variability data that required successive information transformations. It necessitated the progressive moulding of the data into useful information, practical knowledge and effective industry practices. Ultimately, this new direction is to improve the crop predictions and to thereby reduce crop failures. The research journey involved data exploration, grappling with the complexity of Geographic Information System (GIS), discovering and learning data compatible software tools, and forging an effective processing method through an iterative cycle of action research experimentation. A series of trials was conducted to determine the combined effects of rainfall and temperature variations on wheat crop yields. These experiments specifically related to the South Western Agricultural region of Western Australia. The study focused on wheat producing shires within the study area. The investigations involved a combination of macro and micro analyses techniques for visual data mining and data mining classification techniques, respectively. The research activities revealed that wheat yield was most dependent upon rainfall and temperature. In addition, it showed that rainfall cyclically affected the temperature and soil type due to the moisture retention of crop growing locations. Results from the regression analyses, showed that the statistical prediction of wheat yields from historical data, may be enhanced by data mining techniques including classification. The main contribution to knowledge as a consequence of this research was the provision of an alternate and supplementary method of wheat crop prediction within the study area. Another contribution was the division of the study area into a GIS surface grid of 100 hectare cells upon which the interpolated data was projected. 
Furthermore, the proposed framework within this thesis offers other researchers, with similarly structured complex data, the benefits of a general processing pathway to enable them to navigate their own investigations through variegated analytical exploration spaces. In addition, it offers insights and suggestions for future directions in other contextual research explorations.
APA, Harvard, Vancouver, ISO, and other styles
21

Velecký, Jan. "Predikce vlivu mutace na rozpustnost proteinů". Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2020. http://www.nusl.cz/ntk/nusl-417288.

Full text of the source
Abstract:
The goal of the thesis is to create a predictor of the effect of a mutation on protein solubility, given the protein's initial 3D structure. Protein solubility prediction is a bioinformatics problem which is still considered unsolved; prediction from the 3D structure, in particular, has not gained much attention yet. Relevant background on proteins, protein solubility and existing predictors is included in the text. The principle of the designed predictor is inspired by the Surface Patches article, and therefore it also aims to validate the results achieved by its authors. The designed tool uses changes in the positive regions of the electric potential above the protein's surface to make a prediction. The tool has been successfully implemented and a series of computationally expensive experiments has been performed. It was shown that the electric potential, and hence the predictor itself, can be used successfully only for a limited set of proteins. On top of that, the method used in the article correlates with a much simpler variable: the protein's net charge.
APA, Harvard, Vancouver, ISO, and other styles
22

Giommi, Luca. "Predicting CMS datasets popularity with machine learning". Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2015. http://amslaurea.unibo.it/9136/.

Full text of the source
Abstract:
A Data Analytics project has been launched in CMS and, within it, a specific pilot activity that aims to exploit Machine Learning techniques to predict the popularity of CMS datasets. This is a very delicate observable, whose prediction would allow CMS to build smarter data placement models, achieve broad optimisations in the use of storage at all Tier levels, and would lay the foundation for the introduction of a solid dynamic and adaptive data management system. This thesis describes the work done with a new pilot prototype called DCAFPilot, written entirely in Python, to tackle this challenge.
APA, Harvard, Vancouver, ISO, and other styles
23

Chen, Linchao. "Predictive Modeling of Spatio-Temporal Datasets in High Dimensions". The Ohio State University, 2015. http://rave.ohiolink.edu/etdc/view?acc_num=osu1429586479.

Full text of the source
APA, Harvard, Vancouver, ISO, and other styles
24

Yang, Chaozheng. "Sufficient Dimension Reduction in Complex Datasets". Diss., Temple University Libraries, 2016. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/404627.

Full text of the source
Abstract:
Statistics
Ph.D.
This dissertation focuses on two problems in dimension reduction. One is using a permutation approach to test predictor contributions. The permutation approach applies to marginal coordinate tests based on dimension reduction methods such as SIR, SAVE and DR. This approach no longer requires calculation of the method-specific weights to determine the asymptotic null distribution. The other is combining a clustering method with robust regression (least absolute deviation) to estimate the dimension reduction subspace. Compared with ordinary least squares, the proposed method is more robust to outliers; also, this method replaces the global linearity assumption with the more flexible local linearity assumption through k-means clustering.
Temple University--Theses
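The permutation idea behind the first contribution can be illustrated generically: permute one predictor, recompute a fit statistic, and compare with the observed value to obtain a p-value. The sketch below uses a plain R-squared gain in linear regression as the statistic, which is a simplification of the marginal coordinate tests (SIR/SAVE/DR) treated in the dissertation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression: y depends on x1 and x2, not on x3.
n = 300
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 1.0, n)

def r_squared(X, y):
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1.0 - resid.var() / y.var()

def contribution(X, y, j):
    """Gain in R^2 from predictor j over the model without it."""
    return r_squared(X, y) - r_squared(np.delete(X, j, axis=1), y)

def permutation_pvalue(X, y, j, n_perm=500):
    """Permute column j to build a null distribution of its contribution."""
    observed = contribution(X, y, j)
    count = 0
    for _ in range(n_perm):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])
        count += contribution(Xp, y, j) >= observed
    return (count + 1) / (n_perm + 1)

for j in range(3):
    print(f"x{j + 1}: p = {permutation_pvalue(X, y, j):.3f}")
```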
APA, Harvard, Vancouver, ISO, and other styles
25

Van Koten, Chikako. "Bayesian statistical models for predicting software effort using small datasets". University of Otago. Department of Information Science, 2007. http://adt.otago.ac.nz./public/adt-NZDU20071009.120134.

Full text of the source
Abstract:
The need of today's society for new technology has resulted in the development of a growing number of software systems. Developing a software system is a complex endeavour that requires a large amount of time. This amount of time is referred to as software development effort. Software development effort is the sum of hours spent by all individuals involved; therefore, it is not equal to the duration of the development. Accurate prediction of the effort at an early stage of development is an important factor in the successful completion of a software system, since it enables the developing organization to allocate and manage their resources effectively. However, for many software systems, accurately predicting the effort is a challenge. Hence, a model that assists in the prediction is of active interest to software practitioners and researchers alike. Software development effort varies depending on many variables that are specific to the system, its developmental environment and the organization in which it is being developed. An accurate model for predicting software development effort can often be built specifically for the target system and its developmental environment. A local dataset of systems similar to the target system, developed in a similar environment, is then used to calibrate the model. However, such a dataset often consists of fewer than 10 software systems, causing a serious problem in the prediction, since the predictive accuracy of existing models deteriorates as the size of the dataset decreases. This research addressed this problem with a new approach using Bayesian statistics. This particular approach was chosen since the predictive accuracy of a Bayesian statistical model is not as dependent on a large dataset as that of other models. As the size of the dataset decreases to fewer than 10 software systems, the accuracy deterioration of the model is expected to be less than that of existing models. The Bayesian statistical model can also provide additional information useful for predicting software development effort, because it is also capable of selecting important variables from multiple candidates. In addition, it is parametric and produces an uncertainty estimate. This research developed new Bayesian statistical models for predicting software development effort. Their predictive accuracy was then evaluated in four case studies using different datasets, and compared with other models applicable to the same small datasets. The results have confirmed that the best new models are not only accurate but also consistently more accurate than their regression counterpart when calibrated with fewer than 10 systems. They can thus replace the regression model when using small datasets. Furthermore, one case study has shown that the best new models are more accurate than a simple model that predicts the effort by calculating the average value of the calibration data. Two case studies have also indicated that the best new models can be more accurate for some software systems than a case-based reasoning model. Since the case studies provided sufficient empirical evidence that the new models are generally more accurate than the existing models compared, in the case of small datasets, this research has produced a methodology for predicting software development effort using the new models.
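As a generic illustration of why a Bayesian treatment helps with very small calibration sets, the sketch below fits a conjugate Bayesian linear regression to eight made-up past projects and reports a posterior predictive mean with an uncertainty interval; the prior precision, noise level and predictors are assumptions, and this is not the dissertation's specific model family.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny "local dataset": 8 past projects, two predictors (e.g. size, team size), effort in hours.
X = np.column_stack([np.ones(8), rng.uniform(1, 10, 8), rng.uniform(1, 6, 8)])
effort = 120 * X[:, 1] + 80 * X[:, 2] + rng.normal(0, 150, 8)

# Conjugate Bayesian linear regression with a zero-mean Gaussian prior on the
# coefficients (precision tau) and an assumed known noise variance sigma2.
tau, sigma2 = 1e-4, 150.0 ** 2
S_inv = tau * np.eye(X.shape[1]) + X.T @ X / sigma2       # posterior precision
S = np.linalg.inv(S_inv)
m = S @ (X.T @ effort / sigma2)                            # posterior mean

x_new = np.array([1.0, 7.5, 3.0])                          # a new project
pred_mean = x_new @ m
pred_sd = np.sqrt(sigma2 + x_new @ S @ x_new)              # posterior predictive sd

print(f"predicted effort: {pred_mean:.0f} h  (+/- {1.96 * pred_sd:.0f} h, 95% interval)")
```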
APA, Harvard, Vancouver, ISO, and other styles
26

Veganzones, David. "Corporate failure prediction models : contributions from a novel explanatory variable and imbalanced datasets approach". Thesis, Lille, 2018. http://www.theses.fr/2018LIL1A004.

Full text of the source
Abstract:
This dissertation explores novel approaches to developing corporate failure prediction models. The thesis covers three new areas of intervention. The first is a novel explanatory variable based on earnings management. For this purpose, we use two measures (accruals and real activities) that assess potential earnings manipulation. We show that models which include this novel variable in combination with financial information are more accurate than those relying only on financial data. The second analyzes the capacity of corporate failure models on imbalanced datasets. We relate the different degrees of imbalance, the loss of performance and the performance recovery capacity, which have never been studied in corporate failure prediction. The third unifies the previous areas by evaluating the capacity of our proposed earnings management model on imbalanced datasets. The research covered in this thesis provides unique and relevant contributions to the corporate finance literature, especially to the corporate failure domain.
APA, Harvard, Vancouver, ISO, and other styles
27

Bsoul, Abed Al-Raoof. "PROCESSING AND CLASSIFICATION OF PHYSIOLOGICAL SIGNALS USING WAVELET TRANSFORM AND MACHINE LEARNING ALGORITHMS". VCU Scholars Compass, 2011. http://scholarscompass.vcu.edu/etd/258.

Full text of the source
Abstract:
Over the last century, physiological signals have been broadly analyzed and processed not only to assess the function of the human physiology, but also to better diagnose illnesses or injuries and provide treatment options for patients. In particular, the electrocardiogram (ECG), blood pressure (BP) and impedance are among the most important biomedical signals processed and analyzed. The majority of studies that utilize these signals attempt to diagnose important irregularities, such as arrhythmia or blood loss, by processing one of these signals. However, the relationship between them is not yet fully studied using computational methods. Therefore, a system that extracts and combines features from all physiological signals representative of states such as arrhythmia and loss of blood volume, in order to predict the presence and the severity of such complications, is of paramount importance for care givers. This will not only enhance diagnostic methods, but also enable physicians to make more accurate decisions; thereby the overall quality of care provided to patients will improve significantly. In the first part of the dissertation, the analysis and processing of the ECG signal to detect the most important waves, i.e. P, QRS, and T, is described. A wavelet-based method is implemented to facilitate and enhance the detection process. The method not only provides high detection accuracy, but is also efficient with regard to memory and execution time. In addition, the method is robust against noise and baseline drift, as supported by the results. The second part outlines a method that extracts features from the ECG signal in order to classify and predict the severity of arrhythmia. Arrhythmia can be life-threatening or benign. Several methods exist to detect abnormal heartbeats; however, a clear criterion to identify whether the detected arrhythmia is malignant or benign is still an open problem. The method discussed in this dissertation addresses a novel solution to this important issue. In the third part, a classification model that predicts the severity of loss of blood volume by incorporating multiple physiological signals is elaborated. The features are extracted in the time and frequency domains after transforming the signals with the Wavelet Transform (WT). The results support the desirable reliability and accuracy of the system.
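A compact sketch in the spirit of the wavelet-based detection step: decompose a synthetic ECG-like trace with a discrete wavelet transform (PyWavelets), suppress the baseline and the finest-scale noise, reconstruct, and pick peaks with a refractory period. The wavelet, decomposition level, thresholds and synthetic signal are illustrative, not the dissertation's tuned detector.

```python
import numpy as np
import pywt
from scipy.signal import find_peaks

fs = 360                                   # sampling rate (Hz), typical of public ECG databases
t = np.arange(0, 10, 1 / fs)

# Synthetic ECG-like trace: one sharp "beat" per second plus baseline wander and noise.
ecg = np.zeros_like(t)
ecg[fs // 2 :: fs] = 1.0
ecg = np.convolve(ecg, np.hanning(9), mode="same")          # widen spikes into QRS-like bumps
ecg += 0.3 * np.sin(2 * np.pi * 0.3 * t)
ecg += 0.05 * np.random.default_rng(0).normal(size=t.size)

# DWT decomposition; suppress the approximation (baseline) and the finest details (noise).
coeffs = pywt.wavedec(ecg, "db4", level=5)
coeffs[0] = np.zeros_like(coeffs[0])        # approximation -> baseline wander
coeffs[-1] = np.zeros_like(coeffs[-1])      # finest detail -> high-frequency noise
qrs_band = pywt.waverec(coeffs, "db4")[: ecg.size]

# Peak picking on the squared band-limited signal with a 250 ms refractory period.
energy = qrs_band ** 2
peaks, _ = find_peaks(energy, height=0.3 * energy.max(), distance=int(0.25 * fs))
print("detected beats:", len(peaks), "expected:", 10)
```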
APA, Harvard, Vancouver, ISO, and other styles
28

Granato, Italo Stefanine Correia. "snpReady and BGGE: R packages to prepare datasets and perform genome-enabled predictions". Universidade de São Paulo, 2018. http://www.teses.usp.br/teses/disponiveis/11/11137/tde-21062018-134207/.

Full text of the source
Abstract:
The use of molecular markers allows an increase in the efficiency of selection as well as a better understanding of genetic resources in breeding programs. However, with the increase in the number of markers, the data need to be processed before they are ready to use. Also, to explore genotype × environment (GE) interaction in the context of genomic prediction, some covariance matrices need to be set up before the prediction step. Thus, aiming to facilitate the introduction of genomic practices into breeding program pipelines, we developed two R packages. The first is called snpReady, which prepares data sets for genomic studies. This package offers three functions to reach this objective, from organizing the data and applying quality control, to building the genomic relationship matrix and providing a summary of population genetics. Furthermore, we present a new imputation method for missing markers. The second is the BGGE package, built to generate kernels for some GE genomic models and perform predictions. It consists of two functions (getK and BGGE). The former is helpful for creating kernels for the GE genomic models, and the latter performs genomic predictions, with some features for GE kernels that decrease the computational time. The features covered in the two packages present a fast and straightforward option to help the introduction and usage of genomic analysis in the breeding program pipeline.
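snpReady and BGGE are R packages; purely to illustrate two of the preparation steps they automate, the numpy sketch below performs naive per-marker mean imputation (snpReady implements its own, more elaborate imputation method) and builds a VanRaden-type genomic relationship matrix from a toy marker matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy marker matrix (0/1/2 allele counts) with ~5% missing calls encoded as NaN.
n, p = 50, 200
M = rng.integers(0, 3, size=(n, p)).astype(float)
M[rng.random((n, p)) < 0.05] = np.nan

# Step 1: naive per-marker mean imputation (placeholder for the package's method).
col_mean = np.nanmean(M, axis=0)
M_imp = np.where(np.isnan(M), col_mean, M)

# Step 2: VanRaden genomic relationship matrix G = ZZ' / (2 * sum p_j (1 - p_j)).
freq = M_imp.mean(axis=0) / 2.0
Z = M_imp - 2.0 * freq
G = Z @ Z.T / (2.0 * np.sum(freq * (1.0 - freq)))

print("G shape:", G.shape, " mean diagonal:", round(np.mean(np.diag(G)), 3))
```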
Style APA, Harvard, Vancouver, ISO itp.
29

Gorshechnikova, Anastasiia. "Likelihood approximation and prediction for large spatial and spatio-temporal datasets using H-matrix approach". Doctoral thesis, Università degli studi di Padova, 2019. http://hdl.handle.net/11577/3425427.

Pełny tekst źródła
Streszczenie:
The Gaussian distribution is the most fundamental distribution in statistics. However, many applications of Gaussian random fields (GRFs) are limited by the computational complexity associated with the evaluation of probability density functions. In particular, large datasets with N irregularly located spatial (or spatio-temporal) observations are difficult to handle for several applications of GRFs, such as maximum likelihood estimation (MLE) and kriging prediction. This is because computing the inverse of the dense covariance matrix requires O(N^3) floating-point operations in the spatial or spatio-temporal context. For relatively large N the exact computation becomes infeasible and alternative methods are necessary. Several approaches have been proposed to tackle this problem. Most assume a specific form for the spatial (or spatio-temporal) covariance function and use different methods to approximate the resulting covariance matrix. We aim at approximating covariance functions in a format that facilitates the computation of MLE and kriging prediction with very large spatial and spatio-temporal datasets. For a sufficiently general class of spatial and a specific class of spatio-temporal covariance functions, a methodology is developed using a hierarchical matrix (H-matrix) approach. Since this method was originally created for the approximation of dense matrices arising from partial differential and integral equations, a theoretical framework is formulated in terms of Stochastic Partial Differential Equations (SPDEs). The application of this technique is detailed for covariance functions of GRFs obtained as solutions to SPDEs. The approximation of the covariance matrix in such a low-rank format allows matrix-vector products and matrix factorisations to be computed at log-linear cost, followed by efficient MLE and kriging prediction. Numerical studies are provided for spatial and spatio-temporal datasets, and the H-matrix approach is compared with other methods in terms of computational and statistical efficiency.
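For context, the O(N^3) bottleneck mentioned above comes from factorising the dense N x N covariance matrix inside the Gaussian log-likelihood. A small Python sketch of the exact (non-approximated) computation, with an assumed exponential covariance, makes that cost explicit; hierarchical-matrix methods replace exactly this dense factorisation.

    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.linalg import cho_factor, cho_solve

    def exponential_cov(locs, sigma2=1.0, phi=0.3):
        """Dense exponential covariance matrix over N spatial locations."""
        return sigma2 * np.exp(-cdist(locs, locs) / phi)

    def gaussian_loglik(y, locs, sigma2, phi, nugget=1e-6):
        """Exact GRF log-likelihood; the Cholesky factorisation is the
        O(N^3) step that H-matrix approximations avoid."""
        C = exponential_cov(locs, sigma2, phi) + nugget * np.eye(len(y))
        L, lower = cho_factor(C, lower=True)
        alpha = cho_solve((L, lower), y)
        logdet = 2.0 * np.sum(np.log(np.diag(L)))
        return -0.5 * (y @ alpha + logdet + len(y) * np.log(2 * np.pi))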
Style APA, Harvard, Vancouver, ISO itp.
30

Rivera, Steven Anthony. "BeatDB v3 : a framework for the creation of predictive datasets from physiological signals". Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/113114.

Pełny tekst źródła
Streszczenie:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 101-104).
BeatDB is a framework for fast processing and analysis of physiological data, such as arterial blood pressure (ABP) or electrocardiograms (ECG). BeatDB takes such data as input and processes it for machine learning analytics in multiple stages. It offers beat and onset detection, feature extraction for beats and groups of beats over one or more signal channels and over the time domain, and an extraction step focused on finding condition windows and aggregate features within them. BeatDB has gone through multiple iterations, with its initial version running as a collection of single-use MATLAB and Python scripts on VM instances in OpenStack and its second version (known as PhysioMiner) acting as a cohesive and modular cloud system on Amazon Web Services in Java. The goal of this project is primarily to modify BeatDB to support multi-channel waveform data, such as EEG and accelerometer data, and to make the project more flexible to modification by researchers. Major software development tasks included rewriting condition detection to find windows in valid beat groups only, refactoring and writing new code to extract features and prepare training data for multi-channel signals, and fully redesigning and reimplementing BeatDB in Python, focusing on optimization and simplicity based on probable use cases of BeatDB. BeatDB v3 has become more accurate in the datasets it generates, more usable for both developer and non-developer users, and more efficient in both performance and design than previous iterations, achieving an average AUROC increase of over 4% when comparing specific iterations.
by Steven Anthony Rivera.
M. Eng.
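A hedged sketch of the kind of condition-window aggregation the abstract's extraction step describes is shown below; the window length, the aggregate statistics, and the function name are assumptions for illustration, not BeatDB's actual API.

    import numpy as np

    def window_aggregates(beat_times, beat_features, window_s=60.0):
        """Aggregate per-beat feature vectors into fixed-length windows,
        the kind of 'condition window' rows such a pipeline could feed to
        a classifier (mean/std/min/max per feature per window)."""
        t0, t1 = beat_times.min(), beat_times.max()
        rows = []
        for start in np.arange(t0, t1, window_s):
            mask = (beat_times >= start) & (beat_times < start + window_s)
            if not mask.any():
                continue
            w = beat_features[mask]
            rows.append(np.concatenate([w.mean(0), w.std(0), w.min(0), w.max(0)]))
        return np.vstack(rows)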
Style APA, Harvard, Vancouver, ISO itp.
31

Yasarer, Hakan. "Decision making in engineering prediction systems". Diss., Kansas State University, 2013. http://hdl.handle.net/2097/16231.

Pełny tekst źródła
Streszczenie:
Doctor of Philosophy
Department of Civil Engineering
Yacoub M. Najjar
Access to databases has become easier after the digital revolution because large databases are progressively available. Knowledge discovery in these databases via intelligent data analysis technology is a relatively young and interdisciplinary field. In engineering applications, there is a demand for turning low-level data-based knowledge into high-level knowledge via the use of various data analysis methods. The main reason for this demand is that collecting and analyzing databases can be expensive and time consuming. In cases where experimental or empirical data are already available, prediction models can be used to characterize the desired engineering phenomena and/or eliminate unnecessary future experiments and their associated costs. Phenomena characterization based on available databases has been carried out via Artificial Neural Networks (ANNs) for more than two decades. However, there is a need to introduce new paradigms to improve the reliability of the available ANN models and optimize their predictions through a hybrid decision system. In this study, a new set of ANN modeling approaches/paradigms, along with a new method to tackle partially missing data (the Query method), are introduced for this purpose. The potential use of these methods via a hybrid decision making system is examined by utilizing seven available databases obtained from civil engineering applications. Overall, the new proposed approaches have shown notable prediction accuracy improvements on the seven databases in terms of quantified statistical accuracy measures. The proposed new methods are capable of effectively characterizing the general behavior of a specific engineering/scientific phenomenon and can be collectively used to optimize predictions with a reasonable degree of accuracy. The proposed hybrid decision making system (HDMS), implemented in an Excel-based environment, can easily be applied by the end user to any available data-rich database, without the need for any excessive type of training.
Style APA, Harvard, Vancouver, ISO itp.
32

Zhu, Cheng. "Efficient network based approaches for pattern recognition and knowledge discovery from large and heterogeneous datasets". University of Cincinnati / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1378215769.

Pełny tekst źródła
Style APA, Harvard, Vancouver, ISO itp.
33

Dilthey, Alexander Tilo. "Statistical HLA type imputation from large and heterogeneous datasets". Thesis, University of Oxford, 2012. http://ora.ox.ac.uk/objects/uuid:1bca18bf-b9d5-4777-b58e-a0dca4c9dbea.

Pełny tekst źródła
Streszczenie:
An individual's Human Leukocyte Antigen (HLA) type is an essential immunogenetic parameter, influencing susceptibility to a variety of autoimmune and infectious diseases, to certain types of cancer and the likelihood of adverse drug reactions. I present and evaluate two models for the accurate statistical determination of HLA types for single-population and multi-population studies, based on SNP genotypes. Importantly, SNP genotypes are already available for many studies, so that the application of the statistical methods presented here does not incur any extra cost besides computing time. HLA*IMP:01 is based on a parallelized and modified version of LDMhc (Leslie et al., 2008), enabling the processing of large reference panels and improving call rates. In a homogeneous single-population imputation scenario on a mainly British dataset, it achieves accuracies (posterior predictive values) and call rates >=88% at all classical HLA loci (HLA-A, HLA-B, HLA-C, HLA-DQA1, HLA-DQB1, HLA-DRB1) at 4-digit HLA type resolution. HLA*IMP:02 is specifically designed to deal with multi-population heterogeneous reference panels and based on a new algorithm to construct haplotype graph models that takes into account haplotype estimate uncertainty, allows for missing data and enables the inclusion of prior knowledge on linkage disequilibrium. It works as well as HLA*IMP:01 on homogeneous panels and substantially outperforms it in more heterogeneous scenarios. In a cross-European validation experiment, even without setting a call threshold, HLA*IMP:02 achieves an average accuracy of 96% at 4-digit resolution (>=91% for all loci, which is achieved at HLA-DRB1). HLA*IMP:02 can accurately predict structural variation (DRB paralogs), can (to an extent) detect errors in the reference panel and is highly tolerant of missing data. I demonstrate that a good match between imputation and reference panels in terms of principal components and reference panel size are essential determinants of high imputation accuracy under HLA*IMP:02.
Style APA, Harvard, Vancouver, ISO itp.
34

Duncan, Andrew Paul. "The analysis and application of artificial neural networks for early warning systems in hydrology and the environment". Thesis, University of Exeter, 2014. http://hdl.handle.net/10871/17569.

Pełny tekst źródła
Streszczenie:
Artificial Neural Networks (ANNs) have been comprehensively researched, both from a computer scientific perspective and with regard to their use for predictive modelling in a wide variety of applications including hydrology and the environment. Yet their adoption for live, real-time systems remains on the whole sporadic and experimental. A plausible hypothesis is that this may be at least in part due to their treatment heretofore as "black boxes" that implicitly contain something that is unknown, or even unknowable. It is understandable that many of those responsible for delivering Early Warning Systems (EWS) might not wish to take the risk of implementing solutions perceived as containing unknown elements, despite the computational advantages that ANNs offer. This thesis therefore builds on existing efforts to open the box and develop tools and techniques that visualise, analyse and use ANN weights and biases, especially from the viewpoint of neural pathways from inputs to outputs of feedforward networks. In so doing, it aims to demonstrate novel approaches to self-improving predictive model construction for both regression and classification problems. This includes Neural Pathway Strength Feature Selection (NPSFS), which uses ensembles of ANNs trained on differing subsets of data and analysis of the learnt weights to infer degrees of relevance of the input features, and so build simplified models with reduced input feature sets. Case studies are carried out for prediction of flooding at multiple nodes in urban drainage networks located in three urban catchments in the UK, which demonstrate rapid, accurate prediction of flooding both for regression and classification. Predictive skill is shown to decline beyond the time of concentration of each sewer node when actual rainfall is used as input to the models. Further case studies model and predict statutory bacteria count exceedances for bathing water quality compliance at 5 beaches in Southwest England. An illustrative case study using a forest fires dataset from the UCI machine learning repository is also included. Results from these model ensembles generally exhibit improved performance when compared with single ANN models. Also, ensembles with reduced input feature sets obtained using NPSFS demonstrate performance as good as or better than the full-feature-set models. Conclusions are drawn about a new set of tools and techniques, including NPSFS and visualisation techniques for inspection of ANN weights, the adoption of which it is hoped may lead to improved confidence in the use of ANNs for live real-time EWS applications.
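The NPSFS idea of reading feature relevance off the learnt weights can be illustrated with a simplified Python sketch; propagating absolute weight magnitudes along all input-to-output pathways is one plausible reading of "neural pathway strength", not the thesis's exact formulation.

    import numpy as np

    def pathway_strengths(weight_matrices):
        """Pathway-style feature relevance for a feedforward net: chain the
        absolute weight matrices from inputs to outputs and sum over outputs."""
        strength = np.abs(weight_matrices[0])
        for W in weight_matrices[1:]:
            strength = strength @ np.abs(W)
        return strength.sum(axis=1)          # one relevance score per input feature

    # toy usage: 5 inputs -> 8 hidden -> 1 output
    rng = np.random.default_rng(0)
    W1, W2 = rng.normal(size=(5, 8)), rng.normal(size=(8, 1))
    relevance = pathway_strengths([W1, W2])
    ranked_features = np.argsort(relevance)[::-1]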
Style APA, Harvard, Vancouver, ISO itp.
35

Chen, Kunru. "Recurrent Neural Networks for Fault Detection : An exploratory study on a dataset about air compressor failures of heavy duty trucks". Thesis, Högskolan i Halmstad, Akademin för informationsteknologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-38184.

Pełny tekst źródła
Style APA, Harvard, Vancouver, ISO itp.
36

Mauricio-Sanchez, David, Andrade Lopes Alneu de i Higuihara Juarez Pedro Nelson. "Approaches based on tree-structures classifiers to protein fold prediction". Institute of Electrical and Electronics Engineers Inc, 2017. http://hdl.handle.net/10757/622536.

Pełny tekst źródła
Streszczenie:
The full text of this work is not available in the UPC Academic Repository due to restrictions imposed by the publisher where it has been published.
Protein fold recognition is an important task in the biological area. Different machine learning methods, such as multiclass classifiers, one-vs-all and ensemble nested dichotomies, have been applied to this task and, in most cases, multiclass approaches were used. In this paper, we compare classifiers organized in tree structures to classify folds. We used a benchmark dataset containing 125 features to predict folds, comparing different supervised methods and achieving 54% accuracy. An approach based on tree-structured classifiers obtained better results in comparison with a hierarchical approach.
Peer reviewed
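A minimal sketch of the kind of comparison the abstract describes, a flat multiclass classifier against a one-vs-all decomposition, might look as follows with scikit-learn; the digits dataset is only a stand-in for the 125-feature protein fold benchmark, and the model choices are assumptions.

    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import LinearSVC

    # stand-in multiclass data; the paper used a 125-feature fold benchmark
    X, y = load_digits(return_X_y=True)

    candidates = {
        "flat multiclass forest": RandomForestClassifier(n_estimators=200, random_state=0),
        "one-vs-all linear SVM": OneVsRestClassifier(LinearSVC(dual=False)),
    }
    for name, clf in candidates.items():
        score = cross_val_score(clf, X, y, cv=5).mean()
        print(f"{name}: {score:.3f}")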
Style APA, Harvard, Vancouver, ISO itp.
37

Bloodgood, Michael. "Active learning with support vector machines for imbalanced datasets and a method for stopping active learning based on stabilizing predictions". Access to citation, abstract and download form provided by ProQuest Information and Learning Company; downloadable PDF file, 200 p, 2009. http://proquest.umi.com/pqdweb?did=1818417671&sid=1&Fmt=2&clientId=8331&RQT=309&VName=PQD.

Pełny tekst źródła
Style APA, Harvard, Vancouver, ISO itp.
38

PETRINI, ALESSANDRO. "HIGH PERFORMANCE COMPUTING MACHINE LEARNING METHODS FOR PRECISION MEDICINE". Doctoral thesis, Università degli Studi di Milano, 2021. http://hdl.handle.net/2434/817104.

Pełny tekst źródła
Streszczenie:
La Medicina di Precisione (Precision Medicine) è un nuovo paradigma che sta rivoluzionando diversi aspetti delle pratiche cliniche: nella prevenzione e diagnosi, essa è caratterizzata da un approccio diverso dal "one size fits all" proprio della medicina classica. Lo scopo delle Medicina di Precisione è di trovare misure di prevenzione, diagnosi e cura che siano specifiche per ciascun individuo, a partire dalla sua storia personale, stile di vita e fattori genetici. Tre fattori hanno contribuito al rapido sviluppo della Medicina di Precisione: la possibilità di generare rapidamente ed economicamente una vasta quantità di dati omici, in particolare grazie alle nuove tecniche di sequenziamento (Next-Generation Sequencing); la possibilità di diffondere questa enorme quantità di dati grazie al paradigma "Big Data"; la possibilità di estrarre da questi dati tutta una serie di informazioni rilevanti grazie a tecniche di elaborazione innovative ed altamente sofisticate. In particolare, le tecniche di Machine Learning introdotte negli ultimi anni hanno rivoluzionato il modo di analizzare i dati: esse forniscono dei potenti strumenti per l'inferenza statistica e l'estrazione di informazioni rilevanti dai dati in maniera semi-automatica. Al contempo, però, molto spesso richiedono elevate risorse computazionali per poter funzionare efficacemente. Per questo motivo, e per l'elevata mole di dati da elaborare, è necessario sviluppare delle tecniche di Machine Learning orientate al Big Data che utilizzano espressamente tecniche di High Performance Computing, questo per poter sfruttare al meglio le risorse di calcolo disponibili e su diverse scale, dalle singole workstation fino ai super-computer. In questa tesi vengono presentate tre tecniche di Machine Learning sviluppate nel contesto del High Performance Computing e create per affrontare tre questioni fondamentali e ancora irrisolte nel campo della Medicina di Precisione, in particolare la Medicina Genomica: i) l'identificazione di varianti deleterie o patogeniche tra quelle neutrali nelle aree non codificanti del DNA; ii) l'individuazione della attività delle regioni regolatorie in diverse linee cellulari e tessuti; iii) la predizione automatica della funzione delle proteine nel contesto di reti biomolecolari. Per il primo problema è stato sviluppato parSMURF, un innovativo metodo basato su hyper-ensemble in grado di gestire l'elevato grado di sbilanciamento che caratterizza l'identificazione di varianti patogeniche e deleterie in mezzo al "mare" di varianti neutrali nelle aree non-coding del DNA. L'algoritmo è stato implementato per sfruttare appositamente le risorse di supercalcolo del CINECA (Marconi - KNL) e HPC Center Stuttgart (HLRS Apollo HAWK), ottenendo risultati allo stato dell'arte, sia per capacità predittiva, sia per scalabilità. Il secondo problema è stato affrontato tramite lo sviluppo di reti neurali "deep", in particolare Deep Feed Forward e Deep Convolutional Neural Networks per analizzare - rispettivamente - dati di natura epigenetica e sequenze di DNA, con lo scopo di individuare promoter ed enhancer attivi in linee cellulari e tessuti specifici. L'analisi è compiuta "genome-wide" e sono state usate tecniche di parallelizzazione su GPU. 
Infine, per il terzo problema è stato sviluppato un algoritmo di Machine Learning semi-supervisionato su grafo basato su reti di Hopfield per elaborare efficacemente grandi network biologici, utilizzando ancora tecniche di parallelizzazione su GPU; in particolare, una parte rilevante dell'algoritmo è data dall'introduzione di una tecnica parallela di colorazione del grafo che migliora il classico approccio greedy introdotto da Luby. Tra i futuri lavori e le attività in corso, viene presentato il progetto inerente all'estensione di parSMURF che è stato recentemente premiato dal consorzio Partnership for Advance in Computing in Europe (PRACE) allo scopo di sviluppare ulteriormente l'algoritmo e la sua implementazione, applicarlo a dataset di diversi ordini di grandezza più grandi e inserire i risultati in Genomiser, lo strumento attualmente allo stato dell'arte per l'individuazione di varianti genetiche Mendeliane. Questo progetto è inserito nel contesto di una collaborazione internazionale con i Jackson Lab for Genomic Medicine.
Precision Medicine is a new paradigm which is reshaping several aspects of clinical practice, representing a major departure from the "one size fits all" approach to diagnosis and prevention featured in classical medicine. Its main goal is to find personalized prevention measures and treatments, on the basis of the personal history, lifestyle and specific genetic factors of each individual. Three factors contributed to the rapid rise of Precision Medicine approaches: the ability to quickly and cheaply generate a vast amount of biological and omics data, mainly thanks to Next-Generation Sequencing; the ability to efficiently access this vast amount of data, under the Big Data paradigm; and the ability to automatically extract relevant information from data, thanks to innovative and highly sophisticated data processing and analytical techniques. In recent years Machine Learning has revolutionized data analysis and predictive inference, influencing almost every field of research. Moreover, high-throughput bio-technologies posed additional challenges to effectively manage and process Big Data in Medicine, requiring novel specialized Machine Learning methods and High Performance Computing techniques well-tailored to process and extract knowledge from big bio-medical data. In this thesis we present three High Performance Computing Machine Learning techniques that have been designed and developed for tackling three fundamental and still open questions in the context of Precision and Genomic Medicine: i) identification of pathogenic and deleterious genomic variants among the "sea" of neutral variants in the non-coding regions of the DNA; ii) detection of the activity of regulatory regions across different cell lines and tissues; iii) automatic protein function prediction and drug repurposing in the context of biomolecular networks. For the first problem we developed parSMURF, a novel hyper-ensemble method able to deal with the huge data imbalance that characterizes the detection of pathogenic variants in the non-coding regulatory regions of the human genome. We implemented this approach with highly parallel computational techniques using supercomputing resources at CINECA (Marconi – KNL) and HPC Center Stuttgart (HLRS Apollo HAWK), obtaining state-of-the-art results. For the second problem we developed Deep Feed Forward and Deep Convolutional Neural Networks to process epigenetic and DNA sequence data, respectively, to detect active promoters and enhancers in specific tissues at genome-wide level, using GPU devices to parallelize the computation. Finally, we developed scalable semi-supervised graph-based Machine Learning algorithms based on parametrized Hopfield Networks to process large biological graphs in parallel on GPU devices, using a parallel coloring method that improves the classical Luby greedy algorithm. We also present ongoing extensions of parSMURF, very recently awarded by the Partnership for Advanced Computing in Europe (PRACE) consortium, to further develop the algorithm, apply it to huge genomic data and embed its results into Genomiser, a state-of-the-art computational tool for the detection of pathogenic variants associated with Mendelian genetic diseases, in the context of an international collaboration with the Jackson Lab for Genomic Medicine.
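As an aside on the graph-coloring component mentioned above, the round-based idea behind Luby-style parallel coloring can be sketched in a few lines of (CPU) Python; the dissertation's GPU implementation and its improvements over Luby's algorithm are not reproduced here, so treat this as an assumed, simplified variant.

    import numpy as np

    def luby_style_coloring(adj, seed=0):
        """Round-based coloring in the spirit of Luby's MIS algorithm: each
        round, uncolored vertices draw random priorities, and every vertex
        that beats all of its uncolored neighbours takes the current color."""
        rng = np.random.default_rng(seed)
        n = len(adj)
        color = np.full(n, -1)
        current = 0
        while (color == -1).any():
            uncolored = np.flatnonzero(color == -1)
            priority = rng.random(n)
            winners = [v for v in uncolored
                       if all(priority[v] > priority[u] for u in adj[v] if color[u] == -1)]
            color[winners] = current       # winners form an independent set
            current += 1
        return color

    # toy usage on a 5-cycle given as an adjacency list
    adj = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
    print(luby_style_coloring(adj))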
Style APA, Harvard, Vancouver, ISO itp.
39

Malazizi, Ladan. "Development of Artificial Intelligence-based In-Silico Toxicity Models. Data Quality Analysis and Model Performance Enhancement through Data Generation". Thesis, University of Bradford, 2008. http://hdl.handle.net/10454/4262.

Pełny tekst źródła
Streszczenie:
Toxic compounds, such as pesticides, are routinely tested against a range of aquatic, avian and mammalian species as part of the registration process. The need to reduce dependence on animal testing has led to an increasing interest in alternative methods such as in silico modelling. QSAR (Quantitative Structure Activity Relationship)-based models are already in use for predicting physicochemical properties, environmental fate, eco-toxicological effects, and specific biological endpoints for a wide range of chemicals. Data play an important role in modelling QSARs and also in result analysis for toxicity testing processes. This research addresses a number of issues in predictive toxicology. One issue is the problem of data quality. Although a large amount of toxicity data is available from online sources, this data may contain some unreliable samples and may be defined as of low quality. Its presentation also might not be consistent across different sources, which makes access, interpretation and comparison of the information difficult. To address this issue we started with a detailed investigation and experimental work on DEMETRA data. The DEMETRA datasets have been produced by the EC-funded project DEMETRA. Based on the investigation, the experiments and the results obtained, the author identified a number of data quality criteria in order to provide a solution for data evaluation in the toxicology domain. An algorithm has also been proposed to assess data quality before modelling. Another issue considered in the thesis was missing values in datasets for the toxicology domain. The Least Squares Method for a paired dataset and Serial Correlation for a single-version dataset provided solutions for the problem in two different situations. A procedural algorithm using these two methods has been proposed in order to overcome the problem of missing values. Another issue addressed in this thesis was the modelling of multi-class datasets in which a severely imbalanced class distribution exists. The imbalanced data affect the performance of classifiers during the classification process. We have shown that, as long as we understand how class members are constructed in the feature space of each cluster, we can reform the distribution and provide more domain knowledge for the classifier.
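One plausible reading of the "Least Squares Method for a paired dataset" idea (an assumption on my part, since the thesis's exact formulation is not reproduced here) is to regress the variable with gaps on its paired counterpart and use the fitted line to fill the missing entries:

    import numpy as np

    def paired_least_squares_impute(x, y):
        """Fill missing values of y from its paired variable x by fitting a
        least-squares line on the complete pairs."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        ok = ~np.isnan(x) & ~np.isnan(y)
        slope, intercept = np.polyfit(x[ok], y[ok], deg=1)
        filled = y.copy()
        missing = np.isnan(y) & ~np.isnan(x)
        filled[missing] = slope * x[missing] + intercept
        return filled

    # toy usage: toxicity endpoint y with gaps, paired descriptor x
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, np.nan, 6.2, np.nan, 10.1])
    print(paired_least_squares_impute(x, y))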
Style APA, Harvard, Vancouver, ISO itp.
40

Matsumoto, Élia Yathie. "A methodology for improving computed individual regressions predictions". Universidade de São Paulo, 2015. http://www.teses.usp.br/teses/disponiveis/3/3142/tde-12052016-140407/.

Pełny tekst źródła
Streszczenie:
This research proposes a methodology to improve computed individual prediction values provided by an existing regression model without having to change either its parameters or its architecture. In other words, we are interested in achieving more accurate results by adjusting the calculated regression prediction values, without modifying or rebuilding the original regression model. Our proposition is to adjust the regression prediction values using individual reliability estimates that indicate whether a single regression prediction is likely to produce an error considered critical by the user of the regression. The proposed method was tested in three sets of experiments using three different types of data. The first set of experiments worked with synthetically produced data, the second with cross-sectional data from the public data source UCI Machine Learning Repository, and the third with time series data from ISO-NE (Independent System Operator in New England). The experiments with synthetic data were performed to verify how the method behaves in controlled situations. In this case, the experiments produced the largest improvements for cleaner, artificially produced datasets, with progressive worsening as random elements were added. The experiments with real data extracted from UCI and ISO-NE were done to investigate the applicability of the methodology in the real world. The proposed method was able to improve regression prediction values in about 95% of the experiments with real data.
Esta pesquisa propõe uma metodologia para melhorar previsões calculadas por um modelo de regressão, sem a necessidade de modificar seus parâmetros ou sua arquitetura. Em outras palavras, o objetivo é obter melhores resultados por meio de ajustes nos valores computados pela regressão, sem alterar ou reconstruir o modelo de previsão original. A proposta é ajustar os valores previstos pela regressão por meio do uso de estimadores de confiabilidade individuais capazes de indicar se um determinado valor estimado é propenso a produzir um erro considerado crítico pelo usuário da regressão. O método proposto foi testado em três conjuntos de experimentos utilizando três tipos de dados diferentes. O primeiro conjunto de experimentos trabalhou com dados produzidos artificialmente, o segundo, com dados transversais extraídos no repositório público de dados UCI Machine Learning Repository, e o terceiro, com dados do tipo séries de tempos extraídos do ISO-NE (Independent System Operator in New England). Os experimentos com dados artificiais foram executados para verificar o comportamento do método em situações controladas. Nesse caso, os experimentos alcançaram melhores resultados para dados limpos artificialmente produzidos e evidenciaram progressiva piora com a adição de elementos aleatórios. Os experimentos com dados reais extraído das bases de dados UCI e ISO-NE foram realizados para investigar a aplicabilidade da metodologia no mundo real. O método proposto foi capaz de melhorar os valores previstos por regressões em cerca de 95% dos experimentos realizados com dados reais.
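The methodology itself is not reproduced here, but the general pattern, leaving the original regression untouched and learning a separate correction from its held-out errors, can be sketched as follows; note this uses a residual-correcting model as a stand-in for the thesis's individual reliability estimates.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    # toy data standing in for a real regression problem
    X = np.random.rand(600, 4)
    y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.3 * np.random.randn(600)
    X_tr, X_ad, y_tr, y_ad = train_test_split(X, y, test_size=0.5, random_state=0)

    original = LinearRegression().fit(X_tr, y_tr)    # existing model, never rebuilt
    residuals = y_ad - original.predict(X_ad)        # its errors on held-out data
    corrector = GradientBoostingRegressor().fit(X_ad, residuals)

    def adjusted_predict(X_new):
        """Original prediction shifted by the estimated error."""
        return original.predict(X_new) + corrector.predict(X_new)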
Style APA, Harvard, Vancouver, ISO itp.
41

Hrabina, Martin. "VÝVOJ ALGORITMŮ PRO ROZPOZNÁVÁNÍ VÝSTŘELŮ". Doctoral thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2019. http://www.nusl.cz/ntk/nusl-409087.

Pełny tekst źródła
Streszczenie:
This thesis deals with gunshot recognition and related problems. First, the overall task is introduced and broken down into smaller steps. Next, an overview of sound databases, important publications, events and the current state of the art is provided, together with an overview of possible applications of gunshot detection. The second part consists of comparing features using various metrics, together with a comparison of their recognition performance. A comparison of recognition algorithms follows, and new features usable for recognition are introduced. The thesis culminates in the design of a two-stage gunshot recognition system that monitors its surroundings in real time. The conclusion summarizes the achieved results and outlines further work.
Style APA, Harvard, Vancouver, ISO itp.
42

SAMADI, HEANG, i 黄善玉. "Applying Linear Hazard Transform for Mortality Prediction to Taiwanese Dataset". Thesis, 2018. http://ndltd.ncl.edu.tw/handle/r8q3fn.

Pełny tekst źródła
Streszczenie:
Master's thesis
Feng Chia University
Department of Risk Management and Insurance
106
The thesis presents and compares two mortality models, the Linear Hazard Transform (LHT) and Lee-Carter, based on real data from Taiwan. Empirical observation of two sequences of the force of mortality for two different years (not necessarily consecutive) shows that there is a linear relation, and the two coefficients of the LHT model can be estimated by linear regression to capture the mortality improvement. With these two fitted coefficients, we plot the actual and fitted survival curves and observe that the LHT fits well under several statistical criteria. After forecasting the parameters, future mortality rates can be predicted year by year. Moreover, the optimal fitting period is determined based on numerous experiments, and the intercept coefficient in the model has been modified. Lastly, the application to the TSO dataset for net single premium calculation is presented. Keywords: Linear Hazard Transform, Lee Carter, Net Single Premium, Fitting Mortality, Mortality Projection, Model Selection, AIC, BIC, RMSE, MAE, MAPE.
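The linear relation described above can be illustrated with a short Python sketch using invented forces of mortality; the LHT coefficients are just the slope and intercept of a least-squares fit across ages, which can then be reused to project a later year.

    import numpy as np

    # Invented data: forces of mortality for a base year and a later year.
    ages = np.arange(30, 91)
    mu_base = 0.0001 * np.exp(0.09 * (ages - 30))        # toy Gompertz-like base year
    mu_later = 0.92 * mu_base + 0.00002                   # toy improved later year

    # LHT-style fit: regress the later year's forces on the base year's.
    A = np.vstack([mu_base, np.ones_like(mu_base)]).T
    (a, b), *_ = np.linalg.lstsq(A, mu_later, rcond=None)

    # Naive one-step-ahead projection and the implied survival curve.
    mu_projected = a * mu_later + b
    survival = np.exp(-np.cumsum(mu_projected))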
Style APA, Harvard, Vancouver, ISO itp.
43

Lo, Chia-Yu, i 駱佳妤. "Recurrent Learning on PM2.5 Prediction Based on Clustered Airbox Dataset". Thesis, 2019. http://ndltd.ncl.edu.tw/handle/r49hyt.

Pełny tekst źródła
Streszczenie:
Master's thesis
National Central University
Department of Computer Science and Information Engineering
107
The progress of industrial development naturally leads to demand for more electrical power. Unfortunately, due to fears about the safety of nuclear power plants, many countries have relied on thermal power plants, which produce more air pollutants during the coal-burning process. This phenomenon, together with growing vehicle emissions around us, constitutes the primary factor behind serious air pollution. Inhaling too much particulate air pollution, especially PM2.5, may lead to respiratory diseases and even death. By predicting air pollutant concentrations, people can take precautions to avoid overexposure to air pollutants. Consequently, accurate PM2.5 prediction becomes more important. In this thesis, we propose a PM2.5 prediction system which utilizes the dataset from EdiGreen Airbox and Taiwan EPA. Autoencoder and linear interpolation are adopted for solving the missing value problem. Spearman's correlation coefficient is used to identify the most relevant features for PM2.5. Two prediction models (i.e., LSTM and LSTM based on K-means) are implemented, which predict the PM2.5 value for each Airbox device. To assess the prediction performance, the daily average error and the hourly average accuracy over the duration of a week are calculated. The experimental results show that LSTM based on K-means has the best performance among all methods. Therefore, LSTM based on K-means is chosen to provide real-time PM2.5 prediction through the Linebot.
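A compressed sketch of the "LSTM based on K-means" idea, clustering the Airbox devices first and then training one recurrent model per cluster on hourly PM2.5 windows, is given below; the window length, network size and training settings are assumptions, and the real system's feature selection and interpolation steps are omitted.

    import numpy as np
    import tensorflow as tf
    from sklearn.cluster import KMeans

    # toy hourly PM2.5 series for 12 devices (invented data)
    series = np.abs(np.random.randn(12, 500)) * 20

    # group devices with similar pollution profiles
    cluster_of = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(series)

    def make_windows(ts, lookback=24):
        """Turn one series into (lookback-hour input, next-hour target) pairs."""
        X = np.stack([ts[i:i + lookback] for i in range(len(ts) - lookback)])
        return X[..., None], ts[lookback:]

    models = {}
    for c in range(3):
        X, y = zip(*(make_windows(series[d]) for d in np.flatnonzero(cluster_of == c)))
        X, y = np.concatenate(X), np.concatenate(y)
        model = tf.keras.Sequential([
            tf.keras.layers.Input(shape=(24, 1)),
            tf.keras.layers.LSTM(32),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mae")
        model.fit(X, y, epochs=2, verbose=0)     # toy training settings
        models[c] = model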
Style APA, Harvard, Vancouver, ISO itp.
44

Chen, Mei-Yun, i 陳鎂鋆. "Prediction Model for Semitransparent Watercolor PigmentMixtures Using Deep Learning with a Dataset of Transmittance and Reflectance". Thesis, 2019. http://ndltd.ncl.edu.tw/handle/24u7ek.

Pełny tekst źródła
Streszczenie:
Doctoral dissertation
National Taiwan University
Graduate Institute of Networking and Multimedia
107
Learning color mixing is difficult for novice painters. In order to support novice painters in learning color mixing, we propose a prediction model for semitransparent pigment mixtures and use its prediction results to create a Smart Palette system. Such a system is constructed by first building a watercolor dataset with two types of color mixing data, characterized by transmittance and reflectance: incrementation of the same primary pigment and mixtures of two different pigments. Next, we apply the collected data to a deep neural network to train a model for predicting the results of semitransparent pigment mixtures. Finally, we construct a Smart Palette that provides easily followable instructions on mixing a target color with two primary pigments in real life: when users pick a pixel, an RGB color, from an image, the system returns its mixing recipe, indicating the two primary pigments to use and their quantities. When evaluating the pigment mixtures produced by the aforementioned model against ground truth, 83% of the test set registered a color distance of ΔE*ab < 5; a ΔE*ab above 5 is where average observers start judging the compared colors to be two different colors. In addition, in order to examine the effectiveness of the Smart Palette system, we design a user evaluation in which untrained users perform pigment mixing with three methods: by intuition, based on Itten's color wheel, and with the Smart Palette; the results are then compiled as three sets of color distance (ΔE*ab) values. After that, the color distances of the three methods are examined by a t-test to determine whether the differences are significant. Combining the color distance results and the t-values of the t-test demonstrates that the mixing results produced with the Smart Palette are clearly closer to the target color than those of the other methods. Based on these evaluations, the Smart Palette demonstrates that it can effectively help users learn and perform color mixing better than the traditional method.
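For reference, the ΔE*ab threshold used above is the CIE76 Euclidean distance between two colors in CIELAB space, which can be computed as in this small sketch (the Lab values are invented):

    import numpy as np

    def delta_e_ab(lab1, lab2):
        """CIE76 color difference: Euclidean distance in CIELAB space.
        Values above roughly 5 are treated in the abstract as visibly
        different to an average observer."""
        return float(np.linalg.norm(np.asarray(lab1, float) - np.asarray(lab2, float)))

    target = (52.0, 42.5, -30.1)     # L*, a*, b* of the picked pixel (toy values)
    mixed = (50.5, 44.0, -27.8)      # L*, a*, b* of the physically mixed result
    print(delta_e_ab(target, mixed) < 5.0)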
Style APA, Harvard, Vancouver, ISO itp.
45

MUKUL. "EFFICIENT CLASSIFICATION ON THE BASIS OF DECISION TRESS". Thesis, 2019. http://dspace.dtu.ac.in:8080/jspui/handle/repository/17065.

Pełny tekst źródła
Streszczenie:
Machine learning has gained great interest among researchers in the last decade. With such a large community providing a continuously growing list of proposed algorithms, it is rapidly finding solutions to its problems. Therefore, more and more people are entering the field of machine learning to make machine learning algorithms more useful and reliable for the digital world. In this thesis, efficient and accurate classification performance by various machine learning algorithms on the proposed heart disease prediction dataset is discussed. The dataset consists of 14 columns and 1026 rows. The first 13 columns contain predictor variables and the last column is the target variable, which takes categorical values (0 and 1). Our main focus is to find the best-fitting model on the proposed dataset. The machine learning algorithms that are applied and compared are KNN, SVM and decision tree classifiers. The programming tools used are Python 3.7 and Jupyter Notebook. The external libraries installed for plotting decision trees are graphviz and pydotplus. This thesis comprises six chapters. Chapter 1 is the introduction, giving a brief review of the technologies used in the research work. Chapter 2 is the literature review section, which focuses on past work done to date on the above-mentioned dataset and research ideology. Chapter 3 describes the workings of the proposed machine learning algorithms and their respective methodologies in detail. Chapter 4 is the results and discussion part, in which the performance of the various machine learning algorithms used in the research work is evaluated and the best-performing algorithm is identified. Chapter 5 is the conclusion, in which the best-fitting machine learning model is determined on the basis of the results and discussion in Chapter 4. Chapter 6 is the references section. In brief, the aim of this thesis is to find the best-fitting model, with the highest accuracy and the minimum number of misclassifications, on the given heart disease prediction dataset.
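A condensed sketch of the comparison pipeline the thesis describes, KNN, SVM and a decision tree cross-validated on a tabular medical dataset, with the tree exported for graphviz, might look like this; the breast cancer dataset is only a stand-in for the 14-column heart disease data, and the hyperparameters are assumptions.

    from sklearn.datasets import load_breast_cancer   # stand-in for the heart dataset
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier, export_graphviz

    X, y = load_breast_cancer(return_X_y=True)
    models = {
        "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=7)),
        "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
        "DecisionTree": DecisionTreeClassifier(max_depth=4, random_state=0),
    }
    for name, model in models.items():
        acc = cross_val_score(model, X, y, cv=10).mean()
        print(f"{name}: {acc:.3f}")

    # dot source for the fitted tree, viewable with graphviz as in the thesis
    tree = models["DecisionTree"].fit(X, y)
    dot = export_graphviz(tree, filled=True)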
Style APA, Harvard, Vancouver, ISO itp.
46

Raj, Rohit. "Towards Robustness of Neural Legal Judgement System". Thesis, 2023. https://etd.iisc.ac.in/handle/2005/6145.

Pełny tekst źródła
Streszczenie:
Legal Judgment Prediction (LJP) applies Natural Language Processing (NLP) techniques to predict judgment results based on fact descriptions. It can play a vital role as a legal assistant and benefit legal practitioners and regular citizens. Recently, rapid advances in transformer-based pre-trained language models have led to considerable improvement in this area. However, empirical results show that existing LJP systems are not robust to adversaries and noise. Also, they cannot handle long legal documents. In this work, we explore the robustness and efficiency of LJP systems, even in a low data regime. In the first part, we empirically verify that existing state-of-the-art LJP systems are not robust. We further provide a novel architecture for LJP tasks which can handle long texts and adversarial examples. Our model performs better than state-of-the-art models, even in the presence of adversarial examples from the legal domain. In the second part, we investigate an approach for the LJP system in a low data regime. We further divide this second part into two scenarios depending on the number of unseen classes in the dataset used for the LJP system. In the first scenario, we propose a few-shot approach with only two labels for the judgment prediction task. In the second scenario, we propose an approach for the case where there is a large number of labels for judgment prediction. For both approaches, we provide novel architectures using few-shot learning that are also robust to adversaries. We conducted extensive experiments on American, European, and Indian legal datasets in the few-shot scenario. Though trained using the few-shot approach, our models perform comparably to state-of-the-art models that are trained using large datasets in the legal domain.
Style APA, Harvard, Vancouver, ISO itp.
47

Lutu, P. E. N. (Patricia Elizabeth Nalwoga). "Dataset selection for aggregate model implementation in predictive data mining". Thesis, 2010. http://hdl.handle.net/2263/29486.

Pełny tekst źródła
Streszczenie:
Data mining has become a commonly used method for the analysis of organisational data, for purposes of summarizing data in useful ways and identifying non-trivial patterns and relationships in the data. Given the large volumes of data that are collected by business, government, non-government and scientific research organizations, a major challenge for data mining researchers and practitioners is how to select relevant data for analysis in sufficient quantities, in order to meet the objectives of a data mining task. This thesis addresses the problem of dataset selection for predictive data mining. Dataset selection was studied in the context of aggregate modeling for classification. The central argument of this thesis is that, for predictive data mining, it is possible to systematically select many dataset samples and employ different approaches (different from current practice) to feature selection, training dataset selection, and model construction. When a large amount of information in a large dataset is utilised in the modeling process, the resulting models will have a high level of predictive performance and should be more reliable. Aggregate classification models, also known as ensemble classifiers, have been shown to provide a high level of predictive accuracy on small datasets. Such models are known to achieve a reduction in the bias and variance components of the prediction error of a model. The research for this thesis was aimed at the design of aggregate models and the selection of training datasets from large amounts of available data. The objectives for the model design and dataset selection were to reduce the bias and variance components of the prediction error for the aggregate models. Design science research was adopted as the paradigm for the research. Large datasets obtained from the UCI KDD Archive were used in the experiments. Two classification algorithms, See5 for classification tree modeling and K-Nearest Neighbour, were used in the experiments. The two methods of aggregate modeling that were studied are One-Vs-All (OVA) and positive-Vs-negative (pVn) modeling. While OVA is an existing method that has been used for small datasets, pVn is a new method of aggregate modeling proposed in this thesis. Methods for feature selection and training dataset selection from large datasets, for OVA and pVn aggregate modeling, were studied. The feature selection experiments revealed that the use of many samples, robust measures of correlation, and validation procedures results in the reliable selection of relevant features for classification. A new algorithm for feature subset search, based on the decision rule-based approach to heuristic search, was designed, and the performance of this algorithm was compared to two existing algorithms for feature subset search. The experimental results revealed that the new algorithm makes better decisions for feature subset search. The information provided by a confusion matrix was used as a basis for the design of OVA and pVn base models, which are combined into one aggregate model. A new construct called a confusion graph was used in conjunction with new algorithms for the design of pVn base models. A new algorithm for combining base model predictions and resolving conflicting predictions was designed and implemented. Experiments to study the performance of the OVA and pVn aggregate models revealed that the aggregate models provide a high level of predictive accuracy compared to single models.
Finally, theoretical models to depict the relationships between the factors that influence feature selection and training dataset selection for aggregate models are proposed, based on the experimental results.
Thesis (PhD)--University of Pretoria, 2010.
Computer Science
unrestricted
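A small sketch of the One-Vs-All aggregation described in the abstract above, one binary base model per class with predictions combined by taking the most confident base model, is given below; the argmax combination rule is a simple stand-in for the thesis's conflict-resolution algorithm, and the wine data is just an example dataset.

    import numpy as np
    from sklearn.datasets import load_wine
    from sklearn.linear_model import LogisticRegression

    X, y = load_wine(return_X_y=True)
    classes = np.unique(y)
    # one binary base model per class (class c vs the rest)
    base_models = {c: LogisticRegression(max_iter=5000).fit(X, (y == c).astype(int))
                   for c in classes}

    def ova_predict(X_new):
        """Combine base-model confidences; ties/conflicts resolved by argmax."""
        scores = np.column_stack([base_models[c].predict_proba(X_new)[:, 1] for c in classes])
        return classes[np.argmax(scores, axis=1)]

    print((ova_predict(X) == y).mean())   # training accuracy of the aggregate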
Style APA, Harvard, Vancouver, ISO itp.
48

Ralho, João Pedro Loureiro. "Learning Single-View Plane Prediction from autonomous driving datasets". Master's thesis, 2019. http://hdl.handle.net/10316/87852.

Pełny tekst źródła
Streszczenie:
Dissertation of the Integrated Master's in Electrical and Computer Engineering presented to the Faculty of Sciences and Technology
A reconstrução 3D tradicional usando múltiplas imagens apresenta algumas dificuldades em cenas com pouca ou repetida textura, superfícies inclinadas, iluminação variada e especularidades. Este problema foi resolvido através de geometria planar (PPR), bastante frequente em estruturas construidas pelo ser humano. As primitivas planares são usadas para obter uma reconstrução mais precisa, geometricamente simples, e visualmente mais apelativa que uma núvem de pontos. Estimação de profundidade através de uma única imagem (SIDE) é uma ideia bastante apelativa que recentemente ganhou novo destaque devido à emergência de métodos de aprendizagem e de novas formas de gerar grandes conjuntos de dados RGB-D precisos. No fundo, esta dissertação pretende extender o trabalho desenvolvido em SIDE para reconstrução 3D usando primitivas planares através de uma única imagem (SI-PPR). Os métodos existentes apresentam alguma dificuldade em gerar bons resultados porque não existem grandes coleções de dados PPR precisos de cenas reais. Como tal, o objetivo desta dissertação é propor um pipeline para gerar de forma eficiente grandes coleções de dados PPR para retreinar métodos de estimação PPR. O pipeline é composto por três partes, uma responsável por gerar informação sobre a profundidade numa imagem através do colmap, segmentação manual dos planos verificados na imagem, e uma propagação automática da segmentação realizada e dos parâmetros dos planos para as imagens vizinhas usando uma nova estratégia com base em restrições geométricas e amostragem aleatória. O pipeline criado é capaz de gerar dados PPR com eficiência e precisão a partir de imagens reais.
Traditional 3D reconstruction using multiple images has some difficulties in dealing with scenes with little or repeated texture, slanted surfaces, variable illumination, and specularities. This problem has been addressed using piece-wise planar reconstruction (PPR), which exploits a planarity prior that is quite common in man-made environments. Planar primitives are used for a more accurate, geometrically simpler, and visually more appealing reconstruction than a cloud of points. Single image depth estimation (SIDE) is a very appealing idea that has recently gained new prominence due to the emergence of learning methods and new ways to generate large, accurate RGB-D datasets. This dissertation intends to extend the work developed in SIDE and work on single image piece-wise planar reconstruction (SI-PPR). Existing methods struggle to output accurate planar information from images because there are no large collections of accurate PPR data from real imagery. Therefore, this dissertation proposes a pipeline to efficiently generate large PPR datasets to re-train PPR estimation approaches. The pipeline is composed of three main stages: depth data generation using colmap, manual labeling of a small percentage of the images of the dataset, and automatic label and plane propagation to neighbouring views using a new strategy based on geometric constraints and random sampling. The pipeline created is able to efficiently and accurately generate PPR data from real images.
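The "geometric constraints and random sampling" ingredient rests on fitting planes to 3D points by random sampling; a generic RANSAC-style plane fit is sketched below (this is a textbook routine, not the dissertation's propagation strategy, and the inlier tolerance is an assumed value).

    import numpy as np

    def ransac_plane(points, iters=200, tol=0.01, seed=0):
        """Fit a dominant plane (unit normal n, offset d with n.x + d = 0)
        to a 3D point set by repeated random 3-point sampling."""
        rng = np.random.default_rng(seed)
        best_inliers, best_plane = None, None
        for _ in range(iters):
            p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
            n = np.cross(p1 - p0, p2 - p0)
            norm = np.linalg.norm(n)
            if norm < 1e-12:                 # degenerate (collinear) sample
                continue
            n = n / norm
            d = -n @ p0
            inliers = np.abs(points @ n + d) < tol
            if best_inliers is None or inliers.sum() > best_inliers.sum():
                best_inliers, best_plane = inliers, (n, d)
        return best_plane, best_inliers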
Style APA, Harvard, Vancouver, ISO itp.
49

"Predicting Demographic and Financial Attributes in a Bank Marketing Dataset". Master's thesis, 2016. http://hdl.handle.net/2286/R.I.38651.

Pełny tekst źródła
Streszczenie:
Bank institutions employ several marketing strategies to maximize new customer acquisition as well as current customer retention. Telemarketing is one such approach, in which individual customers are contacted by bank representatives with offers. These telemarketing strategies can be improved by combining them with data mining techniques that allow prediction of customer information and interests. In this thesis, bank telemarketing data from a Portuguese banking institution were analyzed to determine the predictability of several client demographic and financial attributes and to find the most contributing factors for each. Data were preprocessed to ensure quality, and then data mining models were generated for the attributes with logistic regression, support vector machine (SVM) and random forest, using Orange as the data mining tool. Results were analyzed using precision, recall and F1 score.
Dissertation/Thesis
Masters Thesis Computer Science 2016
Style APA, Harvard, Vancouver, ISO itp.
50

Hong-Yang Lin i 林泓暘. "Generating Aggregated Weights to Improve the Predictive Accuracy of Single-Model Ensemble Numerical Predicting Method in Small Datasets". Thesis, 2017. http://ndltd.ncl.edu.tw/handle/6b385s.

Pełny tekst źródła
Streszczenie:
Master's thesis
National Cheng Kung University
Department of Industrial and Information Management
105
In the age of the information explosion, it is easier to access information, so how to extract and summarize useful information from limited data is an important topic in small data learning. Nowadays, studies on ensemble methods mostly focus on the process instead of the result. Data mining methods can be divided into classification and prediction. In ensemble methods, voting is the most common way to deal with classification, but in numerical prediction problems the average is the most common way to combine results, and it can easily be affected by extreme values, especially in the circumstances of small datasets. We make an improvement to Bagging. We use SVR as our prediction model and calculate the error value of each base model, so that a corresponding weight can be assigned to each prediction value; we then calculate a compromise prediction value with the purpose of obtaining the smallest error. In this way we stabilize the system. We compare our method with the average method in order to examine its effect, and we also apply it to a practical case from a panel factory to demonstrate the improvement in the single-model ensemble method.
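The error-weighted aggregation described above can be sketched as follows: a bagged SVR ensemble where each base model's out-of-bag error determines its weight in the final combination, instead of a plain average; the weighting scheme shown (inverse absolute error) is an assumption for illustration, not the thesis's exact formula.

    import numpy as np
    from sklearn.svm import SVR

    def weighted_bagged_svr(X_tr, y_tr, X_new, n_models=10, seed=0):
        """Bagging with SVR base learners where each model's out-of-bag error
        sets its weight (smaller error, larger weight) in the aggregation."""
        rng = np.random.default_rng(seed)
        preds, weights = [], []
        for _ in range(n_models):
            idx = rng.integers(0, len(X_tr), len(X_tr))          # bootstrap sample
            oob = np.setdiff1d(np.arange(len(X_tr)), idx)        # out-of-bag rows
            model = SVR().fit(X_tr[idx], y_tr[idx])
            err = np.mean(np.abs(y_tr[oob] - model.predict(X_tr[oob]))) + 1e-9
            preds.append(model.predict(X_new))
            weights.append(1.0 / err)
        weights = np.array(weights) / np.sum(weights)
        return np.average(np.vstack(preds), axis=0, weights=weights)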
Style APA, Harvard, Vancouver, ISO itp.