Dissertations / Theses on the topic 'ENSEMBLE LEARNING MODELS'

Consult the top 44 dissertations / theses for your research on the topic 'ENSEMBLE LEARNING MODELS.'

1

He, Wenbin. "Exploration and Analysis of Ensemble Datasets with Statistical and Deep Learning Models." The Ohio State University, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=osu1574695259847734.

2

Kim, Jinhan. "J-model : an open and social ensemble learning architecture for classification." Thesis, University of Edinburgh, 2012. http://hdl.handle.net/1842/7672.

Abstract:
Ensemble learning is a promising direction of research in machine learning, in which an ensemble classifier gives better predictive and more robust performance for classification problems by combining other learners. Meanwhile, agent-based systems provide frameworks to share knowledge from multiple agents in an open context. This thesis combines multi-agent knowledge sharing with ensemble methods to produce a new style of learning system for open environments. We are now surrounded by many smart objects such as wireless sensors, ambient communication devices, mobile medical devices and even information supplied via other humans. When we coordinate smart objects properly, we can produce a form of collective intelligence from their collaboration. Traditional ensemble methods and agent-based systems have complementary advantages and disadvantages in this context. Traditional ensemble methods show better classification performance, while agent-based systems might not guarantee their performance for classification. Traditional ensemble methods work as closed and centralised systems (so they cannot handle classifiers in an open context), while agent-based systems are natural vehicles for classifiers in an open context. We designed an open and social ensemble learning architecture, named J-model, to merge the conflicting benefits of the two research domains. The J-model architecture is based on a service choreography approach for coordinating classifiers. Coordination protocols are defined by interaction models that describe how classifiers interact with one another in a peer-to-peer manner. A peer ranking algorithm recommends the most appropriate classifiers to participate in an interaction model, boosting the success rate of their interactions. The participant classifiers recommended by the peer ranking algorithm become an ensemble classifier within J-model. We evaluated J-model's classification performance on 13 UCI machine learning benchmark data sets and on a virtual screening problem as a realistic classification task. J-model achieved better accuracy than 8 other representative traditional ensemble methods on 9 of the 13 benchmark sets, and better specificity on 7 of them. In the virtual screening problem, J-model gave better results than previously published results for 12 out of 16 bioassays. We defined different interaction models for each specific classification task, and the peer ranking algorithm was used across all the interaction models. Our contributions to knowledge are as follows. First, we showed that service choreography can be an effective ensemble coordination method for classifiers in an open context. Second, we used interaction models that implement task-specific coordination of classifiers to solve a variety of representative classification problems. Third, we designed the peer ranking algorithm, which is generally and independently applicable to the task of recommending appropriate member classifiers from a classifier pool based on an open pool of interaction models and classifiers.
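
The peer-ranking idea at the heart of J-model can be caricatured in a few lines: rank a pool of classifiers by how successful their past interactions were, then let the top-ranked peers vote. A minimal sketch, assuming scikit-learn and using validation accuracy as a stand-in for the thesis's interaction-based success rate (the actual choreography and interaction models are not reproduced here):

```python
# Peer-ranking-style ensemble sketch: rank a pool of classifiers by held-out
# success rate, then let the top-ranked peers vote (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

pool = [DecisionTreeClassifier(max_depth=3, random_state=i) for i in range(5)]
pool += [GaussianNB(), LogisticRegression(max_iter=1000)]
for clf in pool:
    clf.fit(X_tr, y_tr)

# "Peer ranking": score each classifier on the interactions it took part in
# (approximated here by validation accuracy) and keep the top 3.
scores = [clf.score(X_val, y_val) for clf in pool]
top_k = [pool[i] for i in np.argsort(scores)[-3:]]

# The recommended peers form the ensemble; a majority vote gives the prediction.
votes = np.stack([clf.predict(X_val) for clf in top_k])
y_pred = (votes.mean(axis=0) >= 0.5).astype(int)
print("ensemble accuracy:", (y_pred == y_val).mean())
```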
3

Gharroudi, Ouadie. "Ensemble multi-label learning in supervised and semi-supervised settings." Thesis, Lyon, 2017. http://www.theses.fr/2017LYSE1333/document.

Abstract:
Multi-label learning is a supervised learning problem in which each instance can be associated with multiple target labels simultaneously. It is ubiquitous in machine learning and arises naturally in many real-world applications such as document classification, automatic music tagging and image annotation. In this thesis, we formulate multi-label learning as an ensemble learning problem in order to provide satisfactory solutions for both the multi-label classification and the feature selection tasks, while remaining consistent with respect to any type of objective loss function. We first discuss why the state-of-the-art multi-label algorithms using a committee of multi-label models suffer from certain practical drawbacks. We then propose a novel strategy to build and aggregate k-labelsets-based committees in the context of ensemble multi-label classification. We then analyse in depth the effect of the aggregation step within ensemble multi-label approaches and investigate how this aggregation impacts prediction performance with respect to the objective multi-label loss metric. Next, we address the specific problem of identifying relevant subsets of features, among potentially irrelevant and redundant ones, in the multi-label context based on the ensemble paradigm. Three wrapper multi-label feature selection methods based on the Random Forest paradigm are proposed; these methods differ in the way they consider label dependence within the feature selection process. Finally, we extend the multi-label classification and feature selection problems to the semi-supervised setting, where only a few labelled instances are available. We propose a new semi-supervised multi-label feature selection approach based on the ensemble paradigm; the proposed model combines ideas from co-training and multi-label k-labelsets committee construction in tandem with an inner out-of-bag label feature importance evaluation. Satisfactorily tested on several benchmark data sets, the approaches developed in this thesis show promise for a variety of applications in supervised and semi-supervised multi-label learning.
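
The k-labelsets committee the thesis builds on can be illustrated with a generic RAkEL-style sketch: each member learns one randomly drawn combination of k labels as a single multi-class problem, and per-label votes are averaged. This is only the generic idea, assuming scikit-learn; the thesis's own construction and aggregation strategies differ:

```python
# Generic RAkEL-style k-labelsets committee (illustrative sketch).
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier

X, Y = make_multilabel_classification(n_samples=400, n_classes=6, n_labels=3,
                                      random_state=0)
rng = np.random.default_rng(0)
k, n_members = 3, 10
votes, counts = np.zeros(Y.shape), np.zeros(Y.shape[1])

for _ in range(n_members):
    labelset = rng.choice(Y.shape[1], size=k, replace=False)
    # Encode each k-label combination as a single multi-class target string.
    target = np.array(["".join(map(str, row)) for row in Y[:, labelset]])
    member = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, target)
    pred = np.array([[int(c) for c in s] for s in member.predict(X)])
    votes[:, labelset] += pred
    counts[labelset] += 1

# Aggregation step: per-label majority vote over the committee.
Y_hat = (votes / np.maximum(counts, 1)) >= 0.5
print("per-label agreement with training labels:", (Y_hat == Y).mean().round(3))
```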
4

Henriksson, Aron. "Ensembles of Semantic Spaces : On Combining Models of Distributional Semantics with Applications in Healthcare." Doctoral thesis, Stockholms universitet, Institutionen för data- och systemvetenskap, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-122465.

Abstract:
Distributional semantics allows models of linguistic meaning to be derived from observations of language use in large amounts of text. By modeling the meaning of words in semantic (vector) space on the basis of co-occurrence information, distributional semantics permits a quantitative interpretation of (relative) word meaning in an unsupervised setting, i.e., human annotations are not required. The ability to obtain inexpensive word representations in this manner helps to alleviate the bottleneck of fully supervised approaches to natural language processing, especially since models of distributional semantics are data-driven and hence agnostic to both language and domain. All that is required to obtain distributed word representations is a sizeable corpus; however, the composition of the semantic space is not only affected by the underlying data but also by certain model hyperparameters. While these can be optimized for a specific downstream task, there are currently limitations to the extent to which the many aspects of semantics can be captured in a single model. This dissertation investigates the possibility of capturing multiple aspects of lexical semantics by adopting the ensemble methodology within a distributional semantic framework to create ensembles of semantic spaces. To that end, various strategies for creating the constituent semantic spaces, as well as for combining them, are explored in a number of studies. The notion of semantic space ensembles is generalizable across languages and domains; however, the use of unsupervised methods is particularly valuable in low-resource settings, in particular when annotated corpora are scarce, as in the domain of Swedish healthcare. The semantic space ensembles are here empirically evaluated for tasks that have promising applications in healthcare. It is shown that semantic space ensembles – created by exploiting various corpora and data types, as well as by adjusting model hyperparameters such as the size of the context window and the strategy for handling word order within the context window – are able to outperform the use of any single constituent model on a range of tasks. The semantic space ensembles are used both directly for k-nearest neighbors retrieval and for semi-supervised machine learning. Applying semantic space ensembles to important medical problems facilitates the secondary use of healthcare data, which, despite its abundance and transformative potential, is grossly underutilized.

At the time of the doctoral defense, Papers 4 and 5 were unpublished conference papers.
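
The core mechanism, combining semantic spaces built with different hyperparameters, can be sketched as follows, assuming the gensim library and toy clinical sentences (the dissertation's actual corpora and combination strategies are richer): two word2vec models with different context windows vote on nearest-neighbour retrieval by summed cosine similarity.

```python
# Ensemble of two semantic spaces: same corpus, different context windows,
# combined by summing cosine similarities for k-nearest-neighbour retrieval.
from gensim.models import Word2Vec

sentences = [["patient", "was", "given", "paracetamol"],
             ["patient", "reported", "nausea", "after", "ibuprofen"]] * 50

narrow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=0)
wide = Word2Vec(sentences, vector_size=50, window=10, min_count=1, seed=0)

def ensemble_neighbours(word, k=3):
    # Each space votes on the ranking via its cosine similarity.
    vocab = [w for w in narrow.wv.index_to_key if w != word]
    scored = [(w, narrow.wv.similarity(word, w) + wide.wv.similarity(word, w))
              for w in vocab]
    return sorted(scored, key=lambda t: -t[1])[:k]

print(ensemble_neighbours("paracetamol"))
```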


5

Chakraborty, Debaditya. "Detection of Faults in HVAC Systems using Tree-based Ensemble Models and Dynamic Thresholds." University of Cincinnati / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1543582336141076.

6

Li, Qiongzhu. "Study of Single and Ensemble Machine Learning Models on Credit Data to Detect Underlying Non-performing Loans." Thesis, Uppsala universitet, Statistiska institutionen, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-297080.

Abstract:
In this paper, we compare the performance of two feature dimension reduction methods, the LASSO and PCA. Both a simulation study and an empirical study show that the LASSO is superior to PCA at selecting significant variables. We apply Logistic Regression (LR), Artificial Neural Network (ANN), Support Vector Machine (SVM), Decision Tree (DT) and their corresponding ensemble machines constructed by bagging and adaptive boosting (AdaBoost) in our study. Three experiments are conducted to explore the impact of class-imbalanced data sets on all models. The empirical study indicates that when the percentage of performing loans exceeds 83.3%, the trained models should be applied with caution. When the data set is class-balanced, ensemble machines indeed perform better than single machines, and the weaker the single machine, the more pronounced the improvement.
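
The study's pipeline maps naturally onto scikit-learn. A hedged sketch on synthetic stand-in data (the original credit data is not public): LASSO keeps variables with non-zero coefficients, PCA keeps an equal number of leading components, and bagged and boosted trees are compared on each reduced set.

```python
# LASSO vs. PCA feature reduction, then bagging and AdaBoost on each view.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=30, n_informative=8,
                           random_state=0)

lasso = LassoCV(cv=5).fit(X, y)           # LASSO as a variable selector
mask = np.abs(lasso.coef_) > 1e-6
X_lasso = X[:, mask]
X_pca = PCA(n_components=max(1, int(mask.sum()))).fit_transform(X)

base = DecisionTreeClassifier(max_depth=3)
for name, Xs in [("LASSO", X_lasso), ("PCA", X_pca)]:
    for ens in (BaggingClassifier(base), AdaBoostClassifier(base)):
        acc = cross_val_score(ens, Xs, y, cv=5).mean()
        print(name, type(ens).__name__, round(acc, 3))
```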
7

Franch, Gabriele. "Deep Learning for Spatiotemporal Nowcasting." Doctoral thesis, Università degli studi di Trento, 2021. http://hdl.handle.net/11572/295096.

Abstract:
Nowcasting – short-term forecasting using current observations – is a key challenge that human activities face on a daily basis. We rely heavily on short-term meteorological predictions in domains such as aviation, agriculture, mobility, and energy production. One of the most important and challenging tasks in meteorology is the nowcasting of extreme events, whose anticipation is needed to mitigate risk in terms of social and economic costs and human safety. The goal of this thesis is to contribute new machine learning methods that improve the spatio-temporal precision of nowcasting of extreme precipitation events. This work builds on recent advances in deep learning for nowcasting, adding methods that improve nowcasting using ensembles and that are trained on novel, original data resources. A new curated multi-year radar scan dataset (TAASRAD19), containing more than 350,000 labelled precipitation records over 10 years, is introduced to provide a baseline benchmark and to foster reproducibility of machine learning modelling. A TrajGRU model is applied to TAASRAD19 and implemented in an operational prototype. The thesis also introduces a novel method for fast analog search based on manifold learning: the tool leverages the entire dataset history in less than 5 seconds and demonstrates the feasibility of predictive ensembles. In the final part of the thesis, a new deep learning architecture based on stacked generalization, ConvSG, is presented, introducing novel concepts for deep learning in precipitation nowcasting: ConvSG is specifically designed to improve predictions of extreme precipitation regimes over published methods, and shows a 117% skill improvement on extreme rain regimes over a single member. Moreover, ConvSG shows skills superior or equal to Lagrangian extrapolation models for all rain rates, achieving a 49% average improvement in predictive skill over extrapolation on the higher precipitation regimes.
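
Stacked generalization over ensemble members, the principle behind ConvSG, can be shown in miniature: a level-1 learner is fit on the members' per-pixel forecasts. The sketch below uses synthetic rain fields and a linear combiner; the actual architecture stacks convolutional members over radar sequences.

```python
# Toy per-pixel stacked generalization over nowcasting ensemble members.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
truth = rng.gamma(2.0, 1.0, size=(200, 16, 16))            # "observed" rain fields
members = [truth + rng.normal(0, s, truth.shape) for s in (0.5, 1.0, 1.5)]

# Flatten pixels: each row holds one pixel's forecast from every member.
X = np.stack([m.ravel() for m in members], axis=1)
meta = LinearRegression().fit(X, truth.ravel())            # level-1 combiner
stacked = meta.predict(X)

rmse = lambda a: np.sqrt(np.mean((a - truth.ravel()) ** 2))
for i, m in enumerate(members):
    print(f"member {i} RMSE:", round(rmse(m.ravel()), 3))
print("stacked RMSE:", round(rmse(stacked), 3))
```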
8

Ekström, Linus, and Andreas Augustsson. "A comparative study of text classification models on invoices : The feasibility of different machine learning algorithms and their accuracy." Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-15647.

Abstract:
Text classification is becoming more important for companies in a world where an increasing amount of digital data is made available. The aim is to investigate whether five different machine learning algorithms can be used to automate the classification of invoice data, and to see which one achieves the highest accuracy. In a later stage, algorithms are combined in an attempt to achieve higher results. N-grams are used, and results are compared in the form of total classification accuracy for each algorithm. The Python library scikit-learn, which implements the chosen algorithms, was used. Data was collected and generated to represent the fields extracted from a real invoice. Results from this thesis show that it is possible to use machine learning for this type of problem. The highest-scoring algorithm (LinearSVC from scikit-learn) classifies 86% of all samples correctly, a margin of 16 percentage points above the acceptable level of 70%.
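
The described setup corresponds directly to a standard scikit-learn pipeline; a minimal sketch with invented invoice lines (the thesis's generated data and exact hyperparameters are not reproduced):

```python
# Word n-grams, TF-IDF weighting and a linear SVM, as in the thesis's setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["Office chairs 4x", "Consulting services March", "Printer toner XL",
         "Cloud hosting fee", "Desk lamp LED", "Accounting audit Q2"]
labels = ["furniture", "services", "supplies", "services", "furniture", "services"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["Hosting invoice April"]))
```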
9

Lundberg, Jacob. "Resource Efficient Representation of Machine Learning Models : investigating optimization options for decision trees in embedded systems." Thesis, Linköpings universitet, Statistik och maskininlärning, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-162013.

Abstract:
Combining embedded systems and machine learning models is an exciting prospect. However, to target even the embedded systems with the most stringent resource requirements, the models have to be designed with care not to overwhelm them. Decision tree ensembles are targeted in this thesis. A benchmark model is created with LightGBM, a popular framework for gradient-boosted decision trees. This model is first transformed and regularized with RuleFit, a LASSO regression framework, and then further optimized with quantization and weight sharing, techniques used when compressing neural networks. The entire process is combined into a novel framework, called ESRule. The data used comes from the domain of frequency measurements in cellular networks, a clear use case in which embedded systems can use the resulting resource-optimized models. Compared with LightGBM, ESRule uses 72× less internal memory on average while simultaneously increasing predictive performance; the models use 4 kilobytes on average. The serialized variant of ESRule uses 104× less hard disk space than LightGBM. ESRule is also clearly faster at predicting a single sample.
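
The weight-sharing step borrowed from neural network compression can be sketched in isolation: cluster the model's rule or leaf weights with k-means so each weight is stored as a small index into a shared codebook. A sketch with stand-in weights (ESRule's full pipeline also includes RuleFit rule extraction and regularization):

```python
# Weight sharing via k-means: many leaf/rule weights share a few centroid values.
import numpy as np
from sklearn.cluster import KMeans

leaf_weights = np.random.default_rng(0).normal(size=(500, 1))  # stand-in weights
k = 16                                                         # 16 values -> 4-bit codes
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(leaf_weights)

codes = km.predict(leaf_weights)           # store a 4-bit index per leaf...
codebook = km.cluster_centers_.ravel()     # ...plus one small table of floats
quantized = codebook[codes]
print("max rounding error:", np.abs(quantized - leaf_weights.ravel()).max())
```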
10

Olofsson, Nina. "A Machine Learning Ensemble Approach to Churn Prediction : Developing and Comparing Local Explanation Models on Top of a Black-Box Classifier." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-210565.

Abstract:
Churn prediction methods are widely used in Customer Relationship Management and have proven valuable for retaining customers. To obtain high predictive performance, recent studies rely on increasingly complex machine learning methods, such as ensemble or hybrid models. However, the more complex a model is, the more difficult it becomes to understand how its decisions are actually made. Previous studies on machine learning interpretability have taken a global perspective on understanding black-box models. This study explores the use of local explanation models for explaining the individual predictions of a Random Forest ensemble model. Churn prediction was studied on the users of Tink – a finance app. This thesis aims to take local explanations one step further by comparing churn indicators between different user groups. Three sets of groups were created based on differences in three user features. The importance scores of all globally found churn indicators were then computed for each group with the help of local explanation models. The results showed that the groups did not differ significantly with respect to the globally most important churn indicators; instead, differences were found for globally less important churn indicators, concerning the type of information that users stored in the app. In addition to comparing churn indicators between user groups, the result of this study was a well-performing Random Forest ensemble model able to explain the reasons behind churn predictions for individual users. The model proved significantly better than a number of simpler models, with an average AUC of 0.93.
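
A local explanation model of the kind compared in the thesis can be sketched LIME-style: perturb one instance, query the black-box Random Forest, weight the perturbations by proximity, and read churn indicators off a weighted linear surrogate. The data below is synthetic; the thesis's explanation models and features differ:

```python
# LIME-style local surrogate for one Random Forest prediction (sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
forest = RandomForestClassifier(random_state=0).fit(X, y)

rng = np.random.default_rng(0)
x0 = X[0]
Z = x0 + rng.normal(0, 0.5, size=(500, X.shape[1]))     # local perturbations
p = forest.predict_proba(Z)[:, 1]                       # black-box outputs
w = np.exp(-np.linalg.norm(Z - x0, axis=1) ** 2)        # proximity kernel

surrogate = Ridge().fit(Z, p, sample_weight=w)
print("local churn indicators (signed importance):", surrogate.coef_.round(3))
```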
11

Henriksson, Erik, and Kristopher Werlinder. "Housing Price Prediction over Countrywide Data : A comparison of XGBoost and Random Forest regressor models." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-302535.

Abstract:
The aim of this research project is to investigate how an XGBoost regressor compares to a Random Forest regressor in predicting housing prices, with the help of two data sets. The comparison considers training time, inference time and the three evaluation metrics R2, RMSE and MAPE. The data sets are described in detail, together with background on the regressor models used. The method involves substantial cleaning of the two data sets, hyperparameter tuning to find optimal parameters, and 5-fold cross-validation in order to obtain good performance estimates. The finding of this research project is that XGBoost performs better on both small and large data sets. While the Random Forest model can achieve results similar to the XGBoost model, it needs a much longer training time, between 2 and 50 times as long, and has a longer inference time, around 40 times as long. This makes XGBoost especially superior when used on larger sets of data.
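
The core of such a benchmark is easy to reproduce. A sketch, assuming the xgboost package and using a public dataset as a stand-in (the thesis additionally performs hyperparameter tuning and 5-fold cross-validation):

```python
# Same split, same metrics, wall-clock timing for XGBoost vs. Random Forest.
import time
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (mean_absolute_percentage_error,
                             mean_squared_error, r2_score)
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

X, y = fetch_california_housing(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (XGBRegressor(n_estimators=300), RandomForestRegressor(n_estimators=300)):
    t0 = time.time(); model.fit(X_tr, y_tr); train_s = time.time() - t0
    t0 = time.time(); pred = model.predict(X_te); infer_s = time.time() - t0
    print(type(model).__name__,
          "R2=%.3f" % r2_score(y_te, pred),
          "RMSE=%.3f" % np.sqrt(mean_squared_error(y_te, pred)),
          "MAPE=%.3f" % mean_absolute_percentage_error(y_te, pred),
          "train=%.1fs infer=%.2fs" % (train_s, infer_s))
```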
12

Ngo, Khai Thoi. "Stacking Ensemble for auto_ml." Thesis, Virginia Tech, 2018. http://hdl.handle.net/10919/83547.

Abstract:
Machine learning has been a subject undergoing intense study across many different industries and academic research areas. Companies and researchers have taken full advantage of various machine learning approaches to solve their problems; however, a deep understanding of the field is required for developers to fully harness the potential of different machine learning models and to achieve efficient results. This thesis therefore begins by comparing auto_ml with other hyper-parameter optimization techniques. auto_ml is a fully autonomous framework that lessens the knowledge prerequisite to accomplish complicated machine learning tasks. The auto_ml framework automatically selects the best features from a given data set and chooses the best model to fit and predict the data. Through multiple tests, auto_ml outperforms MLP and other similar frameworks on various datasets while using a small amount of processing time. The thesis then proposes and implements a stacking ensemble technique in order to build protection against over-fitting on small datasets into the auto_ml framework. Stacking is a technique used to combine a collection of machine learning models' predictions to arrive at a final prediction. The stacked auto_ml ensemble results are more stable and consistent than those of the original framework across different training sizes of all analyzed small datasets.
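
Generic stacking of this kind, out-of-fold base-model predictions feeding a meta-learner, is available off the shelf in scikit-learn; a minimal sketch (not the auto_ml implementation itself):

```python
# Stacking with cross-validated meta-features via scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5)  # out-of-fold predictions guard against over-fitting on small data
print("stacked CV accuracy:", cross_val_score(stack, X, y, cv=5).mean().round(3))
```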
13

Ferreira, Ednaldo José. "Método baseado em rotação e projeção otimizadas para a construção de ensembles de modelos." Universidade de São Paulo, 2012. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-27062012-161603/.

Abstract:
The development of new techniques capable of inducing predictive models with low generalization errors has been a constant in machine learning and related areas. In this context, the composition of an ensemble of models deserves attention for its theoretical and empirical potential to minimize the generalization error. Several methods for building ensembles are found in the literature; among them, the rotation-based (RB) method has become known for outperforming other traditional methods. The RB method applies principal component analysis (PCA) for feature extraction as a rotation strategy to provide diversity and accuracy among base models. However, this strategy does not ensure that the resulting direction is appropriate for the chosen supervised learning technique (SLT). Moreover, the RB method is not suitable for rotation-invariant SLTs, and it has not been widely evaluated with stable ones, which makes RB inappropriate for, or restricted to, only some SLTs. This thesis proposes a new approach for feature extraction based on the concatenation of rotation and projection optimized for the SLT (called optimized roto-projection). The approach uses a metaheuristic to optimize the parameters of the roto-projection transformation, minimizing the error of the technique directing the optimization. More emphatically, optimized roto-projection is proposed as a fundamental part of a new ensemble method, called the optimized roto-projection ensemble (ORPE). The results show that optimized roto-projection can reduce the dimensionality and the complexity of the data and the model, and can increase the performance of the SLT subsequently applied. The ORPE outperformed, with statistical significance, RB and other methods using stable and unstable SLTs for classification and regression, on databases from public and private domains. The ORPE method was unrestricted and highly effective, holding the first position in every dominance ranking.
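
The optimized roto-projection idea can be caricatured with a random search standing in for the metaheuristic: sample orthogonal rotations, project, and keep whichever transformation minimizes the guiding learner's cross-validated error. A sketch assuming scipy and scikit-learn:

```python
# Search over random rotations; keep the one the guiding learner likes best.
import numpy as np
from scipy.stats import ortho_group
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
learner = LogisticRegression(max_iter=1000)   # the "director" of the search

best_err, best_R = np.inf, None
for seed in range(30):
    R = ortho_group.rvs(X.shape[1], random_state=seed)  # candidate rotation
    Xr = (X @ R)[:, :5]                                 # rotate, then project
    err = 1 - cross_val_score(learner, Xr, y, cv=3).mean()
    if err < best_err:
        best_err, best_R = err, R
print("best rotated-projected CV error:", round(best_err, 3))
```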
14

Top, Mame Kouna. "Analyse des modèles résines pour la correction des effets de proximité en lithographie optique." Thesis, Grenoble, 2011. http://www.theses.fr/2011GRENT007/document.

Abstract:
Progress in microelectronics responds to the need to reduce production costs and to reach new markets. This progress has been possible thanks to advances in projection optical lithography, the printing process principally used in integrated circuit (IC) manufacturing. The miniaturization of integrated circuits has only been possible by pushing the limits of optical resolution. However, this miniaturization increases the sensitivity of the pattern transfer, leading to more optical proximity effects at progressively more advanced technology nodes (45 and 32 nm transistor gate sizes). The correction of these optical proximity effects is indispensable in photolithographic processes for advanced technology nodes. Optical proximity correction (OPC) techniques increase the achievable resolution and the pattern transfer fidelity for advanced lithographic generations. Corrections are made on the mask based on OPC models, which connect the image in the resist to the changes made on the mask, so the reliability of these OPC models is essential for improving pattern transfer fidelity. This thesis analyses and evaluates OPC resist models, which simulate the behaviour of the resist after the photolithographic process. Data modelling and statistical analysis have been used to study these increasingly empirical resist models. Besides the reliability of the model calibration data, we also studied the use of the model calibration platforms generally used in IC manufacturing and the methodology for creating and validating OPC models. The thesis presents the results of the analysis of OPC resist models and proposes a new methodology for creating, analysing and validating these models.
15

Whiting, Jeffrey S. "Cognitive and Behavioral Model Ensembles for Autonomous Virtual Characters." Diss., Brigham Young University, 2007. http://contentdm.lib.byu.edu/ETD/image/etd1873.pdf.

16

Iyer, Vasanth. "Ensemble Stream Model for Data-Cleaning in Sensor Networks." FIU Digital Commons, 2013. http://digitalcommons.fiu.edu/etd/973.

Abstract:
Ensemble stream modeling and data-cleaning are sensor information processing systems with different training and testing methods by which their goals are cross-validated. This research examines a mechanism that seeks to extract novel patterns by generating ensembles from data. The main goal of label-less stream processing is to process the sensed events so as to eliminate uncorrelated noise and choose the most likely model without over-fitting, thus obtaining higher model confidence. Higher-quality streams can be realized by combining many short streams into an ensemble with the desired quality. The framework for the investigation is an existing data mining tool. First, to accommodate feature extraction for events such as a bush or natural forest fire, we take the burnt area (BA*), sensed ground truth obtained from logs, as our target variable. Even though this is an obvious model choice, the results are disappointing, for two reasons: first, the histogram of fire activity is highly skewed; second, the measured sensor parameters are highly correlated. Since using non-descriptive features does not yield good results, we resort to temporal features. By doing so we carefully eliminate the averaging effects; the resulting histogram is more satisfactory, and conceptual knowledge is learned from the sensor streams. Second is the process of feature induction by cross-validating attributes with single or multi-target variables to minimize training error. We use the F-measure score, which combines precision and accuracy, to determine the false alarm rate of fire events. The multi-target data-cleaning trees use the information purity of the target leaf nodes to learn higher-order features, and a sensitive variance measure such as the F-test is performed during each node's split to select the best attribute. The ensemble stream model approach proved to improve when using complicated features with a simpler tree classifier. The ensemble framework for data-cleaning and the enhancements to quantify quality of fitness (30% spatial, 10% temporal, and 90% mobility reduction) of sensors led to the formation of streams for sensor-enabled applications, which further motivates the novelty of stream quality labeling and its importance in handling the vast amounts of real-time mobile streams generated today.
17

Ali, Rozniza. "Ensemble classification and signal image processing for genus Gyrodactylus (Monogenea)." Thesis, University of Stirling, 2014. http://hdl.handle.net/1893/21734.

Abstract:
This thesis presents an investigation into Gyrodactylus species recognition, making use of machine learning classification and feature selection techniques, and explores image feature extraction to demonstrate proof of concept for an envisaged rapid, consistent and secure initial identification of pathogens by field workers and non-expert users. The proposed cognitively inspired framework is designed to provide confident discrimination of a pathogen from its non-pathogenic congeners, which is sought in order to assist diagnostics during periods of a suspected outbreak. Accurate identification of pathogens is key to their control in an aquaculture context, and the monogenean worm genus Gyrodactylus provides an ideal test-bed for the selected techniques. In the proposed algorithm, the concept of classification using a single model is extended to include more than one model. In classifying multiple species of Gyrodactylus, experiments using 557 specimens of nine different species, two classifiers and three feature sets were performed. To combine these models, an ensemble-based majority voting approach has been adopted. Experimental results on a database of Gyrodactylus species show the superior performance of the ensemble system; comparison with single classification approaches indicates that the proposed framework produces a marked improvement in classification performance. The second contribution of this thesis is the exploration of image processing techniques. Active Shape Model (ASM) and Complex Network methods are applied to images of the attachment hooks of several species of Gyrodactylus to classify each specimen according to its true species type. ASM is used to provide landmark points to segment the contour of the image, while the Complex Network model is used to extract information from the contour of an image. The current system aims to confidently classify species, including the notifiable pathogen of Atlantic salmon, to their true class with a high degree of accuracy. Finally, some concluding remarks are made along with a proposal for future work.
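
The ensemble-based majority voting over two classifiers and three feature sets can be sketched as follows, with synthetic stand-ins for the morphometric hook features:

```python
# Majority voting over models trained on different feature views (sketch).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=557, n_classes=3, n_informative=9,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
feature_sets = [slice(0, 7), slice(7, 14), slice(14, 20)]  # three feature views

votes = []
for fs in feature_sets:
    for clf in (SVC(), KNeighborsClassifier()):
        votes.append(clf.fit(X_tr[:, fs], y_tr).predict(X_te[:, fs]))

# Majority vote across the six member models.
votes = np.stack(votes)
y_hat = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("ensemble accuracy:", (y_hat == y_te).mean().round(3))
```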
18

Darwiche, Aiman A. "Machine Learning Methods for Septic Shock Prediction." Diss., NSUWorks, 2018. https://nsuworks.nova.edu/gscis_etd/1051.

Abstract:
Sepsis is a life-threatening organ dysfunction caused by a dysregulated body response to infection. Sepsis is difficult to detect at an early stage, and when not detected early it is difficult to treat, resulting in high mortality rates. Developing improved methods for identifying patients at high risk of suffering septic shock has been the focus of much research in recent years. Building on this body of literature, this dissertation develops an improved method for septic shock prediction. Using data from the MIMIC-III database, an ensemble classifier is trained to identify high-risk patients. A robust prediction model is built by obtaining a risk score from fitting the Cox hazard model on multiple input features; the score is added to the list of features, and a Random Forest ensemble classifier is trained to produce the model. The proposed method, Cox Enhanced Random Forest (CERF), is evaluated by comparing its predictive accuracy to those of extant methods.
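
A CERF-style construction can be sketched with the lifelines library (assumed installed) and invented columns: fit a Cox model, append its partial-hazard risk score as a feature, then train the Random Forest:

```python
# Cox risk score appended as a feature for a Random Forest (illustrative).
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
df = pd.DataFrame({"hr": rng.normal(90, 15, 500),        # heart rate (invented)
                   "lactate": rng.gamma(2, 1, 500),
                   "time": rng.exponential(48, 500),     # hours observed
                   "shock": rng.integers(0, 2, 500)})    # event indicator

cox = CoxPHFitter().fit(df, duration_col="time", event_col="shock")
df["risk_score"] = cox.predict_partial_hazard(df)        # Cox risk feature

X = df[["hr", "lactate", "risk_score"]]
clf = RandomForestClassifier(random_state=0).fit(X, df["shock"])
print("training accuracy:", clf.score(X, df["shock"]).round(3))
```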
19

Li, Jianeng. "Research on a Heart Disease Prediction Model Based on the Stacking Principle." Thesis, Högskolan Dalarna, Informatik, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-34591.

Abstract:
In this study, the prediction model based on the Stacking principle is called the Stacking fusion model. Little evidence demonstrates that the Stacking fusion model possesses better prediction performance in the field of heart disease diagnosis than other classification models. Since this model belongs to the family of ensemble learning models, which have poor interpretability, it should be used with caution in medical diagnoses. The purpose of this study is to verify whether the Stacking fusion model has better prediction performance than stand-alone machine learning models and other ensemble classifiers in the field of heart disease diagnosis, and to find ways to explain this model. This study uses experiments and quantitative analysis to evaluate the prediction performance of eight models in terms of prediction ability, algorithmic stability, false negative rate and run-time. It is shown that the Stacking fusion model with a Naive Bayes classifier, XGBoost and Random Forest as the first-level learners is superior to the other classifiers in prediction ability, and its false negative rate is also outstanding. Furthermore, the Stacking fusion model is explained through the working principle of the model and through the SHAP framework. The SHAP framework explains the model's judgement of the important factors that influence heart disease and the relationship between the values of these factors and the probability of disease. Overall, the two research problems in this study help reveal the prediction performance and reliability of the cardiac disease prediction model based on the Stacking principle. This study provides practical and theoretical support for hospitals using the Stacking principle in the diagnosis of heart disease.
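
The SHAP side of the study can be sketched with the model-agnostic KernelExplainer, since a stacked model has no single tree structure to exploit. The sketch below assumes the shap package and uses a public dataset as a stand-in for the heart disease data, with two of the three first-level learners:

```python
# SHAP values for a stacked model via the model-agnostic KernelExplainer.
import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
stack = StackingClassifier(
    estimators=[("nb", GaussianNB()),
                ("rf", RandomForestClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000), cv=5).fit(X, y)

explainer = shap.KernelExplainer(lambda Z: stack.predict_proba(Z)[:, 1], X[:50])
shap_values = explainer.shap_values(X[:5], nsamples=100)
print(shap_values.shape)   # per-feature contributions to the disease probability
```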
20

Pouilly-Cathelain, Maxime. "Synthèse de correcteurs s’adaptant à des critères multiples de haut niveau par la commande prédictive et les réseaux de neurones." Electronic Thesis or Diss., université Paris-Saclay, 2020. http://www.theses.fr/2020UPASG019.

Abstract:
This PhD thesis deals with the control of nonlinear systems subject to non-differentiable or non-convex constraints. The objective is to design a control law that can take into account any type of constraint that can be evaluated online. To achieve this goal, model predictive control has been used, with barrier functions added to the cost function, and a gradient-free optimization algorithm has been used to solve the resulting optimization problem. In addition, a cost function formulation has been proposed to ensure stability and robustness against disturbances for linear systems; the proof of stability is based on invariant sets and Lyapunov theory. In the case of nonlinear systems, dynamic neural networks have been used as the predictor for model predictive control. The training of these networks, as well as the nonlinear observers required for their use, have been studied. Finally, our study has focused on improving neural network prediction in the presence of disturbances. The controller synthesis method presented in this work has been applied to obstacle avoidance by an autonomous vehicle.
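
The central recipe, a predictive-control cost augmented with barrier functions and minimized by a gradient-free method, can be sketched on a toy linear system (assuming scipy; the thesis's formulation, constraints and guarantees are more elaborate):

```python
# Predictive-control cost with a log-barrier on a state constraint,
# minimized by gradient-free Nelder-Mead.
import numpy as np
from scipy.optimize import minimize

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
x0, horizon, x_max = np.array([0.8, 0.0]), 10, 1.0

def cost(u_seq):
    x, J = x0.copy(), 0.0
    for u in u_seq:
        x = A @ x + (B * u).ravel()
        if x[0] >= x_max:              # barrier is infinite outside the set
            return 1e9
        J += x @ x + 0.01 * u**2 - 1e-3 * np.log(x_max - x[0])  # log-barrier
    return J

res = minimize(cost, np.zeros(horizon), method="Nelder-Mead")
print("optimal first input:", res.x[0].round(4))
```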
21

Duncan, Andrew Paul. "The analysis and application of artificial neural networks for early warning systems in hydrology and the environment." Thesis, University of Exeter, 2014. http://hdl.handle.net/10871/17569.

Abstract:
Artificial Neural Networks (ANNs) have been comprehensively researched, both from a computer scientific perspective and with regard to their use for predictive modelling in a wide variety of applications including hydrology and the environment. Yet their adoption for live, real-time systems remains, on the whole, sporadic and experimental. A plausible hypothesis is that this may be at least in part due to their treatment heretofore as "black boxes" that implicitly contain something that is unknown, or even unknowable. It is understandable that many of those responsible for delivering Early Warning Systems (EWS) might not wish to take the risk of implementing solutions perceived as containing unknown elements, despite the computational advantages that ANNs offer. This thesis therefore builds on existing efforts to open the box and develop tools and techniques that visualise, analyse and use ANN weights and biases, especially from the viewpoint of neural pathways from inputs to outputs of feedforward networks. In so doing, it aims to demonstrate novel approaches to self-improving predictive model construction for both regression and classification problems. This includes Neural Pathway Strength Feature Selection (NPSFS), which uses ensembles of ANNs trained on differing subsets of data and analysis of the learnt weights to infer degrees of relevance of the input features, and so build simplified models with reduced input feature sets. Case studies are carried out for prediction of flooding at multiple nodes in urban drainage networks located in three urban catchments in the UK, which demonstrate rapid, accurate prediction of flooding both for regression and classification. Predictive skill is shown to decline beyond the time of concentration of each sewer node when actual rainfall is used as input to the models. Further case studies model and predict statutory bacteria count exceedances for bathing water quality compliance at 5 beaches in Southwest England. An illustrative case study using a forest fires dataset from the UCI machine learning repository is also included. Results from these model ensembles generally exhibit improved performance when compared with single ANN models. Ensembles with reduced input feature sets, obtained using NPSFS, also demonstrate performance as good as or better than the full feature set models. Conclusions are drawn about a new set of tools and techniques, including NPSFS and visualisation techniques for inspection of ANN weights, the adoption of which it is hoped may lead to improved confidence in the use of ANNs for live, real-time EWS applications.
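
The pathway-strength idea behind NPSFS can be sketched for a single-hidden-layer network: score each input by the summed products of absolute weights along all of its input-to-output pathways. A sketch assuming scikit-learn (the thesis works with ensembles of such networks and its own scoring details):

```python
# Neural-pathway-strength feature ranking for a one-hidden-layer network.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       random_state=0)
net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000,
                   random_state=0).fit(X, y)

W1, W2 = net.coefs_         # input->hidden and hidden->output weight matrices
pathway_strength = np.abs(W1) @ np.abs(W2)   # shape: (n_features, n_outputs)
ranking = np.argsort(-pathway_strength.ravel())
print("features ranked by pathway strength:", ranking)
```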
22

Bellani, Carolina. "Predictive churn models in vehicle insurance." Master's thesis, 2019. http://hdl.handle.net/10362/90767.

Abstract:
Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
The goal of this project is to develop a predictive model to reduce customer churn for a company. In order to reduce churn, the model identifies customers who may be thinking of ending their patronage, and it also seeks to identify the reasons behind a customer's decision to leave, enabling the company to take appropriate countermeasures. The company in question is an insurance company in Portugal, Tranquilidade, and this project focuses in particular on their vehicle insurance products. Customer churn is modelled in relation to two insurance policies: the compulsory motor (third-party liability) policy and the optional Kasko (first-party liability) policy. The model uses information the company holds internally on its customers, as well as commercial, vehicle and policy details and external information (from the census). The first step of the analysis was data pre-processing, with data cleaning, transformation and reduction (especially for redundancy); in particular, concept hierarchy generation was performed for nominal data. As the percentage of churn is small compared with the active policies, the dataset is imbalanced. To resolve this, an under-sampling technique was used: to force the models to learn how to identify the churn cases, samples of the majority class were separated so as to balance with the minority class, and to prevent any loss of information, all the samples of the majority class were studied together with the minority class. The predictive models used are generalized linear models, random forests and artificial neural networks, and parameter tuning was also conducted. A further validation was performed on a recent new sample, without any data leakage. For the compulsory motor insurance, the recommended model is an artificial neural network with a first layer of 15 neurons and a second layer of 4 neurons, giving an AUC of 68.72%, a sensitivity of 33.14% and a precision of 27%. For the Kasko insurance, the suggested model is a random forest with 325 decision trees, giving an AUC of 72.58%, a sensitivity of 36.85% and a precision of 31.70%. The AUCs are aligned with other predictive churn model results; precision and sensitivity are worse than in telecommunication churn models, but comparable with other insurance churn predictions. The models not only produce a churn classification but also give some insight into the phenomenon, and therefore provide useful information the company can analyse to reduce its customer churn rate. However, there are some hidden factors that could not be accounted for with the information available, such as competitor activity and client interaction; if these could be integrated, a better prediction could be achieved.
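
The under-sampling scheme described above, keeping all minority samples while cycling through the majority class, can be sketched as a small balanced-bagging loop on synthetic data:

```python
# Under-sampling ensemble: every churn case plus balanced majority chunks.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=5000, weights=[0.95], random_state=0)
churn, active = np.where(y == 1)[0], np.where(y == 0)[0]
rng = np.random.default_rng(0)
rng.shuffle(active)

models = []
for chunk in np.array_split(active, len(active) // len(churn)):
    idx = np.concatenate([churn, chunk[:len(churn)]])   # balanced sample
    models.append(RandomForestClassifier(random_state=0).fit(X[idx], y[idx]))

# Average the churn probability across all balanced models.
p = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
print("flagged as churn:", int((p >= 0.5).sum()))
```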
23

Amaro, Miguel Mendes. "Credit scoring: comparison of non‐parametric techniques against logistic regression." Master's thesis, 2020. http://hdl.handle.net/10362/99692.

Abstract:
Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence
Over the past decades, financial institutions have given increasing importance to credit risk management as a critical tool for controlling their profitability. More than ever, it has become crucial for these institutions to discriminate well between good and bad clients, so as to accept only the credit applications that are not likely to default. To calculate the probability of default of a particular client, most financial institutions have credit scoring models based on parametric techniques. Logistic regression is the current industry-standard technique in credit scoring models, and it is one of the techniques under study in this dissertation. Although it is regarded as a robust and intuitive technique, it is not free from criticism of the model assumptions it makes, which can compromise its predictions. This dissertation evaluates the gains in performance from using more modern non-parametric techniques instead of logistic regression, performing a model comparison over four different real-life credit datasets. Specifically, the techniques compared against logistic regression in this study consist of two single classifiers (a decision tree and an SVM with RBF kernel) and two ensemble methods (random forest and stacking with cross-validation). The literature review shows that heterogeneous ensemble approaches have a weaker presence in credit scoring studies, and for that reason stacking with cross-validation was considered in this study. The results demonstrate that logistic regression outperforms the decision tree classifier, performs similarly to the SVM, and slightly underperforms both ensemble approaches to a similar extent.
24

Santos, Esdras Christo Moura dos. "Predictive modelling applied to propensity to buy personal accidents insurance products." Master's thesis, 2018. http://hdl.handle.net/10362/37698.

Abstract:
Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
Predictive models have become widely used in organizational scenarios with the increasing popularity of machine learning. They play a fundamental role in supporting customer acquisition in marketing campaigns. This report describes the development of a propensity-to-buy model for personal accident insurance products. The entire process, from business understanding to the deployment of the final model, is analysed with the objective of linking theory to practice.
26

Gau, Olivier. "Ensemble learning with GSGP." Master's thesis, 2020. http://hdl.handle.net/10362/93780.

Full text
Abstract:
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
The purpose of this thesis is to conduct comparative research between Genetic Programming (GP) and Geometric Semantic Genetic Programming (GSGP), with different initialization (RHH and EDDA) and selection (Tournament and Epsilon-Lexicase) strategies, in the context of a model-ensemble in order to solve regression optimization problems. A model-ensemble is a combination of base learners used in different ways to solve a problem. The most common ensemble is the mean, where the base learners are combined in a linear fashion, all having the same weights. However, more sophisticated ensembles can be inferred, providing higher generalization ability. GSGP is a variant of GP using different genetic operators. No previous research has been conducted to see if GSGP can perform better than GP in model-ensemble learning. The evolutionary process of GP and GSGP should allow us to learn about the strength of each of those base models to provide a more accurate and robust solution. The base-models used for this analysis were Linear Regression, Random Forest, Support Vector Machine and Multi-Layer Perceptron. This analysis has been conducted using 7 different optimization problems and 4 real-world datasets. The results obtained with GSGP are statistically significantly better than GP for most cases.
O objetivo desta tese é realizar pesquisas comparativas entre Programação Genética (GP) e Programação Genética Semântica Geométrica (GSGP), com diferentes estratégias de inicialização (RHH e EDDA) e seleção (Tournament e Epsilon-Lexicase), no contexto de um conjunto de modelos, a fim de resolver problemas de otimização de regressão. Um conjunto de modelos é uma combinação de alunos de base usados de diferentes maneiras para resolver um problema. O conjunto mais comum é a média, na qual os alunos da base são combinados de maneira linear, todos com os mesmos pesos. No entanto, conjuntos mais sofisticados podem ser inferidos, proporcionando maior capacidade de generalização. O GSGP é uma variante do GP usando diferentes operadores genéticos. Nenhuma pesquisa anterior foi realizada para verificar se o GSGP pode ter um desempenho melhor que o GP no aprendizado de modelos. O processo evolutivo do GP e GSGP deve permitir-nos aprender sobre a força de cada um desses modelos de base para fornecer uma solução mais precisa e robusta. Os modelos de base utilizados para esta análise foram: Regressão Linear, Floresta Aleatória, Máquina de Vetor de Suporte e Perceptron de Camadas Múltiplas. Essa análise foi realizada usando 7 problemas de otimização diferentes e 4 conjuntos de dados do mundo real. Os resultados obtidos com o GSGP são estatisticamente significativamente melhores que o GP na maioria dos casos.
APA, Harvard, Vancouver, ISO, and other styles
27

Nožička, Michal. "Ensemble learning metody pro vývoj skóringových modelů." Master's thesis, 2018. http://www.nusl.cz/ntk/nusl-382813.

Full text
Abstract:
Credit scoring is very important process in banking industry during which each potential or current client is assigned credit score that in certain way expresses client's probability of default, i.e. failing to meet his or her obligations on time or in full amount. This is a cornerstone of credit risk management in banking industry. Traditionally, statistical models (such as logistic regression model) are used for credit scoring in practice. Despite many advantages of such approach, recent research shows many alternatives that are in some ways superior to those traditional models. This master thesis is focused on introducing ensemble learning models (in particular constructed by using bagging, boosting and stacking algorithms) with various base models (in particular logistic regression, random forest, support vector machines and artificial neural network) as possible alternatives and challengers to traditional statistical models used for credit scoring and compares their advantages and disadvantages. Accuracy and predictive power of those scoring models is examined using standard measures of accuracy and predictive power in credit scoring field (in particular GINI coefficient and LIFT coefficient) on a real world dataset and obtained results are presented. The main result of this comparative study is that...
APA, Harvard, Vancouver, ISO, and other styles
28

Abreu, Mariana da Conceição Ferreira. "Modelos de Avaliação de Risco de Crédito: Aplicação de Machine Learning." Master's thesis, 2020. http://hdl.handle.net/10316/94723.

Full text
Abstract:
Trabalho de Projeto do Mestrado em Economia apresentado à Faculdade de Economia
Existem vários métodos que ao longo dos anos tem sido empregues na avaliação de risco de crédito, sobretudo, metodologias tradicionais como o Modelo de Análise Discriminante (ADi), Modelo Logit e Modelo Probit, e metodologias mais sofisticadas de Machine Learning, como Árvores de Classificação (AC), Random Forests (RF), Redes Neuronais (RN) e Support Vector Machines (SVM). Na revisão de literatura são apresentados alguns estudos que recorrem a metodologias tradicionais e a metodologias de Machine Learning. Estas últimas não só se apresentam teoricamente como são estudadas na prática para avaliar diferentes aplicações de risco de crédito, sendo aplicados a duas bases reais, disponíveis publicamente, uma referente ao cumprimento de pagamento de cartões de crédito em Taiwan e outra referente ao risco de crédito na Alemanha. Ambas as bases de dados incluem uma variável de resposta binária relativa ao risco de crédito. Em cada modelo experimentaram-se alguns meta-parâmetros, tendo a devida precaução na sua seleção, de forma a não repeti-los nas diferentes combinações do mesmo modelo e, consequentemente, de forma a evitar o overfitting.Este estudo efetua uma análise do desempenho dos modelos de Machine Learning individuais e também do desempenho de uma técnica de Ensemble baseada nos resultados obtidos pelos diferentes modelos, com intuito de determinar qual destes revela um melhor desempenho na avaliação de risco de crédito. A maioria dos resultados deste estudo empírico permitem concluir que os desempenhos da técnica de Ensemble são superiores aos dos modelos individuais. Também o modelo Random Forest realçou os melhores desempenhos de entre todos os modelos individuais.
There are several methods that over the years have been used in credit risk assessment, especially traditional methodologies such as the Discriminant Analysis Model (ADi), Logit Model and Probit Model, and more sophisticated Machine Learning methodologies, such as Classification Trees (AC), Random Forests (RF), Neural Networks (RN) and Support Vector Machines (SVM). In the literature review presents some studies that use traditional methodologies and Machine Learning methodologies. This last not only present themselves theoretically, but are studied in practice to evaluate different applications of credit risk, being applied to two real bases, publicly available, one referring to the fulfillment of credit card payments in Taiwan and the other referring to credit risk. in Germany. Both databases include a binary response variable for credit risk.In each model, some meta-parameters were experimented, taking due care in their selection, so as not to repeat them in the different combinations of the same model and, consequently, in order to avoid overfitting.This study performs an analysis of the performance of the individual Machine Learning models and also of the performance of an Ensemble technique based on the results obtained by the different models, in order to determine which one shows a better performance in the credit risk assessment. Most of the results of this empirical study allow us to conclude that the performances of the Ensemble technique are superior to those of the individual models. Also the Random Forest model highlighted the best performances among all individual models.
APA, Harvard, Vancouver, ISO, and other styles
29

Milioli, Heloisa Helena. "Breast cancer intrinsic subtypes: a critical conception in bioinformatics." Thesis, 2017. http://hdl.handle.net/1959.13/1350957.

Full text
Abstract:
Research Doctorate - Doctor of Philosophy (PhD)
Breast cancers have been uncovered by high-throughput technologies that allow the investigation at the genomic, transcriptomic and proteomic levels. In the early 2000s, the gene expression profiling has led to the classification of five intrinsic subtypes: luminal A, luminal B, HER2-enriched, normal like and basal-like. A decade later, the spectrum of copy number aberrations has further expanded the heterogeneous architecture of this disease with the identification of 10 integrative clusters (IntClusts). The referred classifications aim at explaining the diverse phenotypes and independent outcomes that impact clinical decision-making. However, intrinsic subtypes and IntClusts show limited overlap. In this context, novel methodologies in bioinformatics to analyse large-scale microarray data will contribute to further understanding the molecular subtypes. In this study, we focus on developing new approaches to cover multi-perspective, highly dimensional, and highly complex data analysis in breast cancer. Our goal is to review and reconcile the disease classification, underlying the differences across clinicopathological features and survival outcomes. For this purpose, we have explored the information processed by the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC); one of the largest of its type and depth, with over 2000 samples. A series of distinct approaches combining computer science, statistics, mathematics, and engineering have been applied in order to bring new insights to cancer biology. The translational strategy will facilitate a more efficient and effective incorporation of bioinformatics research into laboratory assays. Further applications of this knowledge are, therefore, critical in order to support novel implementations in the clinical setting; paving the way for future progress in medicine.
APA, Harvard, Vancouver, ISO, and other styles
30

Huang, Hong-Zhou, and 黃弘州. "Nonintrusive Appliance Recognition Algorithm based on Ensemble Learning Model." Thesis, 2015. http://ndltd.ncl.edu.tw/handle/3ftb3d.

Full text
Abstract:
碩士
國立中興大學
資訊科學與工程學系
103
In this paper, a non-intrusive appliance load monitoring (NILM) scheme based on the Adaboot ensemble algorithm for cheaper and low frequency meter is developed. In order to apply the NILM scheme we need to extract features for appliances. However, it is a challenging task if we want to know the states for each appliance at home just from information of single point aggregate power meter. In literature, it is usually done by applying high frequency meter to extract high frequency feature, e.g., harmonics and electromagnetic interference, to make recognition accuracy better. However, the hardware of high frequency were costly. For typical family, expensive devices would make the NILM impractical and infeasible. To develop a NILM that can be applied on a cheaper and low frequency meter, low frequency features should be used. In addition, these low frequency features should satisfy the additivity property in order to be used in our learning model. The Adaboost ensemble learning model is then used as the recognition algorithm in our work. Multiple features and multiple recognition algorithms are used to get initial recognition results. These results are used as the training data, and adopted the Adaboost ensemble learning model. In this model, the recognition result was decided from multiple features and multiple different weight recognition algorithms. Adaboost ensemble learning algorithm could solve the problem of similar number of votes in typical ensemble learning model. Results show that the proposed Adaboost ensemble learning model could enhance the recognition accuracy.
APA, Harvard, Vancouver, ISO, and other styles
31

Chi, Tsai-Lin, and 季彩琳. "Using Ensemble Learning Model for Classifying Fiduciary Purchasing Behavior." Thesis, 2014. http://ndltd.ncl.edu.tw/handle/49166414836032969056.

Full text
Abstract:
碩士
輔仁大學
企業管理學系管理學碩士班
102
In financial industry, the environment is getting much harsher than ever before. To be outstanding in financial sector, bankers have been tried to satisfy the need of customers as they can whether in service quality or in service area. Considering the characteristics of the consumer credit loans which are less risky and thriving, bankers are keen to sell consumer credit loans. The objective of the proposed study is to explore the performance of classification model for classifying fiduciary purchasing behavior using ensemble learning techniques. This study proposed a hybrid Logistic regression, Discriminant analysis, Extreme learning machine, Support vector machine and Artificial neural networks method to upgrade the performance of classification model comparing to single learner above. To demonstrate the effectiveness of the ensemble learning approach, classification tasks are performed on one consumer dataset of credit loans telemarketing. 15 variables are adopted in this study. The result shows that the proposed approach is better than other five single classification models.
APA, Harvard, Vancouver, ISO, and other styles
32

SYU, HUAN-YU, and 許桓瑜. "Prediction Model of Narcolepsy Based on Ensemble Learning Approach." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/2d9ud3.

Full text
Abstract:
碩士
國立臺北護理健康大學
資訊管理研究所
106
The advent of the era of precision medicine shows that the diagnosis of diseases tends to be personalized and customized. Nowadays, the combination of medicine and information is the trend of the times, and Narcolepsy is a kind of Hypersomnia. Patients often have symptoms such as excessive daytime sleepiness, cataplexy, hypnagogic hallucination, Narcolepsy must be diagnosed by multiple tests of sleep, multi-stage sleep test, etc. Most of the studies related to narcolepsy use only partial or specific tests. In this study, about ten kinds of measurement and questionnaire data related to narcolepsy were collected, and build a classifier based on ensemble learning to classify the narcolepsy type I and narcolepsy type II. All kind of dataset will be training and selecting parameters by five kinds of classifiers, such as support vector machine, decision tree, neural network, nearest neighbor method, and naive Bayes, and training the classifiers of individual datasets with the best model parameters, and integrating the individual classifiers based on ensemble learning and establishing a hybrid model, the accuracy of the individual classifier is about 57.38%~71.64%, and the accuracy of the hybrid model is 80.88%. The result shows that the model based on ensemble learning is better than individual classifier. In the process of construction, the feature importance and reference rules of each data set are also mined through the decision tree. For example, we can use some parameters in the PSG and MSLT or the hallucination in the Comorbidity to further classify the narcolepsy category. These reference rules are available as a reference for future clinical diagnosis of narcolepsy. In clinical practice, it is also possible to prioritize tests with high discrimination in the model. For example, in addition to the necessary MSLT and PSG tests, PET and other tests can be prioritized. The above period can shorten the clinical diagnosis process.
APA, Harvard, Vancouver, ISO, and other styles
33

CHIU, YI-HAN, and 邱奕涵. "Using Ensemble Learning to Build the Sales Forecast Model of Baking Industry." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/57v4jt.

Full text
Abstract:
碩士
國立高雄第一科技大學
行銷與流通管理系碩士班
106
Recently, the value of bakery industry output is being on the rise in Taiwan. Nowadays, there is a growing focus on healthy diet. Given this, the research use sales data which is from healthy bakeries to build the sales forecast model of baking industry. Besides, the research tries to use data visualization and feature selection to examine each bakery. Eventually, through Ensemble Learning to build the better forecast model of baking industry. The result demonstrate that technique XGBoost is better than other model. In addition, the result would like to help manager to control product and allocate human resources. Eventually, the sales forecast can assist company to set short-term or long-term objectives to make operating plan.
APA, Harvard, Vancouver, ISO, and other styles
34

Huang, Yong-Jhih, and 黃雍智. "Applying Deep Learning and Ensemble Learning to Construct Spectrum and Cepstrum of Filtered Phonocardiogram Prediction Model." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/535usy.

Full text
Abstract:
碩士
國立中興大學
資訊管理學系所
106
Coronary artery disease is a common chronic disease, as known as ischemic heart disease, which is cardiac dysfunction caused by insufficient blood supply to the heart and kills countless people every year in the world. In recent years, coronary artery disease ranks first in the world’s top ten cause of death. Until now, cardiac auscultation is still an important examination for diagnosing heart diseases. Many heart diseases can be diagnosed effectively by auscultation. However, cardiac auscultation relies on the subjective experience of physicians. In order to provide objective diagnostic and assist physicians in the diagnosis of heart sounds in clinic, this study uses phonocardiograms to build an automatic classification model. This study proposes an automatic classification approach for phonocardiograms using deep learning and ensemble learning with filters. The steps of approach are as follows:First, Savitzky-Golay and Butterworth filters are used to filter the phonocardiograms. Second, phonocardiograms are converted into spectrograms and cepstrums using methods such as short-time Fourier transform and discrete cosine transform. Third: Training convolutional neural networks to build classification models for phonocardiogram. Fourth: Use two ensemble strategies to build ensemble models. Lastly: Balance the quantity of positive and negative samples to increase the sensitivity of the model. The experimental results show that the proposed method is very competitive, which show that the performance of phonocardiogram classification model in the hold out testing is 86.04% MAcc (86.46% sensitivity, 85.63% specificity), and in the 10-fold cross validation is 89.81% MAcc(91.73% sensitivity, 87.91% specificity).
APA, Harvard, Vancouver, ISO, and other styles
35

Zheng, Yu-Xuan, and 鄭宇軒. "Sleep Apnea Detection Algorithm using EEG and Oximetry based on Ensemble Learning Model." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/66291000965483738888.

Full text
Abstract:
碩士
國立中興大學
資訊科學與工程學系
105
The gold standard for diagnosis of sleep apnea is a formal sleep study established by the polysomnography(PSG). However the high cost and the complex steps of PSG makes a diagnosis of sleep apnea become evenmore difficult. Not to mention the shortage of devices and medical human resources. In this thesis, we propose a sleep apnea detection algorithm based on ensemble machine learning model. By using only Electroencephalography(EEG) and Oximetry, we can significantly reduce the difficulty of diagnosis and the effort of medical persons. The experimental results show that the performance of our approach is comparable to other past works.
APA, Harvard, Vancouver, ISO, and other styles
36

Cheng, Lu-Wen, and 程路文. "A prediction model of air pollution and Respiratory Diseases based on Ensemble learning." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/23mv9a.

Full text
Abstract:
碩士
元智大學
資訊工程學系
106
The study aimed to determine whether there is an association between air pollutants levels and outpatient clinic visits with chronic obstructive pulmonary disease (COPD) in Taiwan. Data of air pollutant concentrations (PM2.5、PM10、SO2、NO2、CO、O3) were collected from air monitoring stations. We use a case-crossover study design and conditional logistic regression models with odds ratios (OR) and 95% confidence intervals(CI) for evaluating the associations between the air pollutant factor and COPD-associated OC visits. Analyses show the PM2.5, PM10, CO, NO2, SO2 had significant effects on COPD-associated OC visits. In colder days, a significantly greater effect on COPD-associated OC visits O3 had greater lag effects (the lag was 1, 2,4,5 days) on COPD-associated OC visits. Controlling ambient air pollution would provide benefits to COPD patients. In this study, We used XGBoost algorithm to build a prediction model of air pollution and hospital readmission for Chrome Obstructive Pulmonary Disease. Compared with Random Forest, Neural Network, C5.0, AdaBoost and SVM, it was found that the model based on the integrated learning method XGBoost algorithm produces a higher classification of this problem result.
APA, Harvard, Vancouver, ISO, and other styles
37

Chang, Hsueh-Wei, and 張學瑋. "Nonintrusive Appliance Recognition Algorithm based on Ensemble Learning Model integrating with Dynamic Time Warping." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/35192290829923038200.

Full text
Abstract:
碩士
國立中興大學
資訊科學與工程學系
104
According to the research, if we can provide immediate and fine-grained power information to users, a significant reduction in the energy wastage can be achieved. Non-Intrusive Appliance Load Monitoring is an approach to reach the goal, which is more practical and feasible for typical families. In previous studies, we can discover that there were some disadvantages. First, it usually used high frequency sensor to acquire information, which made the cost of hardware higher. Second, most studies focused on the high consumption or on/off type appliances. As a result, low consumption appliances, multi-state appliances and continuously variable appliances were ignored. In this paper, we proposed a low cost and real-time approach. We use two-step detection in training phase and cluster detection in testing phase to confirm an event. Besides, we use a clustering algorithm-ISODATA to find an appropriate number of state for each appliance in the training set after feature extraction. Finally, we succeed to build the ensemble learning model integrating with dynamic time warping (DTW) model to identify appliances. Experimental results implies that two-step detection and cluster detection method can avoid excessive unknown appliance events, which can improve the accuracy of event detection. In addition, we can solve the problem of tie vote by using ensemble learning model integrating with DTW predictive model, which results in better recognition accuracy than using a single predictive model.
APA, Harvard, Vancouver, ISO, and other styles
38

Silvestre, Martinho de Matos. "Three-stage ensemble model : reinforce predictive capacity without compromising interpretability." Master's thesis, 2019. http://hdl.handle.net/10362/71588.

Full text
Abstract:
Thesis proposal presented as partial requirement for obtaining the Master’s degree in Statistics and Information Management, with specialization in Risk Analysis and Management
Over the last decade, several banks have developed models to quantify credit risk. In addition to the monitoring of the credit portfolio, these models also help deciding the acceptance of new contracts, assess customers profitability and define pricing strategy. The objective of this paper is to improve the approach in credit risk modeling, namely in scoring models to predict default events. To this end, we propose the development of a three-stage ensemble model that combines the results interpretability of the Scorecard with the predictive power of machine learning algorithms. The results show that ROC index improves 0.5%-0.7% and Accuracy 0%-1% considering the Scorecard as baseline.
APA, Harvard, Vancouver, ISO, and other styles
39

Chen, Chien-Jen, and 陳建仁. "Combining Hidden Markov Model with Ensemble Learning to Predict Hidden States and Conduct Stochastic Simulation." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/mhh87z.

Full text
Abstract:
碩士
國立交通大學
工業工程與管理系所
106
Taiwan’s semiconductor industry, optoelectronics industry, computers and peripheral equipment industry play an important role in the world. Additionally, the rapid development of Artificial Intelligence (AI) and Internet of Things (IoT) have also driven the growth of these industries. Although the overall industry is growing up, there is a significant gap between the firms within the industry. Therefore, this study focuses on those companies which revenues go up and down. First, Hidden Markov Model (HMM) is used to explore the company’s hidden states. Without loss of generality, three hidden states, such as healthy, risky, and sick are used in this thesis. In particular, the hidden states are linked into measurable variables, namely, NPBT (net profit before tax), EPS (earning per share), and ROE (return on equity). In addition, 19 representative independent variables used to predict hidden states and conduct stochastic simulation. This study use ensemble learning to identify the key performance indicators (KPIs) of hidden states and then uses Bayesian Belief Network (BBN) to conduct stochastic simulations. Based on the presented framework, the impact of the abovementioned KPI on the hidden state and NPBT can be quantitatively measured. Finally, management implications are provided to improve the company’s operational efficiency.
APA, Harvard, Vancouver, ISO, and other styles
40

Hong, Zih-Siang, and 洪梓翔. "Using Ensemble Learning and Deep Recurrent Neural Network to Construct an Internet Forum Conversation Prediction Model." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/s67dep.

Full text
Abstract:
碩士
中原大學
資訊管理研究所
106
The study on natural language dialogue or conversation involves language understanding, reasoning, and basic common sense, therefore it is one of the most challenging artificial intelligence issues. To design a common and general conversation model is even more complicated and difficult. In the past, the studies on natural language processing and dialogue mainly focused on the rule-based and machine learning-based methods. Although these methods can solve part of the dialogue problems in the specific fields, but they have their own learning bottlenecks. Until recurrent neural networks (RNN) and sequence to sequence model is proposed, the research in this field has been further breakthrough. However, although deep learning can automatically extract the features of a large number of dialogue data, it has high requirements on the quantity and quality of data sets, and has the overfitting problem. Therefore, how to extract the useful features from the limited training dataset, and achieve model generalization ability in different situations, is the challenge of deep learning in the natural language dialogue problem. This project is titled “Conversation Model using Deep Recurrent Neural Networks with Ensemble Learning”. The advantage of the ensemble learning is that it enhances the generalization ability of the model to reinforce the prediction, and make the model suitable for the prediction of various contexts and scenarios. In this study, ensemble learning will be applied to the natural language dialogue and conversation model in various and complex contexts and scenarios. This method is a deep neural network conversation model, using the ensemble learning method to train the sub-prediction model of multiple different types, different parameters, and different training data sets. Then to obtain the prediction results by the specific designed ensemble strategy. Through a number of sub-models jointly predicted and judged to get a generalized conversation prediction model.
APA, Harvard, Vancouver, ISO, and other styles
41

Wu, Hsuan, and 吳亘. "Constructing a Risk Assessment Model for Small and Medium Enterprises by Ensemble Learning with Macroeconomic Indices." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/4573r2.

Full text
Abstract:
碩士
國立交通大學
工業工程與管理系所
107
Due to the high connection of the global financial system, the international financial crisis may have a significant influence on the domestic economy and increase the number of non-performing loans from financial institutions. As a result, many financial institutions have begun to construct an objective and fair risk assessment model. However, most financial institutions only take internal information about borrowing SMEs into account when constructing the model. Therefore, considering the macroeconomic environment may affect the risk of default, this thesis selects macroeconomic indices through Pearson Correlation Analysis and Principal Component Analysis to become new variables. On the other hand, the two-stage ensemble learning method, which integrates three classifiers (Logistic Regression, Support Vector Machine, and Gradient Boosting Decision Tree) is applied to construct the model in the thesis. A financial institution in Taiwan provides the actual SMEs loan data as the verification data. According to the result, the risk assessment model proposed in this thesis outperforms other common single-stage classifier models. Furthermore, adding the macroeconomic indices in the model is also proved to enhance the prediction performance.
APA, Harvard, Vancouver, ISO, and other styles
42

Siedel, Georg. "Evaluation von Machine-Learning-Modellen und Konzeptionierung eines Modell-Ensembles für die Vorhersage von Unfalldaten." 2020. https://tud.qucosa.de/id/qucosa%3A73972.

Full text
Abstract:
In dieser Arbeit wird mittels verschiedener Methoden die Datenfusion von Unfallszenarien untersucht. Ausgangspunkt sind zwei Datensätze aus der Datenbank der polizeilichen Unfallstatistik. Im Empfängerdatensatz wird das spezifische Attribut „Unfalltyp“ entfernt, welches mithilfe des Spenderdatensatzes ergänzt werden soll. Ziel ist das Erstellen einer einheitlichen Datenbasis, deren Qualität mittels geeigneter ausgewählter Metriken bewertet wird. Als Methode der Datenfusion wird zum einen das Distance-Hot-Deck-Verfahren verwendet. Zum anderen werden vier aussichtsreiche Machine Learning Verfahren auf Basis einer systematischen Literaturrecherche ausgewählt und zur Vorhersage des spezifischen Attributes angewandt. Um die jeweiligen Vorteile bezüglich der Verteilung und Trefferrate des vorhergesagten Attributes ausnutzen zu können, werden Kombinationsvarianten (Ensembling) beider Methoden entwickelt. Es werden Erkenntnisse gewonnen, welche Verfahren die höchste Qualität des fusionierten Datensatzes erreichen.:1. Einleitung 2. Grundlagen der Datenfusion 3. Randbedingungen 4. Vorgehensweise 5. Ergebnisse 6. Diskussion und Ausblick
APA, Harvard, Vancouver, ISO, and other styles
43

Frazão, Xavier Marques. "Deep learning model combination and regularization using convolutional neural networks." Master's thesis, 2014. http://hdl.handle.net/10400.6/5605.

Full text
Abstract:
Convolutional neural networks (CNNs) were inspired by biology. They are hierarchical neural networks whose convolutional layers alternate with subsampling layers, reminiscent of simple and complex cells in the primary visual cortex [Fuk86a]. In the last years, CNNs have emerged as a powerful machine learning model and achieved the best results in many object recognition benchmarks [ZF13, HSK+12, LCY14, CMMS12]. In this dissertation, we introduce two new proposals for convolutional neural networks. The first, is a method to combine the output probabilities of CNNs which we call Weighted Convolutional Neural Network Ensemble. Each network has an associated weight that makes networks with better performance have a greater influence at the time to classify a pattern when compared to networks that performed worse. This new approach produces better results than the common method that combines the networks doing just the average of the output probabilities to make the predictions. The second, which we call DropAll, is a generalization of two well-known methods for regularization of fully-connected layers within convolutional neural networks, DropOut [HSK+12] and DropConnect [WZZ+13]. Applying these methods amounts to sub-sampling a neural network by dropping units. When training with DropOut, a randomly selected subset of the output layer’s activations are dropped, when training with DropConnect we drop a randomly subsets of weights. With DropAll we can perform both methods simultaneously. We show the validity of our proposals by improving the classification error on a common image classification benchmark.
APA, Harvard, Vancouver, ISO, and other styles
44

Ashofteh, Afshin. "Data Science for Finance: Targeted Learning from (Big) Data to Economic Stability and Financial Risk Management." Doctoral thesis, 2022. http://hdl.handle.net/10362/135620.

Full text
Abstract:
A thesis submitted in partial fulfillment of the requirements for the degree of Doctor in Information Management, specialization in Statistics and Econometrics
The modelling, measurement, and management of systemic financial stability remains a critical issue in most countries. Policymakers, regulators, and managers depend on complex models for financial stability and risk management. The models are compelled to be robust, realistic, and consistent with all relevant available data. This requires great data disclosure, which is deemed to have the highest quality standards. However, stressed situations, financial crises, and pandemics are the source of many new risks with new requirements such as new data sources and different models. This dissertation aims to show the data quality challenges of high-risk situations such as pandemics or economic crisis and it try to theorize the new machine learning models for predictive and longitudes time series models. In the first study (Chapter Two) we analyzed and compared the quality of official datasets available for COVID-19 as a best practice for a recent high-risk situation with dramatic effects on financial stability. We used comparative statistical analysis to evaluate the accuracy of data collection by a national (Chinese Center for Disease Control and Prevention) and two international (World Health Organization; European Centre for Disease Prevention and Control) organizations based on the value of systematic measurement errors. We combined excel files, text mining techniques, and manual data entries to extract the COVID-19 data from official reports and to generate an accurate profile for comparisons. The findings show noticeable and increasing measurement errors in the three datasets as the pandemic outbreak expanded and more countries contributed data for the official repositories, raising data comparability concerns and pointing to the need for better coordination and harmonized statistical methods. The study offers a COVID-19 combined dataset and dashboard with minimum systematic measurement errors and valuable insights into the potential problems in using databanks without carefully examining the metadata and additional documentation that describe the overall context of data. In the second study (Chapter Three) we discussed credit risk as the most significant source of risk in banking as one of the most important sectors of financial institutions. We proposed a new machine learning approach for online credit scoring which is enough conservative and robust for unstable and high-risk situations. This Chapter is aimed at the case of credit scoring in risk management and presents a novel method to be used for the default prediction of high-risk branches or customers. This study uses the Kruskal-Wallis non-parametric statistic to form a conservative credit-scoring model and to study its impact on modeling performance on the benefit of the credit provider. The findings show that the new credit scoring methodology represents a reasonable coefficient of determination and a very low false-negative rate. It is computationally less expensive with high accuracy with around 18% improvement in Recall/Sensitivity. Because of the recent perspective of continued credit/behavior scoring, our study suggests using this credit score for non-traditional data sources for online loan providers to allow them to study and reveal changes in client behavior over time and choose the reliable unbanked customers, based on their application data. 
This is the first study that develops an online non-parametric credit scoring system, which can reselect effective features automatically for continued credit evaluation and weigh them out by their level of contribution with a good diagnostic ability. In the third study (Chapter Four) we focus on the financial stability challenges faced by insurance companies and pension schemes when managing systematic (undiversifiable) mortality and longevity risk. For this purpose, we first developed a new ensemble learning strategy for panel time-series forecasting and studied its applications to tracking respiratory disease excess mortality during the COVID-19 pandemic. The layered learning approach is a solution related to ensemble learning to address a given predictive task by different predictive models when direct mapping from inputs to outputs is not accurate. We adopt a layered learning approach to an ensemble learning strategy to solve the predictive tasks with improved predictive performance and take advantage of multiple learning processes into an ensemble model. In this proposed strategy, the appropriate holdout for each model is specified individually. Additionally, the models in the ensemble are selected by a proposed selection approach to be combined dynamically based on their predictive performance. It provides a high-performance ensemble model to automatically cope with the different kinds of time series for each panel member. For the experimental section, we studied more than twelve thousand observations in a portfolio of 61-time series (countries) of reported respiratory disease deaths with monthly sampling frequency to show the amount of improvement in predictive performance. We then compare each country’s forecasts of respiratory disease deaths generated by our model with the corresponding COVID-19 deaths in 2020. The results of this large set of experiments show that the accuracy of the ensemble model is improved noticeably by using different holdouts for different contributed time series methods based on the proposed model selection method. These improved time series models provide us proper forecasting of respiratory disease deaths for each country, exhibiting high correlation (0.94) with Covid-19 deaths in 2020. In the fourth study (Chapter Five) we used the new ensemble learning approach for time series modeling, discussed in the previous Chapter, accompany by K-means clustering for forecasting life tables in COVID-19 times. Stochastic mortality modeling plays a critical role in public pension design, population and public health projections, and in the design, pricing, and risk management of life insurance contracts and longevity-linked securities. There is no general method to forecast the mortality rate applicable to all situations especially for unusual years such as the COVID-19 pandemic. In this Chapter, we investigate the feasibility of using an ensemble of traditional and machine learning time series methods to empower forecasts of age-specific mortality rates for groups of countries that share common longevity trends. We use Generalized Age-Period-Cohort stochastic mortality models to capture age and period effects, apply K-means clustering to time series to group countries following common longevity trends, and use ensemble learning to forecast life expectancy and annuity prices by age and sex. To calibrate models, we use data for 14 European countries from 1960 to 2018. 
The results show that the ensemble method presents the best robust results overall with minimum RMSE in the presence of structural changes in the shape of time series at the time of COVID-19. In this dissertation’s conclusions (Chapter Six), we provide more detailed insights about the overall contributions of this dissertation on the financial stability and risk management by data science, opportunities, limitations, and avenues for future research about the application of data science in finance and economy.
APA, Harvard, Vancouver, ISO, and other styles
We offer discounts on all premium plans for authors whose works are included in thematic literature selections. Contact us to get a unique promo code!

To the bibliography