Dissertations / Theses on the topic 'Random selection'

To see the other types of publications on this topic, follow the link: Random selection.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the top 50 dissertations / theses for your research on the topic 'Random selection.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Tyrrell, Simon. "Random and rational methods for compound selection." Thesis, University of Sheffield, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.370002.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Stringer, Harold. "BEHAVIOR OF VARIABLE-LENGTH GENETIC ALGORITHMS UNDER RANDOM SELECTION." Master's thesis, University of Central Florida, 2007. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/2657.

Full text
Abstract:
In this work, we show how a variable-length genetic algorithm naturally evolves populations whose mean chromosome length grows shorter over time. A reduction in chromosome length occurs when selection is absent from the GA. Specifically, we divide the mating space into five distinct areas and provide a probabilistic and empirical analysis of the ability of matings in each area to produce children whose size is shorter than the parent generation's average size. Diversity of size within a GA's population is shown to be a necessary condition for a reduction in mean chromosome length to take place. We show how a finite variable-length GA under random selection pressure uses 1) diversity of size within the population, 2) over-production of shorter than average individuals, and 3) the imperfect nature of random sampling during selection to naturally reduce the average size of individuals within a population from one generation to the next. In addition to our findings, this work provides GA researchers and practitioners with 1) a number of mathematical tools for analyzing possible size reductions for various matings and 2) new ideas to explore in the area of bloat control.
M.S.
School of Electrical Engineering and Computer Science
Engineering and Computer Science
Computer Science MS
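The length-drift mechanism this abstract describes is easy to simulate. The sketch below is our own illustration, not Stringer's code: it assumes bit-string chromosomes, one-point cut-and-splice crossover, and uniform random parent selection. Note that each individual mating conserves total length, so any drift in mean length comes from the size diversity and sampling effects the abstract analyses.

```python
import random

def cut_and_splice(p1, p2, rng):
    # One-point cut-and-splice crossover for variable-length chromosomes:
    # each parent gets its own cut point, so children generally differ in
    # length from their parents, although the mating conserves total length.
    c1 = rng.randrange(1, len(p1)) if len(p1) > 1 else 1
    c2 = rng.randrange(1, len(p2)) if len(p2) > 1 else 1
    return p1[:c1] + p2[c2:], p2[:c2] + p1[c1:]

def random_selection_generation(pop, rng):
    # One generation under pure random selection (no fitness pressure):
    # parents are drawn uniformly and children replace the population.
    nxt = []
    while len(nxt) < len(pop):
        a, b = cut_and_splice(rng.choice(pop), rng.choice(pop), rng)
        nxt.extend([a, b])
    return nxt[:len(pop)]

rng = random.Random(42)
pop = [[rng.randrange(2) for _ in range(rng.randrange(2, 12))]
       for _ in range(40)]
for gen in range(10):
    pop = random_selection_generation(pop, rng)
mean_len = sum(len(ind) for ind in pop) / len(pop)
```

Tracking `mean_len` over many generations and seeds is one way to observe the tendency the thesis analyses formally.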
APA, Harvard, Vancouver, ISO, and other styles
3

Choukri, Sam. "Selection of malaria-specific epitopes from random peptide libraries." Free to MU campus, to others for purchase, 1999. http://wwwlib.umi.com/cr/mo/fullcit?p9962513.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Frondana, Iara Moreira. "Model selection for discrete Markov random fields on graphs." Universidade de São Paulo, 2016. http://www.teses.usp.br/teses/disponiveis/45/45133/tde-02022018-151123/.

Full text
Abstract:
In this thesis we propose to use a penalized maximum conditional likelihood criterion to estimate the graph of a general discrete Markov random field. We prove the almost sure convergence of the estimator of the graph in the case of a finite or countably infinite set of variables. Our method requires minimal assumptions on the probability distribution and, contrary to other approaches in the literature, the usual positivity condition is not needed. We present several examples with a finite set of vertices and study the performance of the estimator on data simulated from these examples. We also introduce an empirical procedure based on k-fold cross validation to select the best value of the constant in the estimator's definition, and show the application of this method on two real datasets.
APA, Harvard, Vancouver, ISO, and other styles
5

Ushan, Wardah. "Portfolio selection using Random Matrix theory and L-Moments." Master's thesis, University of Cape Town, 2015. http://hdl.handle.net/11427/16921.

Full text
Abstract:
Markowitz's (1952) seminal work on Modern Portfolio Theory (MPT) describes a methodology to construct an optimal portfolio of risky stocks. The constructed portfolio is based on a trade-off between risk and reward, and will depend on the risk-return preferences of the investor. Implementation of MPT requires estimation of the expected returns and variances of each of the stocks, and the associated covariances between them. Historically, the sample mean vector and variance-covariance matrix have been used for this purpose. However, estimation errors result in the optimised portfolios performing poorly out-of-sample. This dissertation considers two approaches to obtaining a more robust estimate of the variance-covariance matrix. The first is Random Matrix Theory (RMT), which compares the eigenvalues of an empirical correlation matrix to those generated from a correlation matrix of purely random returns. Eigenvalues of the random correlation matrix follow the Marcenko-Pastur density, and lie within an upper and lower bound. This range is referred to as the "noise band". Eigenvalues of the empirical correlation matrix falling within the "noise band" are considered to provide no useful information. Thus, RMT proposes that they be filtered out to obtain a cleaned, robust estimate of the correlation and covariance matrices. The second approach uses L-moments, rather than conventional sample moments, to estimate the covariance and correlation matrices. L-moment estimates are more robust to outliers than conventional sample moments, particularly when sample sizes are small. We use L-moments in conjunction with Random Matrix Theory to construct the minimum variance portfolio. In particular, we consider four strategies corresponding to four different estimates of the covariance matrix: the L-moments estimate and the sample moments estimate, each with and without the incorporation of RMT. We then analyse each of these strategies in terms of its risk-return characteristics and its diversification.
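The noise-band filtering step described in this abstract can be sketched in a few lines. The code below is a hedged illustration, not the dissertation's implementation: it assumes i.i.d. returns for the Marcenko-Pastur bounds and uses one common cleaning recipe, treating eigenvalues at or below the upper Marcenko-Pastur edge as noise and replacing them by their average so the trace is preserved.

```python
import numpy as np

def marchenko_pastur_bounds(n_assets, n_obs):
    # Lower/upper eigenvalue bounds of the Marcenko-Pastur density for a
    # correlation matrix of purely random returns (the "noise band").
    q = n_assets / n_obs
    return (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2

def rmt_clean_correlation(returns):
    # Filter eigenvalues inside the noise band and rebuild a cleaned
    # correlation matrix with unit diagonal.
    n_obs, n_assets = returns.shape
    corr = np.corrcoef(returns, rowvar=False)
    eigval, eigvec = np.linalg.eigh(corr)
    _, lam_plus = marchenko_pastur_bounds(n_assets, n_obs)
    noisy = eigval < lam_plus
    if noisy.any():
        eigval = eigval.copy()
        eigval[noisy] = eigval[noisy].mean()   # preserve total variance
    cleaned = eigvec @ np.diag(eigval) @ eigvec.T
    d = np.sqrt(np.diag(cleaned))
    return cleaned / np.outer(d, d)            # renormalise the diagonal

rng = np.random.default_rng(0)
returns = rng.standard_normal((500, 50))       # 500 days, 50 stocks
cleaned = rmt_clean_correlation(returns)
```

The cleaned matrix can then be fed into a minimum-variance optimiser in place of the raw sample estimate.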
APA, Harvard, Vancouver, ISO, and other styles
6

Wonkye, Yaa Tawiah. "Innovations of random forests for longitudinal data." Bowling Green State University / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=bgsu1563054152739397.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Tran, The Truyen. "On conditional random fields: applications, feature selection, parameter estimation and hierarchical modelling." Curtin University of Technology, Dept. of Computing, 2008. http://espace.library.curtin.edu.au:80/R/?func=dbin-jump-full&object_id=18614.

Full text
Abstract:
There has been a growing interest in stochastic modelling and learning with complex data, whose elements are structured and interdependent. One of the most successful methods to model data dependencies is graphical models, which combine graph theory and probability theory. This thesis focuses on a special type of graphical model known as Conditional Random Fields (CRFs) (Lafferty et al., 2001), in which the output state spaces, when conditioned on some observational input data, are represented by undirected graphical models. The contributions of this thesis involve both (a) broadening the current applicability of CRFs in the real world and (b) deepening the understanding of theoretical aspects of CRFs. On the application side, we empirically investigate the applications of CRFs in two real world settings. The first application is on a novel domain of Vietnamese accent restoration, in which we need to restore the accents of an accent-less Vietnamese sentence. Experiments on half a million sentences of news articles show that the CRF-based approach is highly accurate. In the second application, we develop a new CRF-based movie recommendation system called Preference Network (PN). The PN jointly integrates various sources of domain knowledge into a large and densely connected Markov network. We obtained competitive results against well-established methods in the recommendation field.
On the theory side, the thesis addresses three important theoretical issues of CRFs: feature selection, parameter estimation and modelling recursive sequential data. These issues are all addressed under a general setting of partial supervision, in which training labels are not fully available. For feature selection, we introduce a novel learning algorithm called AdaBoost.CRF that incrementally selects features out of a large feature pool as learning proceeds. AdaBoost.CRF is an extension of the standard boosting methodology to structured and partially observed data. We demonstrate that AdaBoost.CRF is able to eliminate irrelevant features and, as a result, returns a very compact feature set without significant loss of accuracy. Parameter estimation of CRFs is generally intractable in arbitrary network structures. This thesis contributes to this area by proposing a learning method called AdaBoost.MRF (which stands for AdaBoosted Markov Random Forests). As learning proceeds, AdaBoost.MRF incrementally builds a tree ensemble (a forest) that covers the original network by selecting one spanning tree at a time. As a result, we can approximately learn many rich classes of CRFs in linear time. The third theoretical work is on modelling recursive, sequential data, in which each level of resolution is a Markov sequence and each state in the sequence is itself a Markov sequence at the finer grain. One of the key contributions of this thesis is Hierarchical Conditional Random Fields (HCRF), which is an extension to the currently popular sequential CRF and the recent semi-Markov CRF (Sarawagi and Cohen, 2004). Unlike previous CRF work, the HCRF does not assume any fixed graphical structures.
Rather, it treats structure as an uncertain aspect and can estimate the structure automatically from the data. The HCRF is motivated by the Hierarchical Hidden Markov Model (HHMM) (Fine et al., 1998). Importantly, the thesis shows that the HHMM is a special case of the HCRF with slight modification, and the semi-Markov CRF is essentially a flat version of the HCRF. Central to our contribution in HCRF is a polynomial-time algorithm based on the Asymmetric Inside Outside (AIO) family developed by Bui et al. (2004) for learning and inference. Another important contribution is to extend the AIO family to address learning with missing data and inference under partially observed labels. We also derive methods to deal with practical concerns associated with the AIO family, including numerical overflow and cubic-time complexity. Finally, we demonstrate good performance of HCRF against rivals on two applications: indoor video surveillance and noun-phrase chunking.
APA, Harvard, Vancouver, ISO, and other styles
8

Linusson, Henrik, Robin Rudenwall, and Andreas Olausson. "Random forest och glesa datarespresentationer." Thesis, Högskolan i Borås, Institutionen Handels- och IT-högskolan, 2012. http://urn.kb.se/resolve?urn=urn:nbn:se:hb:diva-16672.

Full text
Abstract:
In silico experimentation is the process of using computational and statistical models to predict medicinal properties in chemicals; as a means of reducing lab work and increasing the success rate, this process has become an important part of modern drug development. There are various ways of representing molecules - the problem that motivated this paper derives from collecting substructures of the chemical into what is known as fractional representations. Assembling large sets of molecules represented in this way will result in sparse data, where a large portion of the set is null values. This consumes an excessive amount of computer memory, which inhibits the size of the data sets that can be used when constructing predictive models. In this study, we suggest a set of criteria for the evaluation of random forest implementations to be used for in silico predictive modeling on sparse data sets, with regard to computer memory usage, model construction time and predictive accuracy. A novel random forest system was implemented to meet the suggested criteria, and experiments were made to compare our implementation to existing machine learning algorithms to establish our implementation's correctness. Experimental results show that our random forest implementation can create accurate prediction models on sparse datasets, with lower memory usage overhead than implementations using a common matrix representation, and in less time than the existing random forest implementations we evaluated against. We highlight design choices made to accommodate sparse data structures and data sets in the random forest ensemble technique, and therein present potential improvements to feature selection in sparse data sets.
Program: Systemarkitekturutbildningen
APA, Harvard, Vancouver, ISO, and other styles
9

Patel, Richa. "Random mutagenesis and selection for RubisCO function in the photosynthetic bacterium Rhodobacter capsulatus." Connect to resource, 2008. http://hdl.handle.net/1811/32176.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Peng, Xiaoling. "Methods of variable selection and their applications in quantitative structure-property relationship (QSPR)." HKBU Institutional Repository, 2005. http://repository.hkbu.edu.hk/etd_ra/594.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Du, Ye Ting. "Simultaneous fixed and random effects selection in finite mixtures of linear mixed-effects models." Thesis, McGill University, 2012. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=110592.

Full text
Abstract:
Linear mixed-effects (LME) models are frequently used for modeling longitudinal data. One complicating factor in the analysis of such data is that samples are sometimes obtained from a population with significant underlying heterogeneity, which would be hard to capture by a single LME model. Such problems may be addressed by a finite mixture of linear mixed-effects (FMLME) models, which segments the population into subpopulations and models each subpopulation by a distinct LME model. Often in the initial stage of a study, a large number of predictors are introduced. However, their associations to the response variable vary from one component to another of the FMLME model. To enhance predictability and to obtain a parsimonious model, it is of great practical interest to identify the important effects, both fixed and random, in the model. Traditional variable selection techniques such as stepwise deletion and subset selection are computationally expensive even with modest numbers of covariates and components in the mixture model. In this thesis, we introduce a penalized likelihood approach and propose a nested EM algorithm for efficient numerical computations. The estimators are shown to possess consistency and sparsity properties and asymptotic normality. We illustrate the performance of the proposed method through simulations and a real data example.
APA, Harvard, Vancouver, ISO, and other styles
12

Chen, Juan. "Model selection for IRT equating of Testlet-based tests in the random groups design." Diss., University of Iowa, 2014. https://ir.uiowa.edu/etd/1439.

Full text
Abstract:
The use of testlets in a test can cause multidimensionality and local item dependence (LID), which can result in inaccurate estimation of item parameters, and in turn compromise the quality of item response theory (IRT) true and observed score equating of testlet-based tests. Both unidimensional and multidimensional IRT models have been developed to control local item dependence caused by testlets. The purposes of the current study were to (1) investigate how different levels of LID can affect IRT true and observed score equating of testlet-based tests when the traditional three parameter logistic (3PL) IRT model was used for calibration, and (2) compare the performance of four different IRT models, including the 3PL IRT model, graded response model (GRM), testlet response theory model (TRT), and bifactor model, in IRT true and observed score equating of testlet-based tests with various levels of local item dependence. Both real and simulated data analyses were conducted in this study. Two testlet-based tests (i.e., Test A and Test B) that differed in subjects, test length, and testlet length were used in the real data analysis. For simulated data analysis, two main factors were investigated in this study: (1) testlet length (5 or 10), and (2) LID level within testlets that was defined by testlet effect variance (0, 0.25, 0.5625, 0.75, 1, and 1.5). For the unidimensional IRT models (i.e., 3PL IRT model and GRM), unidimensional IRT true score and observed score equating procedures, explained in Kolen and Brennan (2004), were used. For the two investigated multidimensional IRT models (i.e., 3PL TRT model and bifactor model), the unidimensional approximation of multidimensional item response theory (MIRT) true score equating procedure and the unidimensional approximation of MIRT observed score equating procedure (Brossman & Lee, 2013) were applied. 
The traditional equipercentile equating method was used as the baseline for comparison in both real data and simulated data analyses. It was found in the study that both testlet length and the LID level affected the performance of the investigated models on IRT true and observed score equating of testlet-based tests. When the traditional 3PL IRT model was used for tests with long testlets, higher levels of local item dependence led to IRT equating results that deviated further away from those obtained from the baseline method. However, the effect of local item dependence on IRT equating results was not prominent for tests with short testlets. Moreover, for tests consisting of long testlets (e.g., a testlet length of 10 or more) and having a very low level of local item dependence (e.g., a LID level of 0.25 or lower), and for tests consisting of short testlets (e.g., a testlet length around 5), all four investigated IRT models worked well in IRT true and observed score equating. For tests with long testlets and a relatively high level of local item dependence (e.g., a LID level of 0.5625 or higher), the GRM, bifactor, and TRT models outperformed the traditional 3PL IRT model in IRT true and observed equating of testlet-based tests. The study suggested that the selection of models for IRT true and observed score equating of testlet-based tests should be considered with respect to the features of the testlet-based tests and the groups of examinees from which the data is collected. It is hoped that this study encourages researchers to identify differences among existing models for IRT true and observed score equating of testlet-based tests with various features, and to develop new models that are appropriate for modeling testlet-based tests to obtain accurate IRT number correct score equating results.
APA, Harvard, Vancouver, ISO, and other styles
13

Strobl, Carolin, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn. "Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution." Department of Statistics and Mathematics, WU Vienna University of Economics and Business, 2006. http://epub.wu.ac.at/1274/1/document.pdf.

Full text
Abstract:
Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale level or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types. Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on the one hand, and effects induced by bootstrap sampling with replacement on the other hand. We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale level or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analysing data from a study on RNA editing. Therefore, the suggested method can be applied straightforwardly by scientists in bioinformatics research. (author's abstract)
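As a minimal illustration of what a variable importance measure computes, the sketch below implements the generic permutation-importance idea (not the conditional-inference forests the authors propose): the importance of a predictor is the drop in accuracy when that predictor's column is randomly shuffled, breaking its association with the response. All names here are our own.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=None):
    # Importance of predictor j = mean drop in accuracy after shuffling
    # column j, which severs its link with the response.
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    imp = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])          # shuffle one column in place
            drops.append(baseline - np.mean(predict(Xp) == y))
        imp[j] = np.mean(drops)
    return imp

# Toy check: only the first of two predictors carries signal.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
y = (X[:, 0] > 0).astype(int)
importances = permutation_importance(lambda A: (A[:, 0] > 0).astype(int),
                                     X, y, seed=0)
```

The bias the paper documents arises one level down, inside the trees' split selection, which is why the fix is a different tree-building algorithm rather than a change to this outer computation.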
Series: Research Report Series / Department of Statistics and Mathematics
APA, Harvard, Vancouver, ISO, and other styles
14

Hjerpe, Adam. "Computing Random Forests Variable Importance Measures (VIM) on Mixed Numerical and Categorical Data." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-185496.

Full text
Abstract:
The Random Forest model is commonly used as a predictor function and has proven useful in a variety of applications. Its popularity stems from the combination of high prediction accuracy, the ability to model high-dimensional complex data, and applicability under predictor correlations. This report investigates the random forest variable importance measure (VIM) as a means to find a ranking of important variables. The robustness of the VIM under imputation of categorical noise, and its capability to differentiate informative predictors from non-informative variables, are investigated. The selection of variables may improve the robustness of the predictor, improve the prediction accuracy, reduce computational time, and serve as an exploratory data analysis tool. In addition, the partial dependency plot obtained from the random forest model is examined as a means to find underlying relations in a non-linear simulation study.
APA, Harvard, Vancouver, ISO, and other styles
15

Sokolovska, Nataliya. "Contributions to the estimation of probabilistic discriminative models: semi-supervised learning and feature selection." Phd thesis, Télécom ParisTech, 2010. http://pastel.archives-ouvertes.fr/pastel-00006257.

Full text
Abstract:
In this thesis we study the estimation of discriminative probabilistic models, with emphasis on semi-supervised learning and feature selection. The goal of semi-supervised learning is to improve the efficiency of supervised learning by making use of unlabeled data, an objective that is difficult to achieve for discriminative models. Discriminative probabilistic models can handle rich linguistic representations in the form of very high-dimensional feature vectors. Working in high dimension raises problems, computational ones in particular, which are exacerbated for sequence models such as conditional random fields (CRFs). Our contribution is twofold. First, we introduce an original and simple method for integrating unlabeled data into a semi-supervised objective function, and we prove that the corresponding semi-supervised estimator is asymptotically optimal; the case of logistic regression is illustrated with experimental results. Second, we propose an estimation algorithm for CRFs that performs model selection through an L1 penalty, and we present the results of experiments on natural language processing tasks (chunking and named-entity detection), analysing generalization performance and the selected features. Finally, we suggest several directions for improving the computational efficiency of this technique.
APA, Harvard, Vancouver, ISO, and other styles
16

Hu, Renjie. "Random neural networks for dimensionality reduction and regularized supervised learning." Diss., University of Iowa, 2019. https://ir.uiowa.edu/etd/6960.

Full text
Abstract:
This dissertation explores Random Neural Networks (RNNs) in several aspects and their applications. First, novel RNNs are proposed for dimensionality reduction and visualization. Based on Extreme Learning Machines (ELMs) and Self-Organizing Maps (SOMs), a new method is created to identify the important variables and visualize the data. This technique reduces the curse of dimensionality, further improves the interpretability of the visualization, and is tested on real nursing survey datasets. ELM-SOM+ is an autoencoder created to preserve the intrinsic quality of SOM while also bringing continuity to the projection using two ELMs. This new methodology shows considerable improvement over SOM on real datasets. Second, as a supervised learning method, ELMs have been applied to the hierarchical multiscale method to bridge molecular dynamics to continua. The method is tested on simulation data and proven to be efficient for passing information from one scale to another. Lastly, the regularization of ELMs has been studied and a new regularization algorithm for ELMs is created using a modified Lanczos algorithm. The Lanczos ELM on average divides computational time by 20 and reduces the normalized MSE by 14% compared with regular ELMs.
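The ELM building block used throughout this dissertation is compact enough to sketch: hidden weights are drawn at random and never trained, and only the output weights are fit by least squares. The code below is a generic illustration under those assumptions, not the author's implementation; the regularization shown is plain ridge, not the modified Lanczos scheme the abstract proposes.

```python
import numpy as np

def elm_fit(X, y, n_hidden=50, reg=1e-6, seed=0):
    # Random, untrained hidden layer; only the output weights beta are
    # learned, by ridge-regularized least squares on the hidden features.
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W + b)                       # random feature map
    beta = np.linalg.solve(H.T @ H + reg * np.eye(n_hidden), H.T @ y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Toy regression: learn y = sin(x) on [-3, 3].
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X).ravel()
W, b, beta = elm_fit(X, y, n_hidden=50)
mse = np.mean((elm_predict(X, W, b, beta) - y) ** 2)
```

Because only the linear solve involves training, fitting is fast, which is what makes ELMs attractive as components of larger systems like the ELM-SOM+ autoencoder described above.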
APA, Harvard, Vancouver, ISO, and other styles
17

Chakravorty, Hirak. "Equilibrium and non-equilibrium analysis of folding and sequence selection in mean field random heteropolymers." Thesis, King's College London (University of London), 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.399162.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Al Maathidi, M. M. "Optimal feature selection and machine learning for high-level audio classification: a random forests approach." Thesis, University of Salford, 2017. http://usir.salford.ac.uk/44338/.

Full text
Abstract:
Content related information, metadata, and semantics can be extracted from soundtracks of multimedia files. Speech recognition, music information retrieval and environmental sound detection techniques have been developed into a fairly mature technology enabling a final text mining process to obtain semantics for the audio scene. An efficient speech, music and environmental sound classification system, which correctly identify these three types of audio signals and feed them into dedicated recognisers, is a critical pre-processing stage for such a content analysis system. The performance and computational efficiency of such a system is predominately dependent on the selected features. This thesis presents a detailed study to identify the suitable classification features and associate a suitable machine learning technique for the intended classification task. In particular, a systematic feature selection procedure is developed to employ the random forests classifier to rank the features according to their importance and reduces the dimensionality of the feature space accordingly. This new technique avoids the trial-and-error approach used by many authors researchers. The implemented feature selection produces results related to individual classification tasks instead of the commonly used statistical distance criteria based approaches that does not consider the intended classification task, which makes it more suitable for supervised learning with specific purposes. A final collective decision-making stage is employed to combine multiple class detectors patterns into one to produce a single classification result for each input frames. The performance of the proposed feature selection technique has been compared with the techniques proposed by MPEG-7 standard to extract the reduced feature space. The results show a significant improvement in the resulted classification accuracy, at the same time, the feature space is simplified and computational overhead reduced. 
The proposed feature selection and machine learning technique enables the use of only 30 of the 47 features without degrading classification accuracy; with just 10 features, accuracy dropped by only 1.7%. Validation also shows good performance, and the final collective decision-making stage was able to improve the classification result even with only a small number of selected features. The work represents a successful attempt to determine audio feature importance and to classify audio content into speech, music and environmental sound using a selected feature subset. The results show a high degree of accuracy when random forests are used for both feature importance ranking and audio content classification.
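The ranking step described above, using random forest importances to prune the feature space, can be sketched as follows. This is an illustrative sketch with scikit-learn on synthetic data, not the thesis code; only the 47/30 feature counts are taken from the abstract.

```python
# Illustrative sketch: rank features by a random forest's impurity-based
# importances, then keep the top-30 subset for the final classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a 47-feature audio dataset with 3 classes
# (speech / music / environmental sound).
X, y = make_classification(n_samples=500, n_features=47, n_informative=10,
                           n_classes=3, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Sort feature indices by decreasing importance and keep the top 30.
ranking = np.argsort(forest.feature_importances_)[::-1]
top30 = ranking[:30]
X_reduced = X[:, top30]   # reduced feature space for the final classifier
```

The same reduced matrix would then feed whichever classifier the downstream task uses.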
APA, Harvard, Vancouver, ISO, and other styles
19

Eriksson, Viktor. "Bayesian Model Selection with Intrinsic Bayes Factor for Location-Scale Model and Random Effects Model." Thesis, Örebro universitet, Handelshögskolan vid Örebro Universitet, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:oru:diva-85152.

Full text
APA, Harvard, Vancouver, ISO, and other styles
20

New, T. M. "Random road analysis and improved gear ratio selection of a front wheel drive drag racing car." Connect to this title online, 2008. http://etd.lib.clemson.edu/documents/1211387456/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Carter, Kristina A. "A Comparison of Variable Selection Methods for Modeling Human Judgment." Ohio University / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=ohiou1552494031580848.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Frühwirth-Schnatter, Sylvia, and Regina Tüchler. "Bayesian parsimonious covariance estimation for hierarchical linear mixed models." Institut für Statistik und Mathematik, WU Vienna University of Economics and Business, 2004. http://epub.wu.ac.at/774/1/document.pdf.

Full text
Abstract:
We considered a non-centered parameterization of the standard random-effects model, which is based on the Cholesky decomposition of the variance-covariance matrix. The regression-type structure of the non-centered parameterization allows us to choose a simple, conditionally conjugate normal prior on the Cholesky factor. Based on the non-centered parameterization, we search for a parsimonious variance-covariance matrix by identifying the non-zero elements of the Cholesky factors using Bayesian variable selection methods. With this method we are able to learn from the data, for each effect, whether it is random or not, and whether covariances among random effects are zero or not. An application in marketing shows a substantial reduction in the number of free elements of the variance-covariance matrix. (author's abstract)
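The key idea above, parameterizing the covariance through its Cholesky factor so that zeros in the factor yield a parsimonious covariance, can be illustrated numerically. A minimal sketch with NumPy (the matrix is a made-up example, not data from the paper):

```python
# Write the random-effects covariance as Q = C C' with C lower triangular.
# Zeros in a row of C remove covariances with (or the randomness of) the
# corresponding effect, which is what the Bayesian variable selection
# described above exploits.
import numpy as np

Q = np.array([[4.0, 2.0, 0.0],
              [2.0, 5.0, 0.0],
              [0.0, 0.0, 1.0]])

C = np.linalg.cholesky(Q)   # lower triangular factor, Q = C @ C.T

# The zero covariances of the third effect show up as zeros in C's third row.
print(np.round(C, 3))
```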
Series: Research Report Series / Department of Statistics and Mathematics
APA, Harvard, Vancouver, ISO, and other styles
23

Lin, Hui-Fen. "A Comparison of Three Item Selection Methods in Criterion-Referenced Tests." Thesis, University of North Texas, 1988. https://digital.library.unt.edu/ark:/67531/metadc332327/.

Full text
Abstract:
This study compared three methods of selecting the best discriminating test items and the resultant test reliability of mastery/nonmastery classifications. These three methods were (a) the agreement approach, (b) the phi coefficient approach, and (c) the random selection approach. Test responses from 1,836 students on a 50-item physical science test were used, from which 90 distinct data sets were generated for analysis. These 90 data sets contained 10 replications of the combination of three different sample sizes (75, 150, and 300) and three different numbers of test items (15, 25, and 35). The results of this study indicated that the agreement approach was an appropriate method to be used for selecting criterion-referenced test items at the classroom level, while the phi coefficient approach was an appropriate method to be used at the district and/or state levels. The random selection method did not have similar characteristics in selecting test items and produced the lowest reliabilities, when compared with the agreement and the phi coefficient approaches.
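The phi coefficient approach mentioned above scores each item by the 2x2 association between answering the item correctly and being classified a master. A hedged sketch of that computation (toy data, not the study's physical science test):

```python
# Phi coefficient for item selection: cross-tabulate item success (0/1)
# against mastery status (0/1) and compute phi; items with the highest phi
# discriminate best between masters and non-masters.
import math

def phi_coefficient(item_correct, mastery):
    """2x2 phi for parallel 0/1 lists: item score vs. mastery classification."""
    a = sum(1 for i, m in zip(item_correct, mastery) if i == 1 and m == 1)
    b = sum(1 for i, m in zip(item_correct, mastery) if i == 1 and m == 0)
    c = sum(1 for i, m in zip(item_correct, mastery) if i == 0 and m == 1)
    d = sum(1 for i, m in zip(item_correct, mastery) if i == 0 and m == 0)
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0

# Toy data: 8 examinees on one item; masters tend to answer correctly.
item    = [1, 1, 1, 0, 1, 0, 0, 0]
mastery = [1, 1, 1, 1, 0, 0, 0, 0]
print(round(phi_coefficient(item, mastery), 2))  # → 0.5
```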
APA, Harvard, Vancouver, ISO, and other styles
24

Kudella, Patrick [Verfasser], and Dieter [Akademischer Betreuer] Braun. "Sequence self-selection by the network dynamics of random ligating oligomer pools / Patrick Kudella ; Betreuer: Dieter Braun." München : Universitätsbibliothek der Ludwig-Maximilians-Universität, 2021. http://d-nb.info/123264546X/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Blaha, Jeffrey. "Variable Selection Methods for Residential Real Estate Markets: An Exploration of Random Forest Trees in Spatial Economics." University of Toledo / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=toledo1503330225924692.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Singh, Vivek. "Contributions to automatic particle identification in electron micrographs: Algorithms, implementation, and applications." Doctoral diss., University of Central Florida, 2005. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/2107.

Full text
Abstract:
Three-dimensional reconstruction of large macromolecules like viruses at resolutions below 8 Å - 10 Å requires a large set of projection images, and the particle identification step becomes a bottleneck. Several automatic and semi-automatic particle detection algorithms have been developed over the years. We present a general technique designed to automatically identify the projection images of particles. The method utilizes Markov random field modelling of the projected images and involves a preprocessing of electron micrographs followed by image segmentation and post-processing for boxing of the particle projections. Due to the typically extensive computational requirements for extracting hundreds of thousands of particle projections, parallel processing becomes essential. We present parallel algorithms and load balancing schemes for our algorithms. The lack of a standard benchmark for relative performance analysis of particle identification algorithms has prompted us to develop a benchmark suite. Further, we present a collection of metrics for the relative performance analysis of particle identification algorithms on the micrograph images in the suite, and discuss the design of the benchmark suite.
Ph.D.
School of Computer Science
Engineering and Computer Science
Computer Science
APA, Harvard, Vancouver, ISO, and other styles
27

Edgel, Robert John. "Habitat Selection and Response to Disturbance by Pygmy Rabbits in Utah." BYU ScholarsArchive, 2013. https://scholarsarchive.byu.edu/etd/3928.

Full text
Abstract:
The pygmy rabbit (Brachylagus idahoensis) is a sagebrush (Artemisia sp.) obligate that depends on sagebrush habitats for food and cover throughout its life cycle. Invasive species, frequent fires, overgrazing, conversion of land to agriculture, energy development, and many other factors have contributed to recent declines in both quantity and quality of the sagebrush-steppe habitats required by pygmy rabbits. Because of the many threats to these habitats and the believed decline of pygmy rabbit populations, there is a need to further understand the habitat requirements of this species and how it responds to disturbance. This study evaluated habitat selection by pygmy rabbits in Utah and assessed the response of this small lagomorph to construction of a large-scale pipeline (the Ruby pipeline) in Utah. We collected habitat data across Utah at occupied sites (pygmy rabbit occupied burrows) and compared these data to similar measurements at unoccupied sites (random locations within sagebrush habitat where pygmy rabbits were not observed). Variables such as horizontal obscurity, elevation, percentage of understory composed of sagebrush and other shrubs, and sagebrush decadence best discriminated between occupied (active burrow) and unoccupied (randomly selected) sites. Occupied sites had greater amounts of horizontal obscurity, were located at higher elevations, had a greater percentage of understory comprised of sagebrush and shrubs, and had less decadent sagebrush. These variables should be considered when planning habitat alterations or management to enhance and protect existing habitat for pygmy rabbits. The Ruby pipeline was a large-scale pipeline project that required the removal of vegetation and the excavation of soil in a continuous linear path for the length of the pipeline. The area that was disturbed is referred to as the right of way (ROW).
From our assessment of pygmy rabbit response to construction of the Ruby pipeline, we found evidence for habitat loss and fragmentation as a result of this disturbance. The size of pygmy rabbit space-use areas and home ranges decreased post construction, rabbits shifted core-use areas away from the ROW, and there were fewer movements of collared rabbits across the ROW. Mitigation efforts should consider any action which may reduce restoration time and facilitate movements of rabbits across disturbed areas.
APA, Harvard, Vancouver, ISO, and other styles
28

Boone, Edward L. "Bayesian Methodology for Missing Data, Model Selection and Hierarchical Spatial Models with Application to Ecological Data." Diss., Virginia Tech, 2003. http://hdl.handle.net/10919/26141.

Full text
Abstract:
Ecological data is often fraught with many problems such as missing data and spatial correlation. In this dissertation we use a data set collected by the Ohio EPA as motivation for studying techniques to address these problems. The data set is concerned with the benthic health of Ohio's waterways. A new method for incorporating covariate structure and missing data mechanisms into missing data analysis is considered. This method allows us to detect relationships that other popular methods do not. We then further extend this method into model selection. In the special case where the unobserved covariates are assumed normally distributed, we use the Bayesian Model Averaging method to average the models, select the highest probability model and do variable assessment. Accuracy in calculating the posterior model probabilities using the Laplace approximation and an approximation based on the Bayesian Information Criterion (BIC) is explored. It is shown through simulation that the Laplace approximation is superior to the BIC-based approximation. Finally, hierarchical spatial linear models are considered for the data and we show how to combine analyses that have spatial correlation within and between clusters.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
29

Kisamore, Jennifer L. "Validity Generalization and Transportability: An Investigation of Distributional Assumptions of Random-Effects Meta-Analytic Methods." [Tampa, Fla.] : University of South Florida, 2003. http://purl.fcla.edu/fcla/etd/SFE0000060.

Full text
APA, Harvard, Vancouver, ISO, and other styles
30

Rönnegård, Lars. "Selection, maternal effects and inbreeding in reindeer husbandry." Uppsala : Dept. of Animal Breeding and Genetics, Swedish Univ. of Agricultural Sciences, 2003. http://epsilon.slu.se/a370.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Kamath, Vidya. "Use of Random Subspace Ensembles on Gene Expression Profiles in Survival Prediction for Colon Cancer Patients." Scholar Commons, 2005. https://scholarcommons.usf.edu/etd/715.

Full text
Abstract:
Cancer is a disease process that emerges out of a series of genetic mutations that cause seemingly uncontrolled multiplication of cells. The molecular genetics of cells indicates that different combinations of genetic events or alternative pathways in cells may lead to cancer. A study of the gene expressions of cancer cells, in combination with the external influential factors, can greatly aid in cancer management such as understanding the initiation and etiology of cancer, as well as detection, assessment and prediction of the progression of cancer. Gene expression analysis of cells yields a very large number of features that can be used to describe the condition of the cell. Feature selection methods are explored to choose the best of these features that are most relevant to the problem at hand. Random subspace ensembles created using these selected features perform poorly in predicting the 36-month survival for colon cancer patients. A modification to the random subspace scheme is proposed to enhance the accuracy of prediction. The method first applies random subspace ensembles with decision trees to select predictive features. Then, support vector machines are used to analyze the selected gene expression profiles in cancer tissue to predict the survival outcome for a patient. The proposed method is shown to achieve a weighted accuracy of 58.96%, with 40.54% sensitivity and 77.38% specificity in predicting 36-month survival for new and unknown colon cancer patients. The prediction accuracy of the method is comparable to the baseline classifiers and significantly better than random subspace ensembles on gene expression profiles of colon cancer.
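A random subspace ensemble of the kind modified above trains each tree on a random subset of features. An illustrative scikit-learn sketch on synthetic data (not the thesis's exact pipeline, which adds a support vector machine stage on the selected genes):

```python
# Random subspace ensemble: bootstrap=False with max_features < n_features
# means every tree sees all samples but only a random 10 of the 100 features.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=100, n_informative=8,
                           random_state=0)

subspace = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                             n_estimators=50, max_features=10,
                             bootstrap=False, random_state=0).fit(X, y)

print(subspace.score(X, y))   # training accuracy of the ensemble
```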
APA, Harvard, Vancouver, ISO, and other styles
32

Arnroth, Lukas, and Dennis Jonni Fiddler. "Supervised Learning Techniques : A comparison of the Random Forest and the Support Vector Machine." Thesis, Uppsala universitet, Statistiska institutionen, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-274768.

Full text
Abstract:
This thesis examines the performance of the support vector machine and the random forest models in the context of binary classification. The two techniques are compared and the better-performing one is used to construct a final parsimonious model. The data set consists of 33 observations with 89 biomarkers as features and no known dependent variable. The dependent variable is generated through k-means clustering, with a predefined final solution of two clusters. The training of the algorithms is performed using five-fold cross-validation repeated twenty times. The outcome of the training process reveals that the best performing versions of the models are a linear support vector machine and a random forest with six randomly selected features at each split. The final results of the comparison of these optimally tuned algorithms on the test set show that the random forest outperforms the linear kernel support vector machine. The former classifies all observations in the test set correctly whilst the latter classifies all but one correctly. Hence, a parsimonious random forest model using the top five features is constructed, which, to conclude, performs equally well on the test set compared to the original random forest model using all features.
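The evaluation protocol described above (five-fold cross-validation repeated twenty times, a linear SVM versus a random forest with six features per split) can be sketched with scikit-learn. The data here is a synthetic stand-in for the 33-observation, 89-biomarker set; this is not the thesis code.

```python
# Repeated 5-fold CV comparison of a linear SVM and a random forest
# with max_features=6 (six randomly selected features at each split).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=33, n_features=89, n_informative=5,
                           random_state=1)

cv = RepeatedKFold(n_splits=5, n_repeats=20, random_state=1)
svm_scores = cross_val_score(SVC(kernel="linear"), X, y, cv=cv)
rf_scores = cross_val_score(RandomForestClassifier(max_features=6,
                                                   random_state=1),
                            X, y, cv=cv)

print(svm_scores.mean(), rf_scores.mean())
```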
APA, Harvard, Vancouver, ISO, and other styles
33

Söderberg, Max Joel, and Axel Meurling. "Feature selection in short-term load forecasting." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-259692.

Full text
Abstract:
This paper investigates the correlation between energy consumption 24 hours ahead and the features used for predicting energy consumption. The features originate from three categories: weather, time and previous energy consumption. The correlations are calculated using Pearson correlation and mutual information. The most highly correlated features were those representing previous energy consumption, followed by temperature and month. Two identical feature sets containing all attributes (in this report, the words ”attribute” and ”feature” are used interchangeably) were obtained by ranking the features according to correlation. Three feature sets were created manually. The first set contained seven attributes representing previous energy consumption over the course of the seven days prior to the day of prediction. The second set consisted of weather and time attributes. The third set consisted of all attributes from the first and second sets. These sets were then compared on different machine learning models. It was found that the set containing all attributes and the set containing previous energy attributes yielded the best performance for each machine learning model.
This report investigates the correlation and importance of different attributes for predicting energy consumption 24 hours ahead. The attributes come from three categories: weather, time and previous energy consumption. The correlations are obtained using Pearson correlation and mutual information. The most highly correlated attributes were those representing previous energy consumption, followed by temperature and month. Two identical attribute sets were obtained by ranking the attributes by correlation. Three attribute sets were created manually. The first set contained seven attributes representing previous energy consumption, one for each of the seven days preceding the prediction date. The second set consisted of weather and time attributes. The third set consisted of all attributes from the first and second sets. These sets were then compared using different machine learning models. The results showed that the set with all attributes and the set with previous energy consumption gave the best results for all models.
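The two correlation measures used in this kind of ranking can be sketched in a few lines. This is an illustrative sketch on synthetic data; the feature names are hypothetical stand-ins for the weather/time/previous-energy categories, not the thesis dataset.

```python
# Rank candidate features against next-day load by absolute Pearson
# correlation and by mutual information.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 500
features = {
    "load_prev_day": rng.normal(size=n),
    "temperature":   rng.normal(size=n),
    "month":         rng.integers(1, 13, size=n).astype(float),
}
# Target: strongly driven by previous load, weakly by temperature.
y = (0.9 * features["load_prev_day"] + 0.2 * features["temperature"]
     + 0.1 * rng.normal(size=n))

X = np.column_stack(list(features.values()))
pearson = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
mi = mutual_info_regression(X, y, random_state=0)

for name, r, m in zip(features, pearson, mi):
    print(f"{name:14s} |r|={r:.2f}  MI={m:.2f}")
```

Both rankings put the previous-load feature first, mirroring the finding above.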
APA, Harvard, Vancouver, ISO, and other styles
34

Michelfelder, Stefan. "Selection and characterization of targeted vector capsids from random adeno-associated virus type 2 (AAV-2) display peptide libraries." [S.l. : s.n.], 2008. http://nbn-resolving.de/urn:nbn:de:bsz:352-opus-72406.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Frot, Benjamin. "Graphical model selection for Gaussian conditional random fields in the presence of latent variables : theory and application to genetics." Thesis, University of Oxford, 2016. https://ora.ox.ac.uk/objects/uuid:0a6799ed-fca1-48b2-89cd-ad6f2c0439af.

Full text
Abstract:
The task of performing graphical model selection arises in many applications in science and engineering. The field of application of interest in this thesis relates to datasets that include genetic and multivariate phenotypic data. There are several factors that make this problem particularly challenging: some of the relevant variables might not be observed, high-dimensionality might cause identifiability issues and, finally, it might be preferable to learn the model over a subset of the collection while conditioning on the rest of the variables, e.g. genetic variants. We suggest addressing these problems by learning a conditional Gaussian graphical model, while accounting for latent variables. Building on recent advances in this field, we decompose the parameters of a conditional Markov random field into the sum of a sparse and a low-rank matrix. We derive convergence bounds for this novel estimator, show that it is well-behaved in the high-dimensional regime and describe algorithms that can be used when the number of variables is in the thousands. Through simulations, we illustrate the conditions required for identifiability and show that this approach is consistent in a wider range of settings. In order to show the practical implications of our work, we apply our method to two real datasets and devise a metric that makes use of an independent source of information to assess the biological relevance of the estimates. In our first application, we use the proposed approach to model the levels of 39 metabolic traits conditional on hundreds of genetic variants, in two independent cohorts. We find our results to be better replicated across cohorts than those obtained with other methods. In our second application, we look at a high-dimensional gene expression dataset. We find that our method is capable of retrieving as many biologically relevant gene-gene interactions as other methods while retrieving fewer irrelevant interactions.
APA, Harvard, Vancouver, ISO, and other styles
36

Russ, Ricardo. "Service Level Achievments - Test Data for Optimal Service Selection." Thesis, Linnéuniversitetet, Institutionen för datavetenskap (DV), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-50538.

Full text
Abstract:
This bachelor’s thesis was written in the context of a joint research group that developed a framework for finding and providing the best-fit web service for a user. The research group's problem lies in testing the framework sufficiently. The framework can be tested either with test data produced by real web services, which costs money, or with generated test data based on a simulation of web service behaviour. The second approach has been developed within this thesis in the form of a test data generator. The generator simulates a web service request by defining internal services, where each service has its own internal graph reflecting its structure. A service can be atomic or can be composed of other services that are called in a specific manner (sequentially, in a loop, or conditionally). The test data are generated by randomly traversing the services, which results in variable response times, since the graph structure changes every time the system is initialized. The implementation process revealed problems that could not be solved within the time frame. These problems present interesting challenges for the dynamic generation of random graphs and should be targeted in further research.
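The generator idea described above (services that are atomic or composed of other services called sequentially, in a loop, or conditionally, traversed randomly to produce variable response times) can be sketched as follows. All names and parameters here are illustrative assumptions, not the thesis implementation.

```python
# Sketch of a random service-graph test data generator: build a random
# service structure once, then simulate requests through it.
import random

def build_service(depth, rng):
    """Randomly build a service: a dict for an atomic or composite node."""
    if depth == 0 or rng.random() < 0.4:
        return {"kind": "atomic", "time": rng.uniform(5, 50)}  # ms
    kind = rng.choice(["sequence", "loop", "conditional"])
    children = [build_service(depth - 1, rng) for _ in range(rng.randint(1, 3))]
    return {"kind": kind, "children": children}

def response_time(service, rng):
    """Simulate one request walking through the service graph."""
    if service["kind"] == "atomic":
        return service["time"]
    if service["kind"] == "sequence":
        return sum(response_time(c, rng) for c in service["children"])
    if service["kind"] == "loop":
        # repeat the first child a random number of times
        return sum(response_time(service["children"][0], rng)
                   for _ in range(rng.randint(1, 4)))
    # conditional: exactly one randomly chosen branch is taken
    return response_time(rng.choice(service["children"]), rng)

rng = random.Random(42)
svc = build_service(depth=3, rng=rng)
samples = [response_time(svc, rng) for _ in range(5)]  # variable times
```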
APA, Harvard, Vancouver, ISO, and other styles
37

Körber, Julian [Verfasser], and Susanne [Akademischer Betreuer] Rässler. "Bayesian Analysis of Network Data. Model Selection and Evaluation of the Exponential Random Graph Model / Julian Körber ; Betreuer: Susanne Rässler." Bamberg : Otto-Friedrich-Universität Bamberg, 2018. http://d-nb.info/1160938849/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Peck, Riley D. "Seasonal Habitat Selection by Greater Sage Grouse in Strawberry Valley Utah." BYU ScholarsArchive, 2011. https://scholarsarchive.byu.edu/etd/3180.

Full text
Abstract:
This study examined winter habitat use and nesting ecology of greater sage grouse (Centrocercus urophasianus) in Strawberry Valley (SV), Utah located in the north-central part of the state. We monitored sage grouse with the aid of radio telemetry throughout the year, but specifically used information from the winter and nesting periods for this study. Our study provided evidence that sage grouse show fidelity to nesting areas in subsequent years regardless of nest success. We found only 57% of our nests located within the 3 km distance from an active lek typically used to delineate critical nesting habitat. We suggest a more conservative distance of 10 km for our study area. Whenever possible, we urge consideration of nest-area fidelity in conservation planning across the range of greater sage grouse. We also evaluated winter-habitat selection at multiple spatial scales. Sage grouse in our study area selected gradual slopes with high amounts of sagebrush exposed above the snow. We produced a map that identified suitable winter habitat for sage grouse in our study area. This map highlighted core areas that should be conserved and will provide a basis for management decisions affecting Strawberry Valley, Utah.
APA, Harvard, Vancouver, ISO, and other styles
39

Zhang, Qing Frankowski Ralph. "An empirical evaluation of the random forests classifier models for variable selection in a large-scale lung cancer case-control study /." See options below, 2006. http://proquest.umi.com/pqdweb?did=1324365481&sid=1&Fmt=2&clientId=68716&RQT=309&VName=PQD.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Hermann, Philipp [Verfasser], and Hajo [Akademischer Betreuer] Holzmann. "High-dimensional, robust, heteroscedastic variable selection with the adaptive LASSO, and applications to random coefficient regression / Philipp Hermann ; Betreuer: Hajo Holzmann." Marburg : Philipps-Universität Marburg, 2021. http://d-nb.info/1236692187/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Meyer, Patrick E. "Information-theoretic variable selection and network inference from microarray data." Doctoral thesis, Universite Libre de Bruxelles, 2008. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/210396.

Full text
Abstract:
Statisticians routinely model interactions between variables on the basis of observed data. In a lot of emerging fields, like bioinformatics, they are confronted with datasets having thousands of variables, a lot of noise, non-linear dependencies and only tens of samples. The detection of functional relationships, when such uncertainty is contained in the data, constitutes a major challenge. Our work focuses on variable selection and network inference from datasets having many variables and few samples (a high variable-to-sample ratio), such as microarray data. Variable selection is the topic of machine learning whose objective is to select, among a set of input variables, those that lead to the best predictive model. The application of variable selection methods to gene expression data allows, for example, improved cancer diagnosis and prognosis by identifying a new molecular signature of the disease. Network inference consists in representing the dependencies between the variables of a dataset by a graph. Hence, when applied to microarray data, network inference can reverse-engineer the transcriptional regulatory network of the cell with a view to discovering new drug targets to cure diseases. In this work, two original tools are proposed: MASSIVE (Matrix of Average Sub-Subset Information for Variable Elimination), a new method of feature selection, and MRNET (Minimum Redundancy NETwork), a new algorithm of network inference. Both tools rely on the computation of mutual information, an information-theoretic measure of dependency. More precisely, MASSIVE and MRNET use approximations of the mutual information between a subset of variables and a target variable, based on combinations of mutual information between sub-subsets of variables and the target. These approximations make it possible to estimate a series of low-variate densities instead of one large multivariate density. Low-variate densities are well suited for dealing with high variable-to-sample ratio datasets, since they are rather cheap in terms of computational cost and they do not require a large amount of samples in order to be estimated accurately. Numerous experimental results show the competitiveness of these new approaches. Finally, our thesis has led to a freely available source code of MASSIVE and an open-source R and Bioconductor package for network inference.
Doctorat en sciences, Spécialisation Informatique
info:eu-repo/semantics/nonPublished

APA, Harvard, Vancouver, ISO, and other styles
42

Tanyildiz, Zeynep Esra. "The Effects of Networks on Institution Selection by Foreign Doctoral Students in the U.S." Digital Archive @ GSU, 2008. http://digitalarchive.gsu.edu/pmap_diss/25.

Full text
Abstract:
The United States has been a very attractive destination for foreign science and engineering graduate students and postdoctoral scholars for a considerable period of time. Despite the important role of foreign doctoral students in the U.S., relatively little is known about the factors influencing their decision to attend an institution. One factor that is rarely explored is the effect of networks on institution selection. This study aims to provide both qualitative and quantitative information about the role networks play in foreign doctoral students' institution selection. This three-part study utilizes different methodologies: (1) focus group interviews conducted with Turkish doctoral students at the Georgia Institute of Technology; (2) a web study of research laboratories in science and engineering; and (3) the estimation of a Random Utility Model (RUM) of institution selection. Guided focus group interviews provide important qualitative information about the ways students, alumni, faculty and the local community of the same nationality influence institution choice. The web study of research laboratories provides evidence that labs directed by foreign-born faculty are more likely to be populated by students from the same country of origin than are labs directed by native (U.S.-born) faculty. The results from the RUM of institution selection provide strong and significant evidence for the relationship between the number of existing students from a country of origin at an institution and the probability of attending that institution for potential applicants from the same country of origin. In some of the models there is also evidence that alumni and faculty from the same origin play a role in student choice. The results of this study have several policy implications related to the integration of foreign doctoral students, future enrollments, institutional mismatch, and the role foreign-born faculty play in U.S. universities.
APA, Harvard, Vancouver, ISO, and other styles
43

Cattoglio, Claudia. "Target site selection of retroviral vectors in the human genome : viral and genomic determinants of non-random integration patterns in hematopoietic cells." Thesis, Open University, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.494505.

Full text
Abstract:
Integration of gamma-retroviruses (RV) and lentiviruses (LV) follows different, non-random patterns in mammalian genomes. To obtain information about the viral and genomic determinants of integration preferences, I mapped > 2,500 integration sites of RV and LV vectors carrying wild-type or modified LTRs in human CD34+ hematopoietic cells. To investigate the role of transcriptional regulatory networks in directing RV and LV integration, I evaluated the local abundance and arrangement of putative transcription factor binding sites (TFBSs) in the genomic regions flanking integrated proviruses.
APA, Harvard, Vancouver, ISO, and other styles
44

Jiménez, Montero José Antonio. "Selección genómica en poblaciones reducidas de vacuno de leche." Doctoral thesis, Universitat Politècnica de València, 2013. http://hdl.handle.net/10251/27649.

Full text
Abstract:
Genomic selection is profoundly changing the dairy cattle market. It is now possible to obtain highly accurate genetic evaluations for very young animals without requiring their own phenotype or that of their daughters. Consequently, the genetic response of a well-designed genomic program clearly exceeds that of traditional selection. This improvement is modifying one of the traditional principles of the dairy cattle market: the preference for bulls with high reliabilities over animals with a priori higher genetic values. This thesis contains six chapters studying the basis for implementing the genomic selection program in Spanish dairy cattle. To this end, simulation studies and genomic evaluations with real data from the first national reference population were carried out. The main objective of this thesis is to contribute to the implementation of genomic selection in Spanish dairy cattle. The specific objectives are: (1) to study genotyping alternatives in small dairy cattle populations; (2) to develop and validate methodology for the evaluation of large numbers of genotypes; (3) to study the effect of genotype imputation processes on the predictive ability of the resulting genotypes. The main issues related to genomic selection in dairy cattle are discussed in chapter 1, including: the statistical and genetic foundations of genomic selection, the design of reference populations, a review of the state of the art in genomic evaluation methodology, the design and methods of imputation algorithms, and the implementation of genomic selection in dairy cattle at the level of the breeding program, the artificial insemination center and the commercial farm.
En el capítulo 2 se realizó un estudio de simulación comparando estrategias de genotipado selectivo en poblaciones de hembras frente al uso de selección tradicional o selección genómica con una población de referencia de machos. La población de referencia española estaba formada en principio por algo más de 1,600 toros con prueba de progenie. Este tamaño no es, en principio, suficiente para obtener predicciones genómicas de alta fiabilidad. Por tanto, debían evaluarse diferentes alternativas para incrementar la habilidad predictiva de las evaluaciones. Las estrategias que consisten en usar como población de referencia los animales en los extremos de la distribución fenotípica permitían mejorar la precisión de la evaluación. Los resultados usando 1,000 genotipos fueron 0.50 para el carácter de baja heredabilidad y 0.63 para el de heredabilidad media cuando la variable dependiente fue el fenotipo ajustado. Cuando se usaron valores genéticos como variable dependiente las correlaciones fueron 0.48 y 0.63 respectivamente. Para los mismos caracteres, una población de 996 machos obtuvo correlaciones de 0.48 y 0.55 en las predicciones posteriores. El estudio concluye que la estrategia de genotipado que proporciona la mayor correlación es la que incluye las hembras de ambas colas de la distribución de fenotipos. Por otro lado se pone de manifiesto que la mera inclusión de las hembras élite que son las habitualmente genotipadas en las poblaciones reales produce resultados no satisfactorios en la predicción de valores genómicos. En el capítulo 3, el Random Boosting (R-Boost) es comparado con otros métodos de evaluación genómica como Bayes-A, LASSO Bayesiano y G-BLUP. La población de referencia española y caracteres incluidos en las evaluaciones genéticas tradicionales de vacuno lechero fueron usados para comparar estos métodos en términos de precisión y sesgo. 
Las predicciones genómicas fueron más precisas que el índice de pedigrí tradicional a la hora de predecir los resultados de futuros test de progenie como era de esperar. Las ganancias en precisión debidas al empleo de la selección genómica dependen del carácter evaluado y variaron entre 0.04 (Profundidad de ubre) y 0.42 (Porcentaje de grasa) unidades de correlación de Pearson. Los resultados promediados entre caracteres mostraron que el LASSO Bayesiano obtuvo mayores correlaciones superando al R-Boost, Bayes-A y G-BLUP en 0.01, 0.03 y 0.03 unidades respectivamente. Las predicciones obtenidas con el LASSO Bayesiano también mostraron menos desviaciones en la media, 0.02, 0.03 y 0.10 menos que Bayes-A, R-Boost y G-BLUP, respectivamente. Las predicciones usando R-Boost obtuvieron coeficientes de regresión más próximos a la unidad que el resto de métodos y los errores medios cuadráticos fueron un 2%, 10% y 12% inferiores a los obtenidos a partir del B-LASSO, Bayes-A y G-BLUP, respectivamente. El estudio concluye que R- Boost es una metodología aplicable a selección genómica y competitiva en términos de capacidad predictiva. En el capítulo 4, el algoritmo de machine learning R-Boost evaluado en el capítulo 3 es descrito e implementado para selección genómica adaptado a la evaluación de grandes bases de datos de una forma eficiente. Tras la incorporación en el consorcio Eurogenomics, el programa genómico español pasó a disponer de más de 22,000 toros probados como población de referencia, por tanto era necesario implementar un método capaz de evaluar éste gran conjunto de datos en un tiempo razonable. El nuevo algoritmo denominado R-Boost realiza de forma secuencial un muestreo aleatorio de SNPs en cada iteración sobre los cuales se aplica un predictor débil. El algoritmo fue evaluado sobre datos reales de vacuno de leche empleados en el capítulo 3 estudiando más en profundidad el comportamiento de los parámetros de sintonización. 
Esta propuesta de modificación del Boosting puede obtener predicciones sin perdida de precisión o incrementos de sesgo empleando tan solo un 1% del tiempo de computación original. En el capítulo 5 se evalúa el efecto de usar genotipos de baja densidad imputados con el software Beagle en cuanto a su posterior habilidad predictiva cuando son incorporados a la población de referencia. Para ello se emplearon dos métodos de evaluación R-Boost y un BLUP con matriz genómica. Animales de los que se conocían los SNPs incluidos en los chips GoldenGate Bovine 3K y BovineLD BeadChip, fueron imputados hasta conocer los SNPs incluidos en el BovineSNP50v2 BeadChip. Posteriormente, un segundo proceso de imputación obtuvo los SNPs incluidos en el BovineHD BeadChip. Tras imputatar desde dos genotipados a baja densidad, se obtuvo similar capacidad predictiva a la obtenida empleando los originales en densidad 50K. Sin embargo, sólo se obtuvo una pequeña mejora (0.002 unidades de Pearson) al imputar a HD. El mayor incremento se obtuvo para el carácter días abiertos donde las correlaciones en el grupo de validación aumentaron en 0.06 unidades de Pearson las correlaciones en el grupo de validación cuando se emplearon los genotipos imputados a HD. En función de la densidad de genotipado, el algoritmo R-Boost mostró mayores diferencias que el G-BLUP. Ambos métodos obtuvieron resultados similares salvo en el caso de porcentaje de grasa, donde las predicciones obtenidas con el R-Boost fueron superiores a las del G-BLUP en 0.20 unidades de correlación de Pearson. El estudio concluye que la capacidad predictiva para algunos caracteres puede mejorar imputando la población de referencia a HD así como empleando métodos de evaluación capaces de adaptarse a las distintas arquitecturas genéticas posibles. 
Finalmente en el capitulo 6 se desarrolla una discusión general de los estudios presentados en los capítulos anteriores y se enlazan con la implementación de la selección genómica en el vacuno lechero español, que se ha desarrollado en paralelo a esta tesis doctoral. La primera población de referencia con unos 1.600 toros fue evaluada en el capítulo 4 y fue usada para comparar los distintos métodos y escenarios propuestos en los capítulos 3, 4 y 5. La primera evaluación genómica obtenida para los caracteres incluidos en el capítulo 4 de esta tesis estuvo disponible para los centros de inseminación incluidos en el programa en septiembre de 2011. La población de Eurogenomics se incorporó en Noviembre de dicho año, completándose la primera evaluación para los caracteres incluidos en el índice de selección ICO en Febrero de 2012 empleando el R-Boost descrito en el capítulo 3. En mayo de 2012 las evaluaciones del carácter proteína fueron validadas por Interbull y finalmente el 30 de Noviembre del 2012 las primeras evaluaciones genómicas oficiales fueron publicadas on-line por la federación de ganaderos CONAFE (http://www.conafe.com/noticias/20121130a.htm).
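The core idea of Random Boosting summarized in this abstract, a weak predictor fitted to a fresh random subsample of SNPs at each boosting iteration, can be sketched roughly as follows. This is a minimal illustration, not the thesis's actual implementation: the simulated data, sample sizes, and tuning values (number of iterations, SNPs per iteration, shrinkage) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 500 animals x 2,000 SNP genotypes (0/1/2), additive trait
n, p = 500, 2000
X = rng.integers(0, 3, size=(n, p)).astype(float)
true_effects = np.zeros(p)
true_effects[rng.choice(p, 50, replace=False)] = rng.normal(0, 0.5, 50)
y = X @ true_effects + rng.normal(0, 1.0, n)

def random_boost(X, y, n_iter=200, n_snps=50, shrinkage=0.1):
    """Boosting where each weak learner is an ordinary least-squares
    fit of the current residuals on a random subsample of SNP columns."""
    n, p = X.shape
    pred = np.full(n, y.mean())
    models = [(None, None, y.mean())]          # store the intercept stage
    for _ in range(n_iter):
        resid = y - pred                       # current residuals
        cols = rng.choice(p, n_snps, replace=False)  # random SNP subsample
        beta, *_ = np.linalg.lstsq(X[:, cols], resid, rcond=None)
        pred += shrinkage * (X[:, cols] @ beta)      # shrunken update
        models.append((cols, beta, None))
    return models, pred

models, fitted = random_boost(X, y)
corr = np.corrcoef(fitted, y)[0, 1]            # in-sample fit quality
```

Because each iteration touches only a small random subset of markers, the cost per iteration is fixed regardless of the total SNP count, which is consistent with the large computation-time savings the abstract reports.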
Jiménez Montero, JA. (2013). Selección genómica en poblaciones reducidas de vacuno de leche [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/27649
TESIS
APA, Harvard, Vancouver, ISO, and other styles
45

Wang, Xing. "Time Dependent Kernel Density Estimation: A New Parameter Estimation Algorithm, Applications in Time Series Classification and Clustering." Scholar Commons, 2016. http://scholarcommons.usf.edu/etd/6425.

Full text
Abstract:
The Time Dependent Kernel Density Estimation (TDKDE) developed by Harvey & Oryshchenko (2012) is a kernel density estimation adjusted by the Exponentially Weighted Moving Average (EWMA) weighting scheme. The Maximum Likelihood Estimation (MLE) procedure for estimating the parameters proposed by Harvey & Oryshchenko (2012) is easy to apply but has two inherent problems. In this study, we evaluate the performance of the probability density estimation in terms of the uniformity of Probability Integral Transforms (PITs) on various kernel functions combined with different preset numbers. Furthermore, we develop a new estimation algorithm which can be conducted using Artificial Neural Networks to eliminate the inherent problems with the MLE method and to improve the estimation performance as well. Based on the new estimation algorithm, we develop the TDKDE-based Random Forests time series classification algorithm, which is significantly superior to the commonly used statistical feature-based Random Forests method as well as the Kernel Density Estimation (KDE)-based Random Forests approach. Furthermore, the proposed TDKDE-based Self-organizing Map (SOM) clustering algorithm is demonstrated to be superior to the widely used Discrete-Wavelet-Transform (DWT)-based SOM method in terms of the Adjusted Rand Index (ARI).
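The EWMA-weighted kernel density estimate at the heart of TDKDE can be sketched as below: each observation receives a weight that decays exponentially with its age, so the estimated density tracks a drifting distribution. The bandwidth, decay parameter, and data here are illustrative assumptions, not the values estimated in the thesis.

```python
import numpy as np

def tdkde(x_grid, data, h=0.3, omega=0.97):
    """Time-dependent KDE in the spirit of Harvey & Oryshchenko:
    observation t of T gets weight proportional to omega**(T - 1 - t),
    so the most recent point carries the largest weight."""
    T = len(data)
    w = omega ** np.arange(T - 1, -1, -1.0)  # newest observation -> weight 1
    w /= w.sum()                             # normalise so density integrates to 1
    u = (x_grid[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)  # Gaussian kernel
    return K @ w                             # weighted mixture of kernels

rng = np.random.default_rng(1)
# Series whose mean drifts from 0 to 3 over time
data = rng.normal(np.linspace(0, 3, 500), 1.0)
grid = np.linspace(-4.0, 7.0, 221)
dens = tdkde(grid, data)
mode = grid[np.argmax(dens)]  # sits near the *recent* mean, not the overall mean
```

With `omega = 0.97` the effective sample is roughly the last 1/(1 - omega) ≈ 33 observations, which is why the mode of the estimate lands near the end-of-series mean of about 3 rather than the full-sample mean of 1.5.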
APA, Harvard, Vancouver, ISO, and other styles
46

Kaze, Joshua Taft. "Habitat Selection by Two K-Selected Species: An Application to Bison and Sage Grouse." BYU ScholarsArchive, 2013. https://scholarsarchive.byu.edu/etd/4284.

Full text
Abstract:
Population growth for species with long lifespans and low reproductive rates (i.e., K-selected species) is influenced primarily by survival of adult females and survival of young. Because survival of adults and young is influenced by habitat quality and resource availability, it is important for managers to understand factors that influence habitat selection during the period of reproduction. My thesis contains two chapters addressing this issue for K-selected species in Utah. Chapter one evaluates habitat selection of greater sage-grouse (Centrocercus urophasianus) on Diamond Mountain during the critical nesting and brood-rearing period. Chapter two addresses selection of birth sites by bison (Bison bison) on Antelope Island, Utah. We collected micro-habitat data for 88 nests and 138 brood locations of greater sage-grouse from 2010-2012 to determine habitat preferences of nesting and brooding sage-grouse. Using random forests modeling techniques, we found that percent sagebrush, percent canopy cover, percent total shrubs, and percent obscurity (Robel pole) best differentiated nest locations from random locations, with selection of higher values in each case. We used a 26-day nesting period to determine an average nest survival rate of 0.35 (95% CI = 0.23 – 0.47) for adults and 0.31 (95% CI = 0.14 – 0.50) for juvenile grouse. Brood sites were closer to habitat edges and contained more forbs and less rock than random locations. Average annual adult female survival across the two-year study period was 0.52 (95% CI = 0.38 – 0.65) compared to 0.43 (95% CI = 0.28 – 0.59) for yearlings. Brooding and nesting habitat at use locations on Diamond Mountain met or exceeded published guidelines for everything but forb cover at nest sites. Adult and juvenile survival rates were in line with average values from around the range, whereas nest success was on the low end of reported values. 
For bison, we quantified variables surrounding 35 birth sites and 100 random sites during 2010 and 2011 on Antelope Island State Park. We found females selected birth sites based on landscape attributes such as curvature and elevation, but also distance to anthropogenic features (i.e., human structures such as roads or trails). Models with variables quantifying the surrounding vegetation received no support. Coefficients associated with top models indicated that areas near anthropogenic features had a lower probability of selection as birth sites. Our model predicted 91% of observed birth sites in medium-high or high probability categories. This model of birthing habitat, in combination with data on birth timing, provides biologists with a map of high-probability birthing areas and a time of year in which human access to trails or roads could be minimized to reduce conflict between recreation and female bison.
APA, Harvard, Vancouver, ISO, and other styles
47

Tanyildiz, Zeynep Esra. "Effects of networks on U.S. institution selection by foreign doctoral students in science and engineering." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2008. http://hdl.handle.net/1853/22644.

Full text
Abstract:
Thesis (Ph. D.)--Public Policy, Georgia Institute of Technology, 2008.
Committee Chair: Paula E. Stephan; Committee Member: Albert J. Sumell; Committee Member: Erdal Tekin; Committee Member: Gregory B. Lewis; Committee Member: Mary Frank Fox.
APA, Harvard, Vancouver, ISO, and other styles
48

Cole, James Jacob. "Assessing Nonlinear Relationships through Rich Stimulus Sampling in Repeated-Measures Designs." OpenSIUC, 2018. https://opensiuc.lib.siu.edu/dissertations/1587.

Full text
Abstract:
Explaining a phenomenon often requires identification of an underlying relationship between two variables. However, it is common practice in psychological research to sample only a few values of an independent variable. Young, Cole, and Sutherland (2012) showed that this practice can impair model selection in between-subject designs. The current study expands that line of research to within-subject designs. In two Monte Carlo simulations, model discrimination under systematic sampling of 2, 3, or 4 levels of the IV was compared with that under random uniform sampling and sampling from a Halton sequence. The number of subjects, number of observations per subject, effect size, and between-subject parameter variance in the simulated experiments were also manipulated. Random sampling outperformed the other methods in model discrimination, with only small, function-specific costs to parameter estimation. Halton sampling also produced good results but was less consistent. The systematic sampling methods were generally rank-ordered by the number of levels they sampled.
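The three sampling schemes compared in this abstract can be sketched as below. The IV range [0, 1] and sample sizes are illustrative assumptions; the point is only the contrast between a few fixed levels, fully random draws, and a deterministic low-discrepancy (Halton / van der Corput) sequence.

```python
import numpy as np

def systematic_levels(k, lo=0.0, hi=1.0):
    """Systematic sampling: k equally spaced IV levels, endpoints included."""
    return np.linspace(lo, hi, k)

def random_levels(n, lo=0.0, hi=1.0, rng=None):
    """Random uniform sampling: a fresh IV value for every observation."""
    if rng is None:
        rng = np.random.default_rng()
    return rng.uniform(lo, hi, n)

def halton_levels(n, base=2, lo=0.0, hi=1.0):
    """One-dimensional Halton (van der Corput) sequence: deterministic,
    yet it fills the interval more evenly than random draws."""
    seq = np.empty(n)
    for i in range(1, n + 1):
        f, r, x = 1.0, 0.0, i
        while x > 0:                 # reverse the base-`base` digits of i
            f /= base
            r += f * (x % base)
            x //= base
        seq[i - 1] = r
    return lo + (hi - lo) * seq
```

For base 2, the Halton sequence begins 1/2, 1/4, 3/4, 1/8, 5/8, ..., which shows how it progressively refines coverage of the IV range instead of clustering the way random draws can.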
APA, Harvard, Vancouver, ISO, and other styles
49

Duan, Haoyang. "Applying Supervised Learning Algorithms and a New Feature Selection Method to Predict Coronary Artery Disease." Thèse, Université d'Ottawa / University of Ottawa, 2014. http://hdl.handle.net/10393/31113.

Full text
Abstract:
From a fresh data science perspective, this thesis discusses the prediction of coronary artery disease based on Single-Nucleotide Polymorphisms (SNPs) from the Ontario Heart Genomics Study (OHGS). First, the thesis explains the k-Nearest Neighbour (k-NN) and Random Forest learning algorithms, and includes a complete proof that k-NN is universally consistent in finite dimensional normed vector spaces. Second, the thesis introduces two dimensionality reduction techniques: Random Projections and a new method termed Mass Transportation Distance (MTD) Feature Selection. Then, this thesis compares the performance of Random Projections with k-NN against MTD Feature Selection and Random Forest for predicting artery disease. Results demonstrate that MTD Feature Selection with Random Forest is superior to Random Projections and k-NN. Random Forest is able to obtain an accuracy of 0.6660 and an area under the ROC curve of 0.8562 on the OHGS dataset, when 3335 SNPs are selected by MTD Feature Selection for classification. This area is considerably better than the previous high score of 0.608 obtained by Davies et al. in 2010 on the same dataset.
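The overall pipeline described in this abstract, reduce the SNP set with a feature-selection step and then classify, can be sketched as below. This is a generic stand-in: the univariate filter here is NOT the thesis's Mass Transportation Distance criterion, and the simulated SNP data, sizes, and k-NN settings are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy SNP matrix: 200 subjects x 300 SNPs (0/1/2); 10 SNPs carry disease signal
n, p, informative = 200, 300, 10
y = rng.integers(0, 2, n)                      # case/control status
X = rng.integers(0, 3, size=(n, p)).astype(float)
X[:, :informative] += 1.5 * y[:, None]         # shift cases on the signal SNPs

def filter_select(X, y, k):
    """Stand-in filter-style feature selection: score each SNP by the
    absolute difference of class means, keep the top k columns."""
    score = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return np.argsort(score)[::-1][:k]

def knn_predict(X_train, y_train, X_test, k=5):
    """Plain k-nearest-neighbour majority vote using Euclidean distance."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

cols = filter_select(X[:150], y[:150], k=20)   # select on the training split only
pred = knn_predict(X[:150][:, cols], y[:150], X[150:][:, cols])
acc = (pred == y[150:]).mean()
```

Note that the feature scores are computed on the training split only; selecting features on the full dataset before splitting would leak information into the test accuracy.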
APA, Harvard, Vancouver, ISO, and other styles
50

Khan, Syeduzzaman. "A PROBABILISTIC MACHINE LEARNING FRAMEWORK FOR CLOUD RESOURCE SELECTION ON THE CLOUD." Scholarly Commons, 2020. https://scholarlycommons.pacific.edu/uop_etds/3720.

Full text
Abstract:
The execution of scientific applications on the Cloud comes with great flexibility, scalability, cost-effectiveness, and substantial computing power. Market-leading Cloud service providers such as Amazon Web Services (AWS), Azure, and Google Cloud Platform (GCP) offer various general-purpose, memory-intensive, and compute-intensive Cloud instances for the execution of scientific applications. The scientific community, especially small research institutions and undergraduate universities, faces many hurdles while conducting high-performance computing research in the absence of large dedicated clusters. The Cloud provides a lucrative alternative to dedicated clusters; however, the wide range of Cloud computing choices makes instance selection difficult for end-users. This thesis aims to simplify Cloud instance selection for end-users by proposing a probabilistic machine learning framework that allows users to select a suitable Cloud instance for their scientific applications. This research builds on the previously proposed A2Cloud-RF framework that recommends high-performing Cloud instances by profiling the application and the selected Cloud instances. The framework produces a set of objective scores called the A2Cloud scores, which denote the compatibility level between the application and the selected Cloud instances. When used alone, the A2Cloud scores become increasingly unwieldy with an increasing number of tested Cloud instances. Additionally, the framework only examines raw application performance and does not consider the execution cost to guide resource selection. To improve the usability of the framework and assist with economical instance selection, this research adds two Naïve Bayes (NB) classifiers that consider both the application's performance and execution cost. These NB classifiers include: 1) NB with a Random Forest Classifier (RFC) and 2) a standalone NB module. 
Naïve Bayes with a Random Forest Classifier (RFC) augments the A2Cloud-RF framework's final instance ratings with the execution cost metric. In the training phase, the classifier builds the frequency and probability tables. The classifier recommends a Cloud instance based on the highest posterior probability for the selected application. The standalone NB classifier uses the generated A2Cloud score (an intermediate result from the A2Cloud-RF framework) and the execution cost metric to construct an NB classifier. The NB classifier forms a frequency table and probability (prior and likelihood) tables. To recommend a Cloud instance for a test application, the classifier calculates the posterior probability for each Cloud instance and recommends the one with the highest posterior probability. This study executes eight real-world applications on 20 Cloud instances from AWS, Azure, GCP, and Linode. We train the NB classifiers using 80% of this dataset and employ the remaining 20% for testing. The testing yields more than 90% recommendation accuracy for the chosen applications and Cloud instances. Because of the imbalanced nature of the dataset and the multi-class nature of the classification, we describe model performance using the confusion matrix (true positives, false positives, true negatives, and false negatives) and the F1 score, which exceeds 0.9. The final goal of this research is to make Cloud computing an accessible resource for conducting high-performance scientific executions by enabling users to select an effective Cloud instance from across multiple providers.
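The recommend-by-highest-posterior step described in this abstract, a categorical Naïve Bayes built from frequency tables, can be sketched as below. The feature bands ("high"/"low" score and cost), instance labels, and training examples are entirely made up for the sketch; they are not the thesis's A2Cloud data or its actual classifier.

```python
from collections import defaultdict
from math import log

# Hypothetical training data: (score band, cost band) -> recommended instance.
train = [
    (("high", "low"),  "c5.xlarge"),
    (("high", "low"),  "c5.xlarge"),
    (("high", "high"), "n2-standard-8"),
    (("low",  "low"),  "e2-medium"),
    (("low",  "low"),  "e2-medium"),
    (("low",  "high"), "e2-medium"),
]

def fit_nb(data, alpha=1.0):
    """Build the prior (class frequencies) and per-feature likelihood
    tables of a categorical Naive Bayes, with Laplace smoothing."""
    classes = sorted({c for _, c in data})
    n_feats = len(data[0][0])
    prior = {c: sum(1 for _, cc in data if cc == c) / len(data) for c in classes}
    values = [sorted({x[i] for x, _ in data}) for i in range(n_feats)]
    like = defaultdict(dict)
    for c in classes:
        rows = [x for x, cc in data if cc == c]
        for i in range(n_feats):
            for v in values[i]:
                count = sum(1 for x in rows if x[i] == v)
                like[c][(i, v)] = (count + alpha) / (len(rows) + alpha * len(values[i]))
    return classes, prior, like

def recommend(x, classes, prior, like):
    """Return the Cloud instance (class) with the highest posterior."""
    post = {c: log(prior[c]) + sum(log(like[c][(i, v)]) for i, v in enumerate(x))
            for c in classes}
    return max(post, key=post.get)

classes, prior, like = fit_nb(train)
best = recommend(("high", "low"), classes, prior, like)
```

Working in log-probabilities, as here, is the standard way to avoid numerical underflow when many feature likelihoods are multiplied together.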
APA, Harvard, Vancouver, ISO, and other styles
