Dissertations / Theses on the topic 'LASSO regression models'

Consult the top 34 dissertations / theses for your research on the topic 'LASSO regression models.'


1

Patnaik, Kaushik. "Adaptive learning in lasso models." Thesis, Georgia Institute of Technology, 2015. http://hdl.handle.net/1853/54353.

Full text
Abstract:
Regression with L1 regularization, the Lasso, is a popular algorithm for recovering the sparsity pattern (also known as model selection) in linear models from observations contaminated by noise. We examine a scenario where a fraction of the zero covariates are highly correlated with non-zero covariates, making sparsity recovery difficult. We propose two methods that adaptively increment the regularization parameter to prune the Lasso solution set. We prove that the algorithms achieve consistent model selection with high probability while using fewer samples than the traditional Lasso. The algorithms can be extended to a broad set of L1-regularized M-estimators for linear statistical models.
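The thesis' two pruning algorithms are not reproduced here, but the core idea can be sketched in a few lines: fit a Lasso, grow the penalty, and stop when the estimated support stabilises. Everything below (data, growth factor, stopping rule) is an illustrative assumption, not the author's method.

```python
# Minimal sketch: prune a Lasso support by adaptively incrementing the
# regularization parameter until the support stops changing (a heuristic
# plateau rule, not the thesis algorithm).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.normal(size=(n, p))
X[:, 10] = X[:, 0] + 0.1 * rng.normal(size=n)  # zero covariate highly
                                               # correlated with a true one
beta = np.zeros(p)
beta[:5] = 2.0
y = X @ beta + rng.normal(size=n)

alpha, support = 0.01, None
while True:
    fit = Lasso(alpha=alpha, max_iter=10000).fit(X, y)
    new_support = set(np.flatnonzero(fit.coef_))
    if support is not None and new_support == support:
        break                      # support stabilized: stop incrementing
    support = new_support
    alpha *= 1.5                   # adaptively increment the penalty

print(sorted(support))
```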
2

Chen, Xiaohui. "Lasso-type sparse regression and high-dimensional Gaussian graphical models." Thesis, University of British Columbia, 2012. http://hdl.handle.net/2429/42271.

Full text
Abstract:
High-dimensional datasets, where the number of measured variables is larger than the sample size, are not uncommon in modern real-world applications such as functional Magnetic Resonance Imaging (fMRI) data. Conventional statistical signal processing tools and mathematical models can fail to handle such datasets. Therefore, developing statistically valid models and computationally efficient algorithms for high-dimensional situations is of great importance for tackling practical and scientific problems. This thesis focuses on two issues: (1) recovery of sparse regression coefficients in linear systems; (2) estimation of a high-dimensional covariance matrix and its inverse, both subject to additional random noise. In the first part, we focus on Lasso-type sparse linear regression. We propose two improved versions of the Lasso estimator for when the signal-to-noise ratio is low: (i) leveraging adaptive robust loss functions; (ii) adopting a fully Bayesian modeling framework. In solution (i), we propose a robust Lasso with a convex combined loss function and study its asymptotic behavior. We further extend the asymptotic analysis to the Huberized Lasso, which is shown to be consistent even if the noise distribution is Cauchy. In solution (ii), we propose a fully Bayesian Lasso by unifying a discrete prior on model size and a continuous prior on regression coefficients in a single modeling framework. Since the proposed Bayesian Lasso has variable model sizes, we propose a reversible-jump MCMC algorithm to obtain its numerical estimates. In the second part, we focus on the estimation of large covariance and precision matrices. In high-dimensional situations, the sample covariance is an inconsistent estimator; to address this, regularized estimation is needed. For covariance matrix estimation, we propose a shrinkage-to-tapering estimator and show that it has attractive theoretical properties for estimating general, large covariance matrices. For precision matrix estimation, we propose a computationally efficient algorithm based on the thresholding operator and a Neumann series expansion. We prove that the proposed estimator is consistent in several senses under the spectral norm. Moreover, we show that the proposed estimator is minimax in a class of precision matrices that are approximately inversely closed.
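As a rough illustration of solution (i) only, and not the thesis' estimator, an L1-penalized regression with a Huber loss (a "Huberized Lasso" in spirit) can be fit by stochastic gradient descent; the hyperparameters below are placeholders.

```python
# Illustrative only: L1-penalized regression with Huber loss, fit by SGD,
# on data with heavy-tailed (Cauchy) noise. Not the thesis' estimator.
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n, p = 200, 100
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 3.0
y = X @ beta + rng.standard_cauchy(size=n)   # heavy-tailed noise

Xs = StandardScaler().fit_transform(X)
model = SGDRegressor(loss="huber", epsilon=1.35, penalty="l1",
                     alpha=0.01, max_iter=5000, tol=1e-6, random_state=0)
model.fit(Xs, y)
print(np.flatnonzero(np.abs(model.coef_) > 1e-3))  # recovered support
```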
3

Olaya Bucaro, Orlando. "Predicting risk of cyberbullying victimization using lasso regression." Thesis, Uppsala universitet, Statistiska institutionen, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-338767.

Full text
Abstract:
The increased online presence and use of technology by today’s adolescents has created new places where bullying can occur. The aim of this thesis is to specify a model that can accurately predict the risk of cyberbullying victimization. The data come from a survey conducted at five secondary schools in Pereira, Colombia. A logistic regression model with random effects is used to predict cyberbullying exposure, with predictors selected by the lasso, tuned by cross-validation. Candidate covariates include demographic, dietary habit, parental mediation, school performance, physical health, mental health, and health risk variables such as alcohol and drug consumption. The final model retains the demographic, mental health, and parental mediation variables, and excludes the dietary habit, school performance, physical health, and health risk variables. The final model has an overall prediction accuracy of 88%.
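A minimal sketch of the modelling pipeline described above (omitting the random effects): a cross-validated L1-penalized logistic regression. The synthetic covariates stand in for the survey variables.

```python
# Sketch: L1-penalized logistic regression with CV-tuned penalty strength.
# Data are synthetic placeholders for the survey covariates.
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
n, p = 500, 30
X = rng.normal(size=(n, p))                     # survey covariates
logit = -1.5 + 1.2 * X[:, 0] - 0.8 * X[:, 1]    # e.g. mental-health items
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))   # victimization indicator

clf = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(penalty="l1", solver="saga", Cs=20,
                         cv=10, scoring="accuracy", max_iter=5000),
)
clf.fit(X, y)
coefs = clf.named_steps["logisticregressioncv"].coef_.ravel()
print("selected covariates:", np.flatnonzero(coefs))
print("overall accuracy:", clf.score(X, y))
```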
4

Mo, Lili. "A class of operator splitting methods for least absolute shrinkage and selection operator (LASSO) models." HKBU Institutional Repository, 2012. https://repository.hkbu.edu.hk/etd_ra/1391.

Full text
5

Miller, Ryan. "Marginal false discovery rate approaches to inference on penalized regression models." Diss., University of Iowa, 2018. https://ir.uiowa.edu/etd/6474.

Full text
Abstract:
Data containing large numbers of variables are becoming increasingly common, and sparsity-inducing penalized regression methods, such as the lasso, have become a popular analysis tool for these datasets due to their ability to naturally perform variable selection. However, quantifying the importance of the variables selected by these models is a difficult task. These difficulties are compounded by the tendency of the most predictive models, for example those chosen using procedures like cross-validation, to include substantial numbers of noise variables with no real relationship to the outcome. To address the task of performing inference on penalized regression models, this thesis proposes false discovery rate approaches for a broad class of such models. This work includes the development of an upper bound on the number of noise variables in a model, as well as local false discovery rate approaches that quantify the likelihood of each individual selection being a false discovery. These methods are applicable to a wide range of penalties, such as the lasso, elastic net, SCAD, and MCP; a wide range of models, including linear regression, generalized linear models, and Cox proportional hazards models; and are also extended to the group regression setting under the group lasso penalty. In addition to studying these methods using numerous simulation studies, their practical utility is demonstrated using real data from several high-dimensional genome-wide association studies.
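The motivating phenomenon, a cross-validation-tuned lasso admitting noise variables, is easy to reproduce. The sketch below uses arbitrary synthetic settings and is not the thesis' false discovery rate machinery.

```python
# Illustrates the claim above: a CV-tuned lasso tends to select noise
# variables alongside the true signals.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n, p, k = 100, 200, 5                    # k true signals among p features
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:k] = 1.0
y = X @ beta + rng.normal(size=n)

fit = LassoCV(cv=10, max_iter=20000).fit(X, y)
selected = np.flatnonzero(fit.coef_)
false_sel = [j for j in selected if j >= k]
print(f"{len(selected)} selected, of which {len(false_sel)} are noise")
```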
6

Marques, Matheus Augustus Pumputis. "Análise e comparação de alguns métodos alternativos de seleção de variáveis preditoras no modelo de regressão linear." Universidade de São Paulo, 2018. http://www.teses.usp.br/teses/disponiveis/45/45133/tde-23082018-210710/.

Full text
Abstract:
In this work, some new variable selection methods that have appeared in the last 15 years in the context of linear regression are studied, specifically LARS (Least Angle Regression), NAMS (Noise Addition Model Selection), the False Selection Rate (FSR), the Bayesian LASSO, and the Spike-and-Slab LASSO. The methodology consists of the analysis and comparison of the studied methods. After this study, applications to real data sets are made, as well as a simulation study, in which all methods prove promising, with the Bayesian methods showing the best results.
7

Zhai, Jing, Chiu-Hsieh Hsu, and Z. John Daye. "Ridle for sparse regression with mandatory covariates with application to the genetic assessment of histologic grades of breast cancer." BIOMED CENTRAL LTD, 2017. http://hdl.handle.net/10150/622811.

Full text
Abstract:
Background: Many questions in statistical genomics can be formulated in terms of variable selection of candidate biological factors for modeling a trait or quantity of interest. Often, in these applications, additional covariates describing clinical, demographic or experimental effects must be included a priori as mandatory covariates, while allowing the selection of a large number of candidate or optional variables. As genomic studies routinely require mandatory covariates, it is of interest to propose principled variable selection methods that can incorporate them. Methods: In this article, we propose the ridge-lasso hybrid estimator (ridle), a new penalized regression method that simultaneously estimates coefficients of mandatory covariates while allowing selection for others. The ridle provides a principled approach to mitigating the effects of multicollinearity among the mandatory covariates and possible dependency between mandatory and optional variables. We provide detailed empirical and theoretical studies to evaluate our method. In addition, we develop an efficient algorithm for the ridle. Software, based on efficient Fortran code with R-language wrappers, is publicly and freely available at https://sites.google.com/site/zhongyindaye/software. Results: The ridle is useful when mandatory predictors are known to be significant due to prior knowledge or must be kept for additional analysis. Both theoretical and comprehensive simulation studies have shown the ridle to be advantageous when mandatory covariates are correlated with irrelevant optional predictors or are highly correlated among themselves. A microarray gene expression analysis of the histologic grades of breast cancer identified 24 genes, 2 of which are selected only by the ridle among current methods and found to be associated with tumor grade. Conclusions: In this article, we proposed the ridle as a principled sparse regression method for the selection of optional variables while incorporating mandatory ones. Results suggest that the ridle is advantageous when mandatory covariates are correlated with irrelevant optional predictors or are highly correlated among themselves.
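The authors' implementation is the Fortran/R package linked above. Purely as an illustration of the penalty structure (ridge on mandatory covariates, lasso on optional ones), a convex-programming sketch might look like the following, with placeholder penalty weights lam_ridge and lam_lasso.

```python
# Rough sketch of the ridle's penalty structure in cvxpy: l2 penalty on
# mandatory coefficients, l1 penalty on optional ones. Not the authors'
# algorithm; penalty weights are placeholders.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n, p_mand, p_opt = 120, 3, 200
Xm = rng.normal(size=(n, p_mand))         # clinical/demographic covariates
Xo = rng.normal(size=(n, p_opt))          # candidate gene expressions
y = Xm @ np.array([1.0, -0.5, 0.8]) + 2.0 * Xo[:, 0] + rng.normal(size=n)

bm, bo = cp.Variable(p_mand), cp.Variable(p_opt)
lam_ridge, lam_lasso = 0.5, 5.0
loss = cp.sum_squares(y - Xm @ bm - Xo @ bo)
prob = cp.Problem(cp.Minimize(loss + lam_ridge * cp.sum_squares(bm)
                              + lam_lasso * cp.norm1(bo)))
prob.solve()
print("mandatory coefs:", np.round(bm.value, 2))
print("selected optional:", np.flatnonzero(np.abs(bo.value) > 1e-4))
```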
8

Song, Song. "Confidence bands in quantile regression and generalized dynamic semiparametric factor models." Doctoral thesis, Humboldt-Universität zu Berlin, Wirtschaftswissenschaftliche Fakultät, 2010. http://dx.doi.org/10.18452/16341.

Full text
Abstract:
In many applications it is necessary to know the stochastic fluctuation of the maximal deviations of nonparametric quantile estimates, e.g. to check various parametric models. Uniform confidence bands are therefore constructed for nonparametric quantile estimates of regression functions. The first method is based on strong approximations of the empirical process and extreme value theory; the strong uniform consistency rate is also established under general conditions. The second method is based on the bootstrap resampling method, and it is proved that the bootstrap approximation provides a substantial improvement. The case of multidimensional and discrete regressor variables is dealt with using a partial linear model, and a labor market analysis is provided to illustrate the method. High-dimensional time series which reveal nonstationary and possibly periodic behavior occur frequently in many fields of science, e.g. macroeconomics, meteorology, medicine and financial engineering. A common approach is to split the modeling of a high-dimensional time series into the time propagation of low-dimensional time series and high-dimensional time-invariant functions via dynamic factor analysis. We propose a two-step estimation procedure. In the first step, we detrend the time series by incorporating a time basis selected by a group Lasso-type technique and choose the space basis based on smoothed functional principal component analysis. We show properties of this estimator under the dependent scenario. In the second step, we obtain the detrended low-dimensional (stationary) stochastic process.
9

Sawert, Marcus. "Predicting deliveries from suppliers : A comparison of predictive models." Thesis, Mittuniversitetet, Institutionen för informationssystem och –teknologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-39314.

Full text
Abstract:
In the highly competitive environment that companies find themselves in today, it is key to have a well-functioning supply chain. For manufacturing companies, a good supply chain depends on functioning production planning, which tries to fulfill demand while considering the resources available. This is complicated by the uncertainties that exist, such as uncertainty in demand, in manufacturing and in supply. Several methods and models have been created to deal with production planning under uncertainty, but they often overlook the complexity of supply uncertainty by treating it as purely stochastic. To improve these models, a prediction based on earlier data regarding the supplier or item could be used to estimate when a delivery is likely to arrive. This study compared different predictive models to see which one is best suited for this purpose. Historical data on earlier deliveries were gathered from a large international manufacturing company and preprocessed before being used in the models. The target value the models were to predict was the actual delivery time from the supplier. The data were then tested with the following four regression models in Python: linear regression, ridge regression, the Lasso, and the elastic net. The results were calculated by cross-validation and presented as the mean absolute error together with its standard deviation. The results showed that the elastic net was the overall best performing model, and that linear regression performed worst.
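The comparison described above condenses into a few lines of scikit-learn; the data and features below are synthetic placeholders for the confidential delivery records.

```python
# Compact version of the comparison: four linear models scored by
# cross-validated mean absolute error on synthetic delivery-time data.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
n = 400
X = rng.normal(size=(n, 12))                # e.g. supplier/item features
y = 10 + X @ rng.normal(size=12) + rng.normal(scale=2.0, size=n)  # days

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
    "elastic net": ElasticNet(alpha=0.1, l1_ratio=0.5),
}
for name, est in models.items():
    pipe = make_pipeline(StandardScaler(), est)
    scores = -cross_val_score(pipe, X, y, cv=10,
                              scoring="neg_mean_absolute_error")
    print(f"{name:12s} MAE = {scores.mean():.3f} +/- {scores.std():.3f}")
```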
10

Yu, Lili. "Variable selection in the general linear model for censored data." Columbus, Ohio : Ohio State University, 2007. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1173279515.

Full text
11

Anderskär, Erika, and Frida Thomasson. "Inkrementell responsanalys av Scandnavian Airlines medlemmar : Vilka kunder ska väljas vid riktad marknadsföring?" Thesis, Linköpings universitet, Statistik och maskininlärning, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-139465.

Full text
Abstract:
Scandinavian Airlines has a large database of its Eurobonus members. To analyze which customers should be targeted with direct marketing, such as emails, uplift models have been used. With a binary response variable that indicates whether the customer has bought or not, and a binary dummy variable that indicates whether the customer has received the campaign or not, conclusions can be drawn about which customers are persuadable, i.e. the customers that buy when they receive a campaign and do not buy otherwise. Analyses were done with one campaign for Sweden and for Scandinavia. The methods used are logistic regression with the Lasso and logistic regression with Penalized Net Information Value (PNIV). The best method for predicting purchases is Lasso regression when compared using a confusion matrix. The variable that best describes persuadable customers in logistic regression with PNIV is Flown (customers that have flown with SAS within the last six months). In Lasso regression, the variable that describes a persuadable customer in Sweden is membership level 1 (the first level of membership), and in Scandinavia customers that receive campaigns with delivery code 13, a form of dispatch, are persuadable.
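A hedged sketch of one common uplift formulation in the spirit of the abstract: an L1-penalized logistic regression with treatment-covariate interactions, scoring uplift as P(buy | treated) − P(buy | control). This is not the thesis' PNIV method, and all data are synthetic.

```python
# Uplift ("incremental response") sketch: lasso-penalized logistic model
# with treatment interactions; uplift = P(buy|treated) - P(buy|control).
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(6)
n, p = 2000, 10
X = rng.normal(size=(n, p))                  # member covariates
t = rng.binomial(1, 0.5, size=n)             # 1 = received the campaign
logit = -1 + 0.5 * X[:, 0] + t * (0.8 * X[:, 1])  # persuadable via X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))     # bought (0/1)

def design(X, t):
    # main effects, treatment indicator, and treatment interactions
    return np.hstack([X, t[:, None], X * t[:, None]])

clf = LogisticRegressionCV(penalty="l1", solver="saga", cv=5,
                           max_iter=5000).fit(design(X, t), y)
uplift = (clf.predict_proba(design(X, np.ones(n)))[:, 1]
          - clf.predict_proba(design(X, np.zeros(n)))[:, 1])
print("top persuadable customers:", np.argsort(-uplift)[:5])
```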
12

Lundberg, Jacob. "Resource Efficient Representation of Machine Learning Models : investigating optimization options for decision trees in embedded systems." Thesis, Linköpings universitet, Statistik och maskininlärning, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-162013.

Full text
Abstract:
Combining embedded systems and machine learning models is an exciting prospect. However, to fully target any embedded system, with the most stringent resource requirements, the models have to be designed with care not to overwhelm it. Decision tree ensembles are targeted in this thesis. A benchmark model is created with LightGBM, a popular framework for gradient boosted decision trees. This model is first transformed and regularized with RuleFit, a LASSO regression framework. Then it is further optimized with quantization and weight sharing, techniques used when compressing neural networks. The entire process is combined into a novel framework, called ESRule. The data used come from the domain of frequency measurements in cellular networks, where there is a clear use-case for embedded systems running the resource-optimized models. Compared with LightGBM, ESRule uses 72× less internal memory on average, while simultaneously increasing predictive performance. The models use 4 kilobytes on average, and the serialized variant of ESRule uses 104× less hard disk space than LightGBM. ESRule is also clearly faster at predicting a single sample.
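As a toy illustration of the quantization/weight-sharing step (not ESRule itself), the leaf values of an ensemble can be mapped to a small shared codebook so that many leaves store a single weight index.

```python
# Quantize tree leaf values into a 4-bit codebook so leaves share weights.
# Illustrative of the compression idea only; values are synthetic.
import numpy as np

rng = np.random.default_rng(7)
leaf_values = rng.normal(size=500)           # leaf outputs of an ensemble

n_levels = 16                                # 4-bit codebook
edges = np.quantile(leaf_values, np.linspace(0, 1, n_levels + 1))
codes = np.clip(np.digitize(leaf_values, edges[1:-1]), 0, n_levels - 1)
codebook = np.array([leaf_values[codes == c].mean()
                     for c in range(n_levels)])

quantized = codebook[codes]                  # shared weights, 4 bits/leaf
print("max abs rounding error:", np.abs(quantized - leaf_values).max())
```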
13

Karmann, Clémence. "Inférence de réseaux pour modèles inflatés en zéro." Thesis, Université de Lorraine, 2019. http://www.theses.fr/2019LORR0146/document.

Full text
Abstract:
Network inference, or graph inference, has more and more applications, particularly in human health and environmental science, for the study of microbiological and genomic data. Networks are an appropriate tool to represent, and even study, relationships between entities, and many mathematical estimation techniques have been developed, particularly for Gaussian graphical models but also for binary or mixed data. The processing of abundance data (of microorganisms such as bacteria, for example) is particular for two reasons: on the one hand, the data do not directly reflect reality, because a sequencing process takes place to duplicate species and this process brings variability; on the other hand, a species may be absent in some samples. We are then in the setting of zero-inflated data. Many graph inference methods exist for Gaussian, binary and mixed data, but zero-inflated models are rarely studied, although they reflect the structure of many data sets in a relevant way. The objective of this thesis is network inference for zero-inflated models, restricted to conditional dependency graphs. The work is divided into two main parts. The first concerns graph inference methods based on the estimation of neighbourhoods by a procedure combining ordinal regression models and variable selection methods. The second focuses on graph inference in a model where the variables are zero-inflated Gaussians obtained by double truncation (right and left).
14

Huynh, Bao Tuyen. "Estimation and feature selection in high-dimensional mixtures-of-experts models." Thesis, Normandie, 2019. http://www.theses.fr/2019NORMC237.

Full text
Abstract:
This thesis deals with the problem of modeling and estimating high-dimensional mixture-of-experts (MoE) models, towards effective density estimation, prediction and clustering of such heterogeneous and high-dimensional data. We propose new strategies based on regularized maximum-likelihood estimation (MLE) of MoE models to overcome the limitations of standard methods, including MLE with expectation-maximization (EM) algorithms, and to simultaneously perform feature selection so that sparse models are encouraged in such a high-dimensional setting. We first introduce a methodology for parameter estimation and variable selection in mixtures of experts, based on l1 (lasso) regularization and the EM framework, for regression and clustering suited to high-dimensional contexts. We then extend the method to regularized mixtures of experts for discrete data, including classification. We develop efficient algorithms to maximize the proposed l1-penalized observed-data log-likelihood function. Our proposed strategies enjoy efficient monotone maximization of the optimized criterion and, unlike previous approaches, do not rely on approximations of the penalty functions, avoid matrix inversion, and exploit the efficiency of the coordinate ascent algorithm, particularly within the proximal Newton-based approach.
15

Liu, Li. "Grouped variable selection in high dimensional partially linear additive Cox model." Diss., University of Iowa, 2010. https://ir.uiowa.edu/etd/847.

Full text
Abstract:
In the analysis of survival outcomes supplemented with both clinical information and high-dimensional gene expression data, the traditional Cox proportional hazards model fails to meet some emerging needs in biological research. First, the number of covariates is generally much larger than the sample size. Secondly, predicting an outcome with individual gene expressions is inadequate because a gene's expression is regulated by multiple biological processes and functional units. There is a need to understand the impact of changes at a higher level such as molecular function, cellular component, biological process, or pathway. The change at a higher level is usually measured with a set of gene expressions related to the biological process. That is, we need to model the outcome with gene sets as variable groups, where the gene sets may also partially overlap. In this thesis work, we investigate the impact of a penalized Cox regression procedure on regularization, parameter estimation, variable group selection, and nonparametric modeling of nonlinear effects with a time-to-event outcome. We formulate the problem as a partially linear additive Cox model with high-dimensional data. We group genes into gene sets and approximate the nonparametric components by truncated series expansions with B-spline bases. After grouping and approximation, the problem of variable selection becomes that of selecting groups of coefficients in a gene set or in an approximation. We apply the group Lasso to obtain an initial solution path and reduce the dimension of the problem, and then update the whole solution path with the adaptive group Lasso. We also propose a generalized group Lasso method that provides more freedom in specifying the penalty and in excluding covariates from being penalized. A modified Newton-Raphson method is designed for stable and rapid computation. The core programs are written in the C language, and a user-friendly R interface is implemented to perform all the calculations by calling the core programs. We demonstrate the asymptotic properties of the proposed methods. Simulation studies are carried out to evaluate the finite sample performance of the proposed procedure using several tuning parameter selection methods for choosing the point on the solution path as the final estimator. We also apply the proposed approach to two real data examples.
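The thesis' adaptive group Lasso for the Cox model is substantially more involved; purely as a self-contained illustration of the group-selection principle, here is a plain linear-regression group lasso fit by proximal gradient descent with block soft-thresholding (all settings arbitrary).

```python
# Group lasso for linear regression via proximal gradient descent.
# Objective: (1/2n)||y - Xb||^2 + lam * sum_g sqrt(|g|) * ||b_g||_2.
import numpy as np

def group_lasso(X, y, groups, lam, n_iter=2000):
    """groups: list of index arrays, one per (possibly spline-expanded) group."""
    n, p = X.shape
    beta = np.zeros(p)
    step = n / np.linalg.norm(X, 2) ** 2      # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        z = beta - step * grad
        for g in groups:                      # block soft-thresholding
            norm_g = np.linalg.norm(z[g])
            if norm_g > 0:
                shrink = max(0.0, 1 - step * lam * np.sqrt(len(g)) / norm_g)
            else:
                shrink = 0.0
            beta[g] = shrink * z[g]
    return beta

rng = np.random.default_rng(8)
n = 200
groups = [np.arange(5 * j, 5 * (j + 1)) for j in range(10)]  # 10 groups of 5
X = rng.normal(size=(n, 50))
beta_true = np.zeros(50)
beta_true[groups[0]] = 1.0                   # one active group
y = X @ beta_true + rng.normal(size=n)

beta_hat = group_lasso(X, y, groups, lam=0.3)
active = [j for j, g in enumerate(groups)
          if np.linalg.norm(beta_hat[g]) > 1e-6]
print("active groups:", active)
```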
16

Shah, Smit. "Comparison of Some Improved Estimators for Linear Regression Model under Different Conditions." FIU Digital Commons, 2015. http://digitalcommons.fiu.edu/etd/1853.

Full text
Abstract:
The multiple linear regression model plays a key role in statistical inference and has extensive applications in business, environmental, physical and social sciences. Multicollinearity has been a considerable problem in multiple regression analysis: when the regressor variables are multicollinear, it becomes difficult to make precise statistical inferences about the regression coefficients. The statistical methods discussed in this thesis are the ridge regression, Liu, two-parameter biased and LASSO estimators. Firstly, an analytical comparison on the basis of risk was made among the ridge, Liu and LASSO estimators under the orthonormal regression model. I found that LASSO dominates the least squares, ridge and Liu estimators over a significant portion of the parameter space in large dimensions. Secondly, a simulation study was conducted to compare the performance of the ridge, Liu and two-parameter biased estimators by the mean squared error criterion. I found that the two-parameter biased estimator performs better than its corresponding ridge regression estimator. Overall, the Liu estimator performs better than both the ridge and two-parameter biased estimators.
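The orthonormal-design setting used for the analytical comparison admits closed forms, which a two-line check makes concrete: ridge rescales the least-squares estimate, while the LASSO soft-thresholds it (the penalty value here is arbitrary).

```python
# Closed forms under X'X = I: ridge shrinks all coefficients uniformly,
# the lasso soft-thresholds and zeroes out small ones.
import numpy as np

beta_ls = np.array([3.0, 1.2, 0.3, -0.1])   # least-squares estimates
lam = 0.5

beta_ridge = beta_ls / (1 + lam)                                   # shrink
beta_lasso = np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0)  # sparsify

print("ridge:", beta_ridge)
print("lasso:", beta_lasso)   # small coefficients are set exactly to zero
```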
17

Kim, Byung-Jun. "Semiparametric and Nonparametric Methods for Complex Data." Diss., Virginia Tech, 2020. http://hdl.handle.net/10919/99155.

Full text
Abstract:
A variety of complex data has emerged in many research fields such as epidemiology, genomics, and analytical chemistry with the development of science, technology, and design schemes over the past few decades. For example, in epidemiology, the matched case-crossover study design is used to investigate the association between clustered binary outcomes of disease and a covariate measured with error within a certain period by stratifying subjects' conditions. In genomics, highly correlated and high-dimensional (HCHD) data are required to identify important genes and their interaction effects on diseases. In analytical chemistry, multiple time series data are generated to recognize complex patterns among multiple classes. Given this diversity, we encounter three problems in analyzing such complex data in this dissertation, and we provide several contributions to semiparametric and nonparametric methods for dealing with them: the first is to propose a method for testing the significance of a functional association under the matched study; the second is to develop a method to simultaneously identify important variables and build a network in HCHD data; the third is to propose a multi-class dynamic model for recognizing patterns in time-trend analysis. For the first topic, we propose a semiparametric omnibus test for the significance of a functional association between clustered binary outcomes and covariates with measurement error, taking into account the effect modification of matching covariates. We develop a flexible omnibus test that does not require a specific alternative form of the hypothesis. The advantages of our omnibus test are demonstrated through simulation studies and 1-4 bidirectional matched data analyses from an epidemiology study. For the second topic, we propose a joint semiparametric kernel machine network approach that connects variable selection and network estimation. Our approach is a unified and integrated method that can simultaneously identify important variables and build a network among them. We develop our approach under a semiparametric kernel machine regression framework, which allows for the possibility that each variable is nonlinear and likely to interact with others in a complicated way. We demonstrate our approach using simulation studies and a real application to genetic pathway analysis. Lastly, for the third project, we propose a Bayesian focal-area detection method for a multi-class dynamic model under a Bayesian hierarchical framework. Two-step Bayesian sequential procedures are developed to estimate patterns and detect focal intervals, which can be used for gas chromatography. We demonstrate the performance of our proposed method using a simulation study and a real application to the Fast Odor Chromatographic Sniffer (FOX) gas chromatography system.
Doctor of Philosophy
A variety of complex data has emerged in many research fields such as epidemiology, genomics, and analytical chemistry with the development of science, technology, and design schemes over the past few decades. For example, in epidemiology, the matched case-crossover study design is used to investigate the association between clustered binary outcomes of disease and a covariate measured with error within a certain period by stratifying subjects' conditions. In genomics, highly correlated and high-dimensional (HCHD) data are required to identify important genes and their interaction effects on diseases. In analytical chemistry, multiple time series data are generated to recognize complex patterns among multiple classes. Given this diversity, we encounter three problems in analyzing the following three types of data: (1) matched case-crossover data, (2) HCHD data, and (3) time series data. We contribute to the development of statistical methods to deal with such complex data. First, under the matched study, we discuss an idea for hypothesis testing to effectively determine the association between observed factors and the risk of a disease of interest. Because, in practice, we do not know the specific form of the association, it can be challenging to set a specific alternative hypothesis. Reflecting reality, we consider the possibility that some observations are measured with errors, and by accounting for these measurement errors we develop a testing procedure under the matched case-crossover framework. This testing procedure has the flexibility to make inferences under various hypothesis settings. Second, we consider data where the number of variables is very large compared to the sample size and the variables are correlated with each other. In this case, our goal is to identify the variables important for an outcome among the large number of variables and to build their network. For example, identifying a few genes associated with diabetes among the whole genome can be used to develop biomarkers. With our proposed approach in the second project, we can identify differentially expressed and important genes and their network structure while accounting for the outcome. Lastly, we consider the scenario of patterns of interest changing over time, with application to gas chromatography. We propose an efficient detection method to effectively distinguish the patterns of multi-level subjects in time-trend analysis, which can provide valuable information for efficiently finding distinguishable patterns and so reduce the burden of examining all observations in the data.
18

Mukhopadhyay, Shraddha. "Comparison of existing ZOI estimation methods with different model specifications and data." Thesis, Högskolan Dalarna, Mikrodataanalys, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-34397.

Full text
Abstract:
With the increasing demand for and interest in wind power worldwide, it is interesting to study the effects of operating wind farms on the activity of reindeer and to estimate the associated Zone of Influence (ZOI) relative to these disturbances. Through simulation, Hierarchical Likelihood (HL) and adaptive Lasso methods are used to estimate the ZOI of wind farms and to identify the threshold at which the negative effect of the disturbances on reindeer behaviour disappears. The results lend some support to the explanation that the negative effect may not disappear abruptly, and more support to the finding that a linear model was still a better choice than the smooth polynomial models used. Real-life data on reindeer faecal pellet counts from an area in northern Sweden where wind farms were operating were analyzed. The yearly time series data were divided into three periods: before construction, during construction, and during operation of the wind farms. Logistic regression, segmented-model and HL methods were implemented for the data analysis, using covariates such as distance from the wind turbines, vegetation type, and the interaction between distance to the wind turbines and time period. A significant breakpoint could be estimated using the segmented model at a distance of 2.8 km from an operating wind farm, after which the negative effects of the wind farm on reindeer activity disappeared. However, further work is needed on estimating the ZOI with the HL method and on considering other possible factors disturbing reindeer habitat and behaviour.
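The adaptive Lasso used above can be sketched with the standard column-rescaling trick (initial weights from a ridge fit, γ = 1); the data below are synthetic stand-ins, since the pellet-count data are not reproduced here.

```python
# Adaptive lasso via rescaling: weight each column by an initial estimate,
# run an ordinary lasso, then map coefficients back to the original scale.
import numpy as np
from sklearn.linear_model import LassoCV, Ridge

rng = np.random.default_rng(9)
n, p = 300, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.0, 0.5]
y = X @ beta + rng.normal(size=n)

w = np.abs(Ridge(alpha=1.0).fit(X, y).coef_)   # initial weights (gamma = 1)
Xw = X * w                                     # rescale columns
fit = LassoCV(cv=10).fit(Xw, y)
beta_adaptive = fit.coef_ * w                  # back to the original scale
print("selected:", np.flatnonzero(beta_adaptive))
```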
19

Chu, Shuyu. "Change Detection and Analysis of Data with Heterogeneous Structures." Diss., Virginia Tech, 2017. http://hdl.handle.net/10919/78613.

Full text
Abstract:
Heterogeneous data with different characteristics are ubiquitous in the modern digital world. For example, the observations collected from a process may change in mean or variance. In numerous applications, data are often of mixed types, including both discrete and continuous variables. Heterogeneity also commonly arises when underlying models vary across different segments of the data. Besides, the underlying pattern of the data may change along different dimensions, such as time and space. The diversity of heterogeneous data structures makes statistical modeling and analysis challenging. Detection of change-points in heterogeneous data has attracted great attention from a variety of application areas, such as quality control in manufacturing, protest event detection in social science, purchase likelihood prediction in business analytics, and organ state change in biomedical engineering. However, due to the extraordinary diversity of heterogeneous data structures and the complexity of the underlying dynamic patterns, change detection and analysis of such data is quite challenging. This dissertation aims to develop novel statistical modeling methodologies to analyze four types of heterogeneous data and to find change-points efficiently. The proposed approaches have been applied to solve real-world problems and can potentially be applied to a broad range of areas.
Ph. D.
20

Bécu, Jean-Michel. "Contrôle des fausses découvertes lors de la sélection de variables en grande dimension." Thesis, Compiègne, 2016. http://www.theses.fr/2016COMP2264/document.

Full text
Abstract:
In the regression framework, many studies focus on the so-called high-dimensional problem, where the number of explanatory variables measured on each sample is much larger than the sample size. While variable selection is a classical question, the usual methods do not apply in the high-dimensional case. In this manuscript, we therefore develop the transposition of classical statistical tests to the high-dimensional setting. These tests are built on estimates of the regression coefficients obtained by penalized linear regression, which is applicable in high dimensions, and their main objective is to control the false discovery rate. The first contribution of this manuscript quantifies the uncertainty of regression coefficients estimated by ridge regression, which penalizes the coefficients through their l2 norm, in the high-dimensional setting; for this we devise a statistical test based on permutations. The second contribution is a two-step selection approach: a first screening step, based on the sparse Lasso regression, precedes the selection step proper, in which the relevance of the pre-selected variables is tested. The tests are built on the adaptive ridge estimator, whose penalty is constructed from the Lasso regression coefficients learned during the screening step. A final contribution transposes this approach to the selection of groups of variables.
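The manuscript's permutation and adaptive-ridge tests are not reproduced here; a simplified cousin of the two-step idea, screening with the lasso on one half of the data and then testing the pre-selected variables by ordinary least squares on the held-out half, looks like this.

```python
# Screen-then-clean sketch: lasso screening on one half, OLS tests on the
# other half (valid because the halves are independent). Not the thesis'
# adaptive-ridge procedure.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(17)
n, p = 200, 500
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:4] = 1.5
y = X @ beta + rng.normal(size=n)

half = n // 2
screen = np.flatnonzero(LassoCV(cv=5).fit(X[:half], y[:half]).coef_)
ols = sm.OLS(y[half:], sm.add_constant(X[half:][:, screen])).fit()
print("screened:", screen)
print("p-values:", np.round(ols.pvalues[1:], 4))  # computed on held-out data
```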
21

Tjärnberg, Andreas. "Exploring the Boundaries of Gene Regulatory Network Inference." Doctoral thesis, Stockholms universitet, Institutionen för biokemi och biofysik, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:su:diva-122149.

Full text
Abstract:
To understand how the components of a complex system like the biological cell interact and regulate each other, we need to collect data on how the components respond to system perturbations. Such data can then be used to solve the inverse problem of inferring a network that describes how the pieces influence each other. The work in this thesis deals with modelling the cell regulatory system, often represented as a network, with tools and concepts derived from systems biology. The first investigation focuses on network sparsity and the algorithmic biases introduced by penalised network inference procedures. Many contemporary network inference methods rely on a sparsity parameter, such as the L1 penalty term used in the LASSO, and a poor choice of this parameter can give highly incorrect network estimates. In order to avoid such poor choices, we devised a method to optimise the sparsity parameter so that the accuracy of the inferred network is maximised. We showed that it is effective on in silico data sets with a reasonable level of informativeness and demonstrated that accurate prediction of network sparsity is key to elucidating the correct network parameters. The second investigation focuses on how knowledge from association networks can be transferred to regulatory network inference procedures. The quality of expression data is often inadequate for reliable gene regulatory network inference; we therefore constructed an algorithm to incorporate prior knowledge and demonstrated that it increases the accuracy of network inference when data quality is low. The third investigation aimed to understand the influence of system and data properties on network inference accuracy. L1 regularisation methods commonly produce poor network estimates when the data used for inference are ill-conditioned, even when the signal-to-noise ratio is so high that all links in the network can be proven to exist at the given significance level. In this study we elucidated some general principles for the conditions under which we expect strongly degraded accuracy. Moreover, this allowed us to estimate expected accuracy from the conditions of simulated data, which was used to predict the performance of inference algorithms on biological data. Finally, we built a software package, GeneSPIDER, for solving problems encountered during the previous investigations. The package supports highly controllable network and data generation as well as data analysis and exploration in the context of network inference.
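A toy version of the sparsity-parameter question studied above: infer a network by lasso neighbourhood selection and scan the penalty to see how the chosen sparsity drives accuracy against a known simulated network. This is a generic sketch, not GeneSPIDER.

```python
# Lasso neighbourhood selection (Meinshausen-Buhlmann style) with a scan
# over the sparsity parameter; accuracy is measured against the simulated
# ground-truth network.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(10)
p, n = 15, 300
A_true = np.triu(rng.random((p, p)) < 0.15, k=1)  # random sparse network
A_true = A_true | A_true.T
W = 0.25 * A_true
prec = W + np.diag(np.abs(W).sum(axis=1) + 0.5)   # diag. dominant => PD
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(prec), size=n)

def infer(alpha):
    A = np.zeros((p, p), dtype=bool)
    for j in range(p):                            # regress node j on the rest
        others = [k for k in range(p) if k != j]
        coef = Lasso(alpha=alpha, max_iter=10000).fit(X[:, others],
                                                      X[:, j]).coef_
        A[j, others] = np.abs(coef) > 1e-8
    return A | A.T                                # symmetrize ("or" rule)

for alpha in [0.01, 0.05, 0.1, 0.3]:
    A = infer(alpha)
    tp, fp = (A & A_true).sum() // 2, (A & ~A_true).sum() // 2
    print(f"alpha={alpha}: {A.sum() // 2} links, TP={tp}, FP={fp}")
```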

At the time of the doctoral defense, the following paper was unpublished and had a status as follows: Paper 4: Manuscript.
22

Liley, Albert James. "Statistical co-analysis of high-dimensional association studies." Thesis, University of Cambridge, 2017. https://www.repository.cam.ac.uk/handle/1810/270628.

Full text
Abstract:
Modern medical practice and science involve complex phenotypic definitions. Understanding patterns of association across this range of phenotypes requires co-analysis of high-dimensional association studies in order to characterise shared and distinct elements. In this thesis I address several problems in this area, with the general linking aim of making more efficient use of available data. The main application of these methods is in the analysis of genome-wide association studies (GWAS) and similar studies. Firstly, I developed methodology for a Bayesian conditional false discovery rate (cFDR) for leveraging GWAS results using summary statistics from a related disease. I extended an existing method to enable a shared-control design, increasing power and applicability, and developed an approximate bound on the false discovery rate (FDR) for the procedure. Using the new method I identified several new variant-disease associations. I then developed a second application of the shared-control design in the context of study replication, enabling improved power at the cost of changing the spectrum of sensitivity to systematic errors in study cohorts. This has application in studies on rare diseases or in between-case analyses. I then developed a method for partially characterising heterogeneity within a disease by modelling the bivariate distribution of case-control and within-case effect sizes. Using an adaptation of a likelihood-ratio test, this allows an assessment of whether disease heterogeneity corresponds to differences in disease pathology. I applied this method to a range of simulated and real datasets, gaining insight into the cause of heterogeneity in autoantibody positivity in type 1 diabetes (T1D). Finally, I investigated the relation of subtypes of juvenile idiopathic arthritis (JIA) to adult diseases, using modified genetic risk scores and linear discriminants in a penalised regression framework. The contribution of this thesis is a range of methodological developments in the comparative analysis of high-dimensional association studies. Such methods will have wide application in the analysis of GWAS and similar areas, particularly in the development of stratified medicine.
23

McIlhagga, William H. "penalized: A MATLAB toolbox for fitting generalized linear models with penalties." 2015. http://hdl.handle.net/10454/10882.

Full text
Abstract:
penalized is a flexible, extensible, and efficient MATLAB toolbox for penalized maximum likelihood. penalized allows you to fit a generalized linear model (Gaussian, logistic, Poisson, or multinomial) using any of ten provided penalties, or none. The toolbox can be extended by creating new maximum likelihood models or new penalties. The toolbox also includes routines for cross-validation and plotting.
24

Zeng, Yan. "A Study of Missing Data Imputation and Predictive Modeling of Strength Properties of Wood Composites." 2011. http://trace.tennessee.edu/utk_gradthes/1041.

Full text
Abstract:
Problem: Real-time process and destructive test data were collected from a wood composite manufacturer in the U.S. to develop real-time predictive models of two key strength properties, Modulus of Rupture (MOR) and Internal Bond (IB), of a wood composite manufacturing process. Sensor malfunctions and data send/retrieval problems led to null fields in the company’s data warehouse, which resulted in information loss. Many manufacturers attempt to build accurate predictive models by excluding entire records with null fields or by using summary statistics such as the mean or median in place of the null field. However, predictive model errors in validation may be higher in the presence of information loss. In addition, the selection of predictive modeling methods poses another challenge to many wood composite manufacturers. Approach: This thesis consists of two parts addressing the above issues: 1) how to improve data quality using missing data imputation; 2) which predictive modeling method is better in terms of prediction precision (measured by root mean square error, RMSE). The first part summarizes an application of missing data imputation methods in predictive modeling. After variable selection, two missing data imputation methods were selected after comparing six possible methods. Predictive models on imputed data were developed using partial least squares regression (PLSR) and compared with models on non-imputed data using ten-fold cross-validation; root mean square error of prediction (RMSEP) and normalized RMSEP (NRMSEP) were calculated. The second part presents a series of comparisons among four predictive modeling methods using imputed data without variable selection. Results: The first part concludes that the expectation-maximization (EM) algorithm and multiple imputation (MI) using Markov Chain Monte Carlo (MCMC) simulation achieved the most precise results. Predictive models based on imputed datasets generated more precise predictions (average NRMSEP of 5.8% for the MOR model and 7.2% for the IB model) than models on non-imputed datasets (average NRMSEP of 6.3% for MOR and 8.1% for IB). The second part finds that Bayesian Additive Regression Trees (BART) produced more precise predictions (average NRMSEP of 7.7% for the MOR model and 8.6% for the IB model) than the other three models: PLSR, LASSO, and adaptive LASSO.
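A simplified analogue of the first part — impute, fit PLSR, score by cross-validated RMSEP — can be written with scikit-learn, with IterativeImputer standing in for the EM/MCMC imputation used in the thesis; all data below are synthetic.

```python
# Impute missing process data, then fit PLS regression and score by
# cross-validated RMSEP. IterativeImputer is a stand-in for EM/MI.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(11)
n, p = 300, 20
X = rng.normal(size=(n, p))                 # real-time process variables
y = X[:, :5] @ np.ones(5) + rng.normal(size=n)  # e.g. MOR strength
X[rng.random(X.shape) < 0.1] = np.nan       # 10% of fields are null

pipe = make_pipeline(IterativeImputer(max_iter=10, random_state=0),
                     PLSRegression(n_components=5))
rmse = np.sqrt(-cross_val_score(pipe, X, y, cv=10,
                                scoring="neg_mean_squared_error"))
print(f"10-fold RMSEP: {rmse.mean():.3f}")
```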
25

Kunc, Vladimír. "Comparison of different models for forecasting of Czech electricity market." Master's thesis, 2017. http://www.nusl.cz/ntk/nusl-367836.

Full text
Abstract:
There is demand for decision support tools that can model electricity markets and forecast the hourly electricity price. Many different approaches, such as artificial neural networks or support vector regression, are used in the literature. This thesis provides a comparison of several different estimators under one setting using available data from the Czech electricity market. The resulting comparison of over 5000 different estimators led to a selection of several best performing models. The role of historical weather data (temperature, dew point and humidity) is also assessed within the comparison; it was found that while the inclusion of weather data might lead to overfitting, it is beneficial under the right circumstances. The best performing approach was Lasso regression estimated using modified LARS.
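The winning approach, a lasso estimated by (modified) LARS with a cross-validated penalty, is available off the shelf; the lagged-price and weather features below are hypothetical stand-ins for the thesis' feature set.

```python
# Lasso fit by the LARS algorithm with cross-validated penalty choice.
# Features are invented placeholders for the electricity-price predictors.
import numpy as np
from sklearn.linear_model import LassoLarsCV

rng = np.random.default_rng(12)
n = 1000
price_lag1 = rng.normal(50, 10, n)        # hourly price, lagged 1 hour
price_lag24 = rng.normal(50, 10, n)       # lagged 24 hours
temperature = rng.normal(10, 8, n)
X = np.column_stack([price_lag1, price_lag24, temperature])
y = (0.6 * price_lag1 + 0.3 * price_lag24 - 0.2 * temperature
     + rng.normal(scale=3, size=n))

model = LassoLarsCV(cv=10).fit(X, y)
print("coefficients:", np.round(model.coef_, 3))
```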
26

Mpfumali, Phathutshedzo. "Probabilistic solar power forecasting using partially linear additive quantile regression models: an application to South African data." Diss., 2019. http://hdl.handle.net/11602/1349.

Full text
Abstract:
MSc (Statistics)
Department of Statistics
This study discusses an application of partially linear additive quantile regression models in predicting medium-term global solar irradiance using data from the Tellerie radiometric station in South Africa for the period August 2009 to April 2010. Variables are selected using the least absolute shrinkage and selection operator (Lasso) via hierarchical interactions, and the parameters of the developed models are estimated using the Barrodale and Roberts algorithm. The best models are selected based on the Akaike information criterion (AIC), Bayesian information criterion (BIC), adjusted R squared (AdjR2) and generalised cross-validation (GCV). The accuracy of the forecasts is evaluated using the mean absolute error (MAE) and root mean square error (RMSE). To improve forecast accuracy, a convex forecast combination algorithm, in which the average loss suffered by the models is based on the pinball loss function, is used, along with a second forecast combination method, quantile regression averaging (QRA). The best set of forecasts is selected based on the prediction interval coverage probability (PICP), prediction interval normalised average width (PINAW) and prediction interval normalised average deviation (PINAD). The results show that QRA is the best model, since it produces more robust prediction intervals than the other models. The percentage improvement is calculated, and the results demonstrate that QRA over GAM with interactions yields a small improvement, whereas QRA over the convex forecast combination model yields a higher percentage improvement. A major contribution of this dissertation is the inclusion of a non-linear trend variable and the extension of forecast combination models to include QRA.
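A minimal sketch of the QRA step: regress the observations on the member forecasts under the pinball loss, one fit per quantile. The member forecasts below are synthetic placeholders.

```python
# Quantile regression averaging (QRA) sketch: combine member forecasts by
# quantile regression, evaluated with the pinball loss.
import numpy as np
from sklearn.linear_model import QuantileRegressor
from sklearn.metrics import mean_pinball_loss

rng = np.random.default_rng(13)
n = 500
truth = 600 + 100 * rng.normal(size=n)                 # solar irradiance
F = np.column_stack([truth + rng.normal(0, 60, n),     # model 1 forecasts
                     truth + rng.normal(20, 40, n)])   # model 2 (biased)

for tau in (0.05, 0.5, 0.95):
    qr = QuantileRegressor(quantile=tau, alpha=0.0).fit(F, truth)
    pred = qr.predict(F)
    print(f"tau={tau}: pinball loss = "
          f"{mean_pinball_loss(truth, pred, alpha=tau):.2f}")
```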
NRF
27

Martin, Jacqueline. "Modellierung des Unfallgeschehens im Radverkehr am Beispiel der Stadt Dresden." 2020. https://tud.qucosa.de/id/qucosa%3A73500.

Full text
Abstract:
Bicycle traffic volumes in Germany have grown in recent years, which is in turn reflected in an increase in accidents involving cyclists. To counteract the rising accident numbers, policymakers and associations recommend, above all, infrastructure measures. Against this background, this thesis uses the city of Dresden as a case study to examine how individual infrastructure characteristics affect accidents between bicycle traffic and motorized traffic. The data basis of the study comprises 548 accidents involving cyclists from the years 2015 to 2019, together with the characteristics of 484 intersection approaches. Since infrastructure does not determine accident occurrence on its own, measures of traffic volume are also included. To analyse the accident data, the random forest method and negative binomial regression in the form of 'accident prediction models', with prior variable selection via the LASSO method, are applied. Each method is applied to two specific accident types at intersections in order to obtain differentiated results. The first accident type, the 'turning accident', covers collisions between a right-turning road user and a road user travelling straight ahead in the same or the opposite direction, while the second accident type, the 'entering/crossing accident', covers collisions between a road user with the right of way and an entering or crossing road user who is obliged to give way. For the 'turning accident' type, the methods show, for example, that a cycle crossing coloured fully or partially red across the intersection, as well as indirect guidance of left-turning bicycle traffic instead of guidance in mixed traffic, leads to higher expected accident numbers; the latter appears irrelevant to the situation under investigation and thus points to a weakness in the variable inclusion. By contrast, for the 'entering/crossing accident' type, the methods estimate higher accident numbers, for example, as the number of straight-ahead lanes on an approach increases and when the intersection is controlled by traffic sign Z205 or a partial traffic signal rather than the right-before-left rule. Moreover, for both accident types the methods mostly show that the number of accidents rises less steeply beyond a certain traffic volume, a phenomenon known in the literature as the 'safety in numbers' effect. A comparison of model quality between the accident types further shows that both methods generate better predictions with their model of the 'turning accident' type than with their model of the 'entering/crossing accident' type.
Furthermore, for each accident type the model qualities differ only slightly between the two methods, so it can be assumed that both methods deliver qualitatively similar models of the respective accident type.
Contents: 1 Introduction 2 Literature review 2.1 Safety in numbers effect 2.2 Factors influencing bicycle accidents 3 Fundamentals of accident research 3.1 Accident categories 3.2 Accident types 4 Data basis 4.1 Accident data 4.2 Infrastructure characteristics 4.3 Overview of the variables used 5 Methodology 5.1 Correlation analysis 5.2 Random forest 5.2.1 Fundamentals 5.2.2 Random forest method 5.2.3 Model quality criteria 5.2.4 Variable importance 5.3 Negative binomial regression 5.3.1 Fundamentals 5.3.2 Accident prediction models 5.3.3 Variable selection 5.3.4 Model quality criteria 5.3.5 Variable importance 5.3.6 Model diagnostics 6 Implementation and results 6.1 Correlation analysis 6.2 Random forest 6.2.1 Model quality criteria 6.2.2 Variable importance 6.3 Negative binomial regression 6.3.1 Variable selection 6.3.2 Model quality criteria 6.3.3 Variable importance 6.3.4 Model diagnostics 6.4 Comparison of the two methods 6.4.1 Model quality criteria 6.4.2 Variable importance and recommendations for action 6.5 Comparison with findings from the literature 7 Critical appraisal 8 Summary and outlook
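The two-stage workflow described above (LASSO screening followed by a negative binomial accident prediction model) can be sketched as follows; the feature matrix, the log-count working scale used for the screening step, and all tuning choices are assumptions for illustration, not the thesis's actual variables.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Hypothetical data: accident counts per intersection approach and candidate features
# (e.g. number of lanes, signal control, red-coloured cycle crossing, traffic volume).
rng = np.random.default_rng(1)
X = rng.normal(size=(484, 10))
counts = rng.poisson(np.exp(0.3 * X[:, 0] - 0.2 * X[:, 1] + 1.0))

# Step 1: LASSO on a log-count working scale to screen the candidate variables.
Xs = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5).fit(Xs, np.log1p(counts))
selected = np.flatnonzero(lasso.coef_ != 0)

# Step 2: refit a negative binomial accident prediction model on the selected variables.
X_sel = sm.add_constant(X[:, selected])
nb_fit = sm.GLM(counts, X_sel, family=sm.families.NegativeBinomial()).fit()
print(nb_fit.summary())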
APA, Harvard, Vancouver, ISO, and other styles
28

Kusiak, Caroline. "Real-Time Dengue Forecasting In Thailand: A Comparison Of Penalized Regression Approaches Using Internet Search Data." 2018. https://scholarworks.umass.edu/masters_theses_2/708.

Full text
Abstract:
Dengue fever affects over 390 million people annually worldwide and is of particular concern in Southeast Asia, where it is one of the leading causes of hospitalization. Modeling trends in dengue occurrence can provide valuable information to public health officials; however, many challenges arise depending on the data available. In Thailand, reporting of dengue cases is often delayed by more than 6 weeks, and a small fraction of cases may not be reported until over 11 months after they occurred. This study shows that incorporating data on Google search trends can improve disease predictions in settings with severely underreported data. We compare penalized regression approaches to seasonal baseline models and illustrate that incorporating search data can reduce prediction error. This builds on previous research showing that search data and recent surveillance data together can be used to create accurate forecasts for diseases such as influenza and dengue fever. This work shows that even in settings where timely surveillance data are not available, using search data in real time can produce more accurate short-term forecasts than a seasonal baseline prediction. However, forecast accuracy degrades the further into the future the forecasts go. The relative accuracy of these forecasts compared to a seasonal average forecast varies by location. Overall, these data and models can improve short-term public health situational awareness and should be incorporated into larger real-time forecasting efforts.
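As a rough illustration of penalized regression on lagged surveillance and search predictors, here is a minimal sketch; the synthetic series, the lag length of 4, and the LassoCV tuning are assumptions, not the study's configuration.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Hypothetical weekly series: dengue case counts and a Google-search interest index.
rng = np.random.default_rng(2)
T = 300
search = rng.gamma(2.0, 10.0, T)
cases = 50 + 0.8 * np.roll(search, 2) + rng.normal(0, 5, T)

# Build lagged predictors (recent cases and recent search activity) for
# one-step-ahead forecasts of the case count.
lags = 4
rows = range(lags, T - 1)
X = np.array([np.r_[cases[t - lags:t], search[t - lags:t]] for t in rows])
y = cases[lags + 1:T]

model = LassoCV(cv=5).fit(X, y)
print(model.predict(X[-1:]))   # short-term forecast from the most recent lags
```

The penalty lets the model keep only the informative lags, which is the point of using lasso-type regression when search and surveillance predictors are numerous and collinear.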
APA, Harvard, Vancouver, ISO, and other styles
29

Thanyani, Maduvhahafani. "Forecasting hourly electricity demand in South Africa using machine learning models." Diss., 2020. http://hdl.handle.net/11602/1595.

Full text
Abstract:
MSc (Statistics)
Department of Statistics
Short-term load forecasting in South Africa using machine learning and statistical models is discussed in this study. The research focuses on a comparative analysis of models for forecasting hourly electricity demand, carried out using South Africa's aggregated hourly load data from Eskom. Support vector regression (SVR), stochastic gradient boosting (SGB) and artificial neural networks (NN) are compared, with a generalized additive model (GAM) as the benchmark. In both modelling frameworks, variable selection is done using the least absolute shrinkage and selection operator (Lasso). The SGB model yielded the lowest root mean square error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE) on the testing data, and likewise the lowest RMSE, MAE and MAPE on the training data. The models' forecasts are combined using a convex combination and quantile regression averaging (QRA). QRA was found to be the best forecast combination model based on the RMSE, MAE and MAPE.
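A convex forecast combination of the kind mentioned above can be fitted by constraining the weights to be nonnegative and to sum to one; the sketch below uses a squared-error criterion on synthetic forecasts and is only an assumed illustration of the idea, not the study's exact algorithm.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical hourly-load forecasts from three models (columns) and the observed load.
rng = np.random.default_rng(3)
F = rng.normal(30000, 1000, size=(500, 3))
y = F @ np.array([0.2, 0.5, 0.3]) + rng.normal(0, 300, 500)

# Convex combination: minimize average loss subject to w >= 0 and sum(w) = 1.
def loss(w):
    return np.mean((y - F @ w) ** 2)

cons = ({'type': 'eq', 'fun': lambda w: np.sum(w) - 1.0},)
res = minimize(loss, x0=np.full(3, 1 / 3), bounds=[(0, 1)] * 3, constraints=cons)
print(res.x)   # fitted combination weights
```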
NRF
APA, Harvard, Vancouver, ISO, and other styles
30

Chen, Szu-Cheng, and 陳思成. "Lasso Quantile Regression Model to Construct Asia and Taiwan Systemic Risk Measurement." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/4b2fy9.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Huang, Hsin-Hsiung, and 黃信雄. "Study on the Lasso Method for Variable Selection in Linear Regression Model with Mallows' Cp." Thesis, 2006. http://ndltd.ncl.edu.tw/handle/41127770529976845884.

Full text
Abstract:
Master's thesis
National Taiwan University
Graduate Institute of Mathematics
95 (ROC academic year, i.e. 2006)
When the number of predictors in a linear regression model is large, regularization is a commonly used method to reduce the complexity of the fitted model. LASSO (Tibshirani, 1996) is advocated as a useful regularization method for achieving sparsity, or parsimony, in the resulting fitted model. In this thesis, we study the operating characteristics of LASSO coupled with Mallows' Cp for identifying the orthonormal predictor variables of a linear regression when the number of predictors and the number of observations are of the same magnitude. These characteristics include the chosen number of predictors and the proportion of correctly identified predictors. This result can be useful in multiple testing.
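Under an orthonormal design the Lasso solution reduces to soft-thresholding of the least-squares coefficients, which makes the coupling with Mallows' Cp easy to sketch; the simulated design, noise level and lambda grid below are illustrative assumptions, not the thesis's experiments.

```python
import numpy as np

# Orthonormal design: X^T X = I, so the lasso solution is soft-thresholding of X^T y.
rng = np.random.default_rng(4)
n, p = 100, 40
X, _ = np.linalg.qr(rng.normal(size=(n, p)))           # orthonormal columns
beta = np.r_[np.ones(5), np.zeros(p - 5)]              # 5 true predictors
sigma = 1.0
y = X @ beta + rng.normal(0, sigma, n)

z = X.T @ y                                            # OLS coefficients under orthonormality
best = None
for lam in np.linspace(0.0, 3.0, 61):
    b = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)  # soft-thresholding at lambda
    rss = np.sum((y - X @ b) ** 2)
    k = np.count_nonzero(b)                            # degrees-of-freedom estimate for the lasso
    cp = rss / sigma**2 - n + 2 * k                    # Mallows' Cp
    if best is None or cp < best[0]:
        best = (cp, lam, k)
print(best)   # (Cp, chosen lambda, number of selected predictors)
```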
APA, Harvard, Vancouver, ISO, and other styles
32

Hsin-Hsiung, Huang. "Study on the Lasso Method for Variable Selection in Linear Regression Model with Mallows' Cp." 2006. http://www.cetd.com.tw/ec/thesisdetail.aspx?etdun=U0001-0701200722590000.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

He, Zangdong. "Variable selection and structural discovery in joint models of longitudinal and survival data." Thesis, 2014. http://hdl.handle.net/1805/6365.

Full text
Abstract:
Indiana University-Purdue University Indianapolis (IUPUI)
Joint models of longitudinal and survival outcomes have been used with increasing frequency in clinical investigations. Correct specification of the fixed and random effects, as well as their functional forms, is essential for practical data analysis. However, no existing methods have been developed to meet this need in a joint-model setting. In this dissertation, I describe a penalized likelihood-based method with adaptive least absolute shrinkage and selection operator (ALASSO) penalty functions for model selection. By reparameterizing the variance components through a Cholesky decomposition, I introduce a penalty function for group shrinkage; the penalized likelihood is approximated by Gaussian quadrature and optimized by an EM algorithm. The functional forms of the independent effects are determined through a procedure for structural discovery. Specifically, I first construct the model with penalized cubic B-splines and then decompose the B-spline into linear and nonlinear elements by spectral decomposition. The decomposition represents the model in a mixed-effects format, and I then use the mixed-effects variable selection method to perform structural discovery. Simulation studies show excellent performance. A clinical application illustrates the use of the proposed methods, and the analytical results demonstrate their usefulness.
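The ALASSO penalty itself is easy to illustrate outside the joint-model setting: initial estimates define coefficient-specific weights, and the weighted L1 problem is solved by rescaling the columns. The sketch below shows this in a plain linear model; it is not the dissertation's EM algorithm for joint longitudinal-survival likelihoods, and the data and the choice gamma = 1 are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV

# Hypothetical data for the penalty illustration.
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 12))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(0, 1, 200)

# Step 1: initial (unpenalized) estimates give the adaptive weights w_j = 1/|b_j|^gamma.
b_init = LinearRegression().fit(X, y).coef_
gamma = 1.0
w = 1.0 / np.abs(b_init) ** gamma

# Step 2: the weighted L1 problem min ||y - Xb||^2 + lam * sum(w_j |b_j|) is equivalent
# to an ordinary lasso on rescaled columns X_j / w_j; undo the scaling afterwards.
X_tilde = X / w
fit = LassoCV(cv=5).fit(X_tilde, y)
beta_alasso = fit.coef_ / w
print(np.flatnonzero(beta_alasso != 0))   # indices of the selected variables
```

The rescaling trick is why ALASSO retains the lasso's computational machinery while penalizing small initial coefficients more heavily, which is what yields its oracle selection properties.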
APA, Harvard, Vancouver, ISO, and other styles
34

Dobreva, Maria Lubomirova. "Data-driven evaluation of real estate liquidity : predicting days on market to optimize the sales strategy of a startup." Master's thesis, 2019. http://hdl.handle.net/10362/91224.

Full text
Abstract:
Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies Management
This is a research project applying data mining techniques to real estate data in cooperation with Homeheed, a startup in the real estate sector that provides a platform solution as a single source of truth in Sofia, Bulgaria. The project develops a predictive model using LASSO regression to estimate days on market. The findings are expected to help the startup identify more attractive listings and so support a faster return on investment. Additionally, the paper provides an experimental part in which misleading and fake listings are targeted, in order to support the detection of fraud and of a listing's real availability. The project's main objective, and its underlying assumption, is that advanced statistics and information management can create a synergy between data and business models that enhances both the market entry strategy and the quality of service.
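A days-on-market predictor of the kind proposed could be prototyped as below; the listing features, district names and synthetic data are hypothetical placeholders, not Homeheed's actual schema.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical listing data; real features would come from the platform's database.
rng = np.random.default_rng(6)
n = 60
df = pd.DataFrame({
    'price_eur': rng.normal(110000, 25000, n),
    'area_m2': rng.normal(75, 15, n),
    'district': rng.choice(['Lozenets', 'Mladost', 'Lyulin'], n),
})
df['days_on_market'] = (0.0004 * df['price_eur'] - 0.3 * df['area_m2']
                        + rng.normal(0, 5, n)).round().clip(lower=1)

# Scale numeric features, one-hot encode the district, and fit a cross-validated lasso.
pre = ColumnTransformer([
    ('num', StandardScaler(), ['price_eur', 'area_m2']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['district']),
])
model = Pipeline([('prep', pre), ('lasso', LassoCV(cv=5))])
model.fit(df.drop(columns='days_on_market'), df['days_on_market'])
print(model.predict(df.drop(columns='days_on_market'))[:5])
```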
APA, Harvard, Vancouver, ISO, and other styles