Dissertations / Theses on the topic 'Bayes False discovery rate'

Consult the top 47 dissertations / theses for your research on the topic 'Bayes False discovery rate.'


1

Di Brisco, Agnese Maria. "Statistical Network Analysis: a Multiple Testing Approach." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2015. http://hdl.handle.net/10281/96090.

Full text
Abstract:
The problem of identifying connections between nodes in a network model is of fundamental importance in the analysis of brain networks, because each node represents a specific brain region that can potentially be connected to other brain regions through functional relations; the dynamical behavior of each node can be quantified by adopting a correlation measure among time series. In this context, the whole set of links between nodes in a network can be represented by a high-dimensional adjacency matrix, which can be obtained by performing a huge number of simultaneous tests on correlations. In this regard, the Thesis has dealt with the problem of multiple testing from a Bayesian perspective, by examining in depth the “Bayesian False Discovery Rate” (FDR), already defined by Efron, and by introducing the “Bayesian Power” (BP). The behavior of the FDR and BP estimators has been analyzed both with asymptotic theory and with Monte Carlo simulations; furthermore, the robustness of the proposed estimators has been investigated by simulating specific patterns of dependence among the p-values associated with the multiple comparisons. Such a multiple testing approach, which allows control of both FDR and BP, has been applied to a dataset provided by the Milan Center for Neuroscience (NeuroMi). After selecting a sample of 70 participants, properly classified into young and elderly subjects, subject-by-subject network models were constructed in order to verify two alternative theories about age-related changes in the pattern of functional connectivity, namely the de-differentiation hypothesis versus the localization hypothesis. This objective was achieved by selecting proper network measures to verify the original hypotheses about the pattern of functional connectivity in the elderly group and in the group of young subjects, and by constructing some ad-hoc measures.
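For reference, the tail-area Bayesian FDR that Efron defines, and that this abstract builds on, has the standard two-group form below; the notation is the generic one from that literature, not taken from the thesis itself.

    \[
    \mathrm{Fdr}(\mathcal{R}) \;=\; \Pr(\text{null link} \mid Z \in \mathcal{R})
    \;=\; \frac{\pi_0\, F_0(\mathcal{R})}{\pi_0\, F_0(\mathcal{R}) + (1 - \pi_0)\, F_1(\mathcal{R})},
    \]

where \(\pi_0\) is the prior probability that a link is absent (the null), \(F_0\) and \(F_1\) are the null and non-null probabilities of the rejection region \(\mathcal{R}\), and \(Z\) is the statistic of a single correlation test; estimating this quantity at the chosen cutoff gives the Bayesian FDR of the resulting adjacency matrix.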
APA, Harvard, Vancouver, ISO, and other styles
2

Rahal, Abbas. "Bayesian Methods Under Unknown Prior Distributions with Applications to The Analysis of Gene Expression Data." Thesis, Université d'Ottawa / University of Ottawa, 2021. http://hdl.handle.net/10393/42408.

Full text
Abstract:
The local false discovery rate (LFDR) is one of many existing statistical methods for analyzing multiple hypothesis testing. As a Bayesian quantity, the LFDR is based on the prior probability of the null hypothesis and a mixture distribution of the null and non-null hypotheses. In practice, the LFDR is unknown and needs to be estimated. The empirical Bayes approach can be used to estimate that mixture distribution. Empirical Bayes does not require complete information about the prior and hyperprior distributions, as hierarchical Bayes does. When we do not have enough information at the prior level, instead of placing a distribution at the hyperprior level as in the hierarchical Bayes model, empirical Bayes estimates the prior parameters from the data, often via the marginal distribution. In this research, we developed new Bayesian methods under unknown prior distributions. A set of adequate prior distributions may be defined using Bayesian model checking by setting a threshold on the posterior predictive p-value, prior predictive p-value, calibrated p-value, Bayes factor, or integrated likelihood. We derive a set of adequate posterior distributions from that set. In order to obtain a single posterior distribution instead of a set of adequate posterior distributions, we used a blended distribution, which minimizes the relative entropy of a set of adequate prior (or posterior) distributions to a "benchmark" prior (or posterior) distribution. We present two approaches to generating a blended posterior distribution, namely updating-before-blending and blending-before-updating. The blended posterior distribution can be used to estimate the LFDR by considering the nonlocal false discovery rate as a benchmark and the different LFDR estimators as an adequate set. The likelihood ratio can often be misleading in multiple testing unless it is supplemented by adjusted p-values or posterior probabilities based on sufficiently strong prior distributions. In the case of unknown prior distributions, they can be estimated by empirical Bayes methods or blended distributions. We propose a general framework for applying the laws of likelihood to problems involving multiple hypotheses by bringing together multiple statistical models. We have applied the proposed framework to data sets from genomics, COVID-19 and other areas.
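For orientation, the LFDR referred to here is usually written in the standard two-group mixture form below; the symbols are the generic ones from that literature rather than the thesis's own notation.

    \[
    f(z) = \pi_0 f_0(z) + (1 - \pi_0) f_1(z), \qquad
    \mathrm{LFDR}(z) = \Pr(\text{null} \mid Z = z) = \frac{\pi_0 f_0(z)}{f(z)},
    \]

where \(\pi_0\) is the prior probability of a true null and \(f_0\), \(f_1\) are the null and non-null densities of the test statistic; empirical Bayes estimation replaces \(\pi_0\) and \(f\) with estimates obtained from the full collection of tests.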
APA, Harvard, Vancouver, ISO, and other styles
3

Liu, Fang. "New Results on the False Discovery Rate." Diss., Temple University Libraries, 2010. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/96718.

Full text
Abstract:
Statistics
Ph.D.
The false discovery rate (FDR) introduced by Benjamini and Hochberg (1995) is perhaps the most standard error controlling measure being used in a wide variety of applications involving multiple hypothesis testing. There are two approaches to controlling the FDR: the fixed error rate approach of Benjamini and Hochberg (BH, 1995), where a rejection region is determined so that the FDR stays below a fixed level, and the estimation-based approach of Storey (2002), where the FDR is estimated for a fixed rejection region before it is controlled. In this proposal, we concentrate on both approaches and propose new, improved versions of some FDR controlling procedures available in the literature. A number of adaptive procedures have been put forward in the literature, each attempting to improve the method of Benjamini and Hochberg (1995), the BH method, by incorporating into it an estimate of the number of true null hypotheses. Among these, the method of Benjamini, Krieger and Yekutieli (2006), the BKY method, has been receiving a lot of attention recently. In this proposal, a variant of the BKY method is proposed by considering a different estimate of the number of true null hypotheses, which often outperforms the BKY method in terms of FDR control and power. Storey's (2002) estimation-based approach to controlling the FDR has been developed from a class of conservatively biased point estimates of the FDR under a mixture model for the underlying p-values and a fixed rejection threshold for each null hypothesis. An alternative class of point estimates of the FDR with uniformly smaller conservative bias is proposed under the same setup. Numerical evidence is provided to show that the mean squared error (MSE) is also often smaller for this new class of estimates. Compared to Storey's (2002), the present class provides a more powerful estimation-based approach to controlling the FDR.
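As context for the two baseline approaches the abstract starts from, the sketch below implements the BH (1995) step-up rule and a Storey (2002)-type plug-in FDR estimate for a fixed rejection threshold. It is a minimal NumPy illustration, not the improved variants proposed in the thesis; the tuning parameter lam and the helper names are choices made here.

    import numpy as np

    def bh_rejections(pvals, alpha=0.05):
        """Benjamini-Hochberg step-up: reject H_(1), ..., H_(k), where
        k = max{i : p_(i) <= i * alpha / m}."""
        p = np.asarray(pvals)
        m = len(p)
        order = np.argsort(p)
        sorted_p = p[order]
        thresholds = alpha * np.arange(1, m + 1) / m
        below = np.nonzero(sorted_p <= thresholds)[0]
        reject = np.zeros(m, dtype=bool)
        if below.size:
            k = below.max()               # largest index meeting the criterion
            reject[order[:k + 1]] = True
        return reject

    def storey_fdr_estimate(pvals, t, lam=0.5):
        """Storey (2002)-type estimate of the FDR for the fixed rejection
        region {p <= t}: pi0_hat * m * t / max(#{p <= t}, 1)."""
        p = np.asarray(pvals)
        m = len(p)
        pi0_hat = np.sum(p > lam) / ((1.0 - lam) * m)   # conservative pi0 estimate
        r = max(np.sum(p <= t), 1)
        return min(pi0_hat * m * t / r, 1.0)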
Temple University--Theses
APA, Harvard, Vancouver, ISO, and other styles
4

Miller, Ryan. "Marginal false discovery rate approaches to inference on penalized regression models." Diss., University of Iowa, 2018. https://ir.uiowa.edu/etd/6474.

Full text
Abstract:
Data containing a large number of variables are becoming increasingly common, and sparsity-inducing penalized regression methods, such as the lasso, have become a popular analysis tool for these datasets due to their ability to naturally perform variable selection. However, quantifying the importance of the variables selected by these models is a difficult task. These difficulties are compounded by the tendency of the most predictive models, for example those chosen using procedures like cross-validation, to include substantial numbers of noise variables with no real relationship with the outcome. To address the task of performing inference on penalized regression models, this thesis proposes false discovery rate approaches for a broad class of penalized regression models. This work includes the development of an upper bound for the number of noise variables in a model, as well as local false discovery rate approaches that quantify the likelihood of each individual selection being a false discovery. These methods are applicable to a wide range of penalties, such as the lasso, elastic net, SCAD, and MCP; a wide range of models, including linear regression, generalized linear models, and Cox proportional hazards models; and are also extended to the group regression setting under the group lasso penalty. In addition to studying these methods using numerous simulation studies, the practical utility of these methods is demonstrated using real data from several high-dimensional genome-wide association studies.
APA, Harvard, Vancouver, ISO, and other styles
5

Wong, Adrian Kwok-Hang. "False discovery rate controller for functional brain parcellation using resting-state fMRI." Thesis, University of British Columbia, 2016. http://hdl.handle.net/2429/58332.

Full text
Abstract:
Parcellation of brain imaging data is desired for proper neurological interpretation of resting-state functional magnetic resonance imaging (rs-fMRI) data. Some methods require specifying a number of parcels and using model selection to determine the number of parcels with rs-fMRI data. However, this generalization does not fit all subjects in a given dataset. A method has been proposed using parametric formulas for the distribution of modularity in random networks to determine the statistical significance between parcels. In this thesis, we propose an agglomerative clustering algorithm using parametric formulas for the distribution of modularity in random networks, coupled with a false discovery rate (FDR) controller, to parcellate rs-fMRI data. The proposed method controls the FDR to reduce the number of false positives and incorporates spatial information to ensure the regions are spatially contiguous. Simulations demonstrate that our proposed FDR-controlled agglomerative clustering algorithm yields more accurate results when compared with existing methods. We applied our proposed method to a rs-fMRI dataset and found that it obtained higher reproducibility compared to the Ward hierarchical clustering method. Lastly, we compared the normalized total connectivity degree of each region within the motor network between normal subjects and Parkinson’s disease (PD) subjects using sub-regions defined by our proposed method and the entire region. We found that PD subjects without medication had a significant increase in functional connectivity compared to normal subjects in the right primary motor cortex using our sub-regions within the right primary motor cortex, whereas this significant increase was not found using the entire right primary motor cortex. These sub-regions are of great interest in studying the differences in functional connectivity between different neurological diseases, which can be used as biomarkers and may provide insight into the severity of the disease.
Applied Science, Faculty of
Graduate
APA, Harvard, Vancouver, ISO, and other styles
6

Kubat, Jamie. "Comparing Dunnett's Test with the False Discovery Rate Method: A Simulation Study." Thesis, North Dakota State University, 2013. https://hdl.handle.net/10365/27025.

Full text
Abstract:
Recently, the idea of multiple comparisons has been criticized because of its lack of power in datasets with a large number of treatments. Many family-wise error corrections are far too restrictive when large quantities of comparisons are being made. At the other extreme, a test like the least significant difference does not control the family-wise error rate, and therefore is not restrictive enough to identify true differences. A solution lies in multiple testing. The false discovery rate (FDR) method uses a simple algorithm and can be applied to datasets with many treatments. The current research compares the FDR method to Dunnett's test using agronomic data from a study with 196 varieties of dry beans. Simulated data are used to assess the type I error and power of the tests. In general, the FDR method provides higher power than Dunnett's test while maintaining control of the type I error rate.
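The sketch below is a minimal simulation loosely mirroring the kind of comparison described here: Dunnett's many-to-one test versus Benjamini-Hochberg FDR applied to ordinary t-tests against the control. It is not the thesis's simulation design; it assumes SciPy >= 1.11 (for scipy.stats.dunnett) and statsmodels, and the group sizes, shifts and seed are arbitrary choices.

    import numpy as np
    from scipy import stats
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(1)
    n_treat, n_rep = 50, 5                       # many treatments, few replicates
    control = rng.normal(0.0, 1.0, size=n_rep)
    # first 5 treatments truly differ from the control, the rest do not
    shifts = np.concatenate([np.full(5, 1.5), np.zeros(n_treat - 5)])
    treatments = [rng.normal(mu, 1.0, size=n_rep) for mu in shifts]

    # Dunnett's many-to-one comparison (SciPy >= 1.11)
    dunnett_sig = stats.dunnett(*treatments, control=control).pvalue < 0.05

    # Benjamini-Hochberg FDR applied to two-sample t-tests versus the control
    raw_p = np.array([stats.ttest_ind(trt, control).pvalue for trt in treatments])
    fdr_sig = multipletests(raw_p, alpha=0.05, method='fdr_bh')[0]

    print("Dunnett rejections:", dunnett_sig.sum(), "| FDR rejections:", fdr_sig.sum())

Repeating such a simulation over many replicates and tallying false and true rejections is the usual way to compare the type I error and power of the two procedures.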
APA, Harvard, Vancouver, ISO, and other styles
7

Guo, Ruijuan. "Sample comparisons using microarrays -- application of false discovery rate and quadratic logistic regression." Worcester, Mass. : Worcester Polytechnic Institute, 2007. http://www.wpi.edu/Pubs/ETD/Available/etd-010808-173747/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Guo, Ruijuan. "Sample comparisons using microarrays: Application of False Discovery Rate and quadratic logistic regression." Digital WPI, 2008. https://digitalcommons.wpi.edu/etd-theses/28.

Full text
Abstract:
In microarray analysis, interest lies in the features that behave differently in diseased samples compared to normal samples. The usual p-value method of selecting significant genes either gives too many false positives or cannot detect all the significant features. The False Discovery Rate (FDR) method controls false positives while at the same time selecting significant features. We introduce Benjamini's method and Storey's method to control the FDR and apply the two methods to human Meningioma data. We found that Benjamini's method is more conservative and that, after the number of tests exceeds a threshold, an increase in the number of tests leads to a decrease in the number of significant genes. In the second chapter, we investigate ways to find interesting gene expression patterns that cannot be detected by linear models such as the t-test or ANOVA. We propose a novel approach that uses quadratic logistic regression to detect genes in the Meningioma data that have a non-linear relationship with phenotype. By using quadratic logistic regression, we can find genes whose expression correlates with their phenotypes both linearly and quadratically. Whether these genes are clinically significant is a very interesting question, since such genes would most likely be neglected by a traditional linear approach.
APA, Harvard, Vancouver, ISO, and other styles
9

Dalmasso, Cyril. "Estimation du positive False Discovery Rate dans le cadre d'études comparatives en génomique." Paris 11, 2006. http://www.theses.fr/2006PA11T015.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Liley, Albert James. "Statistical co-analysis of high-dimensional association studies." Thesis, University of Cambridge, 2017. https://www.repository.cam.ac.uk/handle/1810/270628.

Full text
Abstract:
Modern medical practice and science involve complex phenotypic definitions. Understanding patterns of association across this range of phenotypes requires co-analysis of high-dimensional association studies in order to characterise shared and distinct elements. In this thesis I address several problems in this area, with a general linking aim of making more efficient use of available data. The main application of these methods is in the analysis of genome-wide association studies (GWAS) and similar studies. Firstly, I developed methodology for a Bayesian conditional false discovery rate (cFDR) for leveraging GWAS results using summary statistics from a related disease. I extended an existing method to enable a shared control design, increasing power and applicability, and developed an approximate bound on false-discovery rate (FDR) for the procedure. Using the new method I identified several new variant-disease associations. I then developed a second application of shared control design in the context of study replication, enabling improvement in power at the cost of changing the spectrum of sensitivity to systematic errors in study cohorts. This has application in studies on rare diseases or in between-case analyses. I then developed a method for partially characterising heterogeneity within a disease by modelling the bivariate distribution of case-control and within-case effect sizes. Using an adaptation of a likelihood-ratio test, this allows an assessment to be made of whether disease heterogeneity corresponds to differences in disease pathology. I applied this method to a range of simulated and real datasets, enabling insight into the cause of heterogeneity in autoantibody positivity in type 1 diabetes (T1D). Finally, I investigated the relation of subtypes of juvenile idiopathic arthritis (JIA) to adult diseases, using modified genetic risk scores and linear discriminants in a penalised regression framework. The contribution of this thesis is a range of methodological developments in the co-analysis of high-dimensional association studies. Methods such as these will have wide application in the analysis of GWAS and similar areas, particularly in the development of stratified medicine.
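For readers unfamiliar with the conditional FDR, one common empirical estimator used in the cFDR literature is shown below; the Bayesian extension and the FDR bound developed in this thesis differ in detail, and the notation here is generic.

    \[
    \widehat{\mathrm{cFDR}}(p_i \mid p_j) \;=\;
    \frac{p_i \,\#\{k : p_{j,k} \le p_j\}}{\#\{k : p_{i,k} \le p_i \text{ and } p_{j,k} \le p_j\}},
    \]

which estimates \(\Pr(H_0^{(i)} \mid P_i \le p_i,\, P_j \le p_j)\): the probability that variant \(i\) is null for the principal phenotype given that its p-values are small for both the principal and the conditioning trait.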
APA, Harvard, Vancouver, ISO, and other styles
11

Benditkis, Julia [Verfasser], Arnold [Akademischer Betreuer] Janssen, and Helmut [Akademischer Betreuer] Finner. "Martingale Methods for Control of False Discovery Rate and Expected Number of False Rejections / Julia Benditkis. Gutachter: Arnold Janssen ; Helmut Finner." Düsseldorf : Universitäts- und Landesbibliothek der Heinrich-Heine-Universität Düsseldorf, 2015. http://d-nb.info/1077295170/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

Iyer, Vishwanath. "An adaptive single-step FDR controlling procedure." Diss., Temple University Libraries, 2010. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/75410.

Full text
Abstract:
Statistics
Ph.D.
This research is focused on identifying a single-step procedure that, upon adapting to the data through estimating the unknown parameters, would asymptotically control the False Discovery Rate when testing a large number of hypotheses simultaneously, and exploring some of the characteristics of this procedure.
Temple University--Theses
APA, Harvard, Vancouver, ISO, and other styles
13

Gomez, Kayeromi Donoukounmahou. "A Comparison of False Discovery Rate Method and Dunnett's Test for a Large Number of Treatments." Diss., North Dakota State University, 2015. http://hdl.handle.net/10365/24842.

Full text
Abstract:
It has become quite common nowadays to perform multiple tests simultaneously in order to detect differences in a certain trait among groups. This often leads to an inflated probability of at least one Type I Error, a rejection of a null hypothesis when it is in fact true. This inflation generally leads to a loss of power of the test, especially in multiple testing and multiple comparisons. The aim of the research is to use simulation to address what a researcher should do to determine which treatments are significantly different from the control when there is a large number of treatments and the number of replicates in each treatment is small. We examine two situations in this simulation study: when the number of replicates per treatment is 3 and when it is 5; in each of these situations, we simulated from a normal distribution and from a mixture of normal distributions. The total number of simulated treatments was progressively increased from 50 to 100, then 150, and finally 300. The goal is to measure the change in the performance of the False Discovery Rate method and Dunnett’s test in terms of type I error and power as the total number of treatments increases. We report two ways of examining type I error and power: first, we look at the performance of the two tests across all comparisons in our simulation study, and second, per simulated sample. In the first assessment, the False Discovery Rate method appears to have higher power while keeping its type I error in the same neighborhood as Dunnett’s test; in the latter, both tests have similar power and the False Discovery Rate method has a higher type I error. Overall, the results show that when the objective of the researcher is to detect as many of the differences as possible, the FDR method is preferred. However, if type I error is more detrimental to the outcomes of the research, Dunnett’s test offers a better alternative.
APA, Harvard, Vancouver, ISO, and other styles
14

ALHARBI, YOUSEF S. "RECOVERING SPARSE DIFFERENCES BETWEEN TWO HIGH-DIMENSIONAL COVARIANCE MATRICES." Kent State University / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=kent1500392318023941.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Jesus, Marcelo de. "Falso positivo na performance dos fundos de investimento com gestão ativa no Brasil: mensurando sorte dos gestores nos alfas estimados." Universidade Presbiteriana Mackenzie, 2011. http://tede.mackenzie.br/jspui/handle/tede/770.

Full text
Abstract:
This study investigates, for the period between 2002 and 2009, the impact of luck on the performance of managers of actively managed stock mutual funds in Brazil that surpass their benchmark. To that end, a new method, the False Discovery Rate (FDR) approach, was used to test that impact empirically. To measure luck and unluck precisely, i.e., the frequency of false positives (Type I errors) in the tails of the cross-section of the t-distribution associated with the alphas of the funds in the sample, this new approach was applied to measure, in aggregate, the skill of managers of actively managed stock funds in Brazil. The FDR approach offers a simple and objective method to estimate the proportion of skilled funds (with a positive alpha), alpha-zero funds, and unskilled funds (with a negative alpha) across the population. Applying the FDR technique, it was found that the majority of funds were alpha-zero, followed by truly unskilled funds, with only a small proportion of truly skilled funds.
APA, Harvard, Vancouver, ISO, and other styles
16

Abbas, Aghababazadeh Farnoosh. "Estimating the Local False Discovery Rate via a Bootstrap Solution to the Reference Class Problem: Application to Genetic Association Data." Thesis, Université d'Ottawa / University of Ottawa, 2015. http://hdl.handle.net/10393/33367.

Full text
Abstract:
Modern scientific technology such as microarrays, imaging devices, genome-wide association studies or social science surveys provides statisticians with hundreds or even thousands of tests to consider simultaneously. Testing many thousands of null hypotheses may increase the number of Type I errors. In large-scale hypothesis testing, researchers can use different statistical techniques such as family-wise error rates, false discovery rates, permutation methods, or the local false discovery rate, where all available data usually should be analyzed together. In applications, the thousands of tests are related by a scientifically meaningful structure. Ignoring that structure can be misleading as it may increase the number of false positives and false negatives. As an example, in genome-wide association studies each test corresponds to a specific genetic marker. In such a case, the scientific structure for each genetic marker can be its minor allele frequency. In this research, the local false discovery rate is considered as a relevant statistical approach to analyze the thousands of tests together. We present a model for multiple hypothesis testing in which the scientific structure of each test is incorporated as a covariate. The purpose of this model is to incorporate the covariate to improve the performance of testing procedures. The method we consider yields different estimates depending on a tuning parameter. We estimate the optimal value of that parameter from the observed statistics. Thus, among those estimators, the one which minimizes the estimated errors due to bias and variance is chosen by applying the bootstrap approach. Such an estimation method is called an adaptive reference class method. Under the combined reference class method, the effect of the covariates is ignored and all null hypotheses are analyzed together. In this research, under some assumptions on the covariates and the prior probabilities, the proposed adaptive reference class method shows smaller error than the combined reference class method in estimating the local false discovery rate when the number of tests gets large. We apply the adaptive reference class method to coronary artery disease data, and we use simulated data to evaluate the performance of the estimator associated with the adaptive reference class method.
APA, Harvard, Vancouver, ISO, and other styles
17

Bancroft, Timothy J. "Estimating the number of true null hypotheses and the false discovery rate from multiple discrete non-uniform permutation p-values." [Ames, Iowa : Iowa State University], 2009. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3389284.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Qian, Yi. "Topics in multiple hypotheses testing." Texas A&M University, 2005. http://hdl.handle.net/1969.1/4754.

Full text
Abstract:
It is common to test many hypotheses simultaneously in the application of statistics. The probability of making a false discovery grows with the number of statistical tests performed. When all the null hypotheses are true, and the test statistics are independent and continuous, the error rates from the family-wise error rate (FWER)- and the false discovery rate (FDR)-controlling procedures are equal to the nominal level. When some of the null hypotheses are not true, both procedures are conservative. In the first part of this study, we review the background of the problem and propose methods to estimate the number of true null hypotheses. The estimates can be used in FWER- and FDR-controlling procedures with a consequent increase in power. We conduct simulation studies and apply the estimation methods to data sets with biological or clinical significance. In the second part of the study, we propose a mixture model approach for the analysis of ChIP-chip high density oligonucleotide array data to study the interactions between proteins and DNA. If we could identify the specific locations where proteins interact with DNA, we could increase our understanding of many important cellular events. Most experiments to date are performed in culture on cell lines, bacteria, or yeast, and future experiments will include those in developing tissues, organs, or cancer biopsies; they are critical in understanding the function of genes and proteins. Here we investigate the ChIP-chip data structure and use a beta-mixture model to help identify the binding sites. To determine the appropriate number of components in the mixture model, we suggest Anderson-Darling testing. Our study indicates that it is a reasonable means of choosing the number of components in a beta-mixture model. The mixture model procedure has broad applications in biology and is illustrated with several data sets from bioinformatics experiments.
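As a pointer to what estimating the number of true null hypotheses typically looks like, the sketch below implements a standard lambda-based (Storey-type) estimator of m0 that can then be plugged into an adaptive FWER- or FDR-controlling procedure. It is a generic illustration, not the specific estimators proposed in this thesis; the function name and the default lam=0.5 are choices made here.

    import numpy as np

    def estimate_m0(pvals, lam=0.5):
        """Storey-type estimate of the number of true null hypotheses:
        p-values above lam are assumed to come mostly from true nulls,
        which are uniformly distributed on (lam, 1]."""
        p = np.asarray(pvals)
        m = len(p)
        pi0_hat = np.sum(p > lam) / ((1.0 - lam) * m)
        return min(int(np.ceil(pi0_hat * m)), m)

    # Adaptive use: run BH at the inflated level alpha * m / estimate_m0(pvals)
    # to recover the power lost when many nulls are false.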
APA, Harvard, Vancouver, ISO, and other styles
19

Clements, Nicolle. "Multiple Testing in Grouped Dependent Data." Diss., Temple University Libraries, 2013. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/253695.

Full text
Abstract:
Statistics
Ph.D.
This dissertation is focused on multiple testing procedures to be used with data that are naturally grouped or possess a spatial structure. We propose a 'Two-Stage' procedure to control the False Discovery Rate (FDR) in situations where one-sided hypothesis testing is appropriate, such as astronomical source detection. Similarly, we propose a 'Three-Stage' procedure to control the mixed directional False Discovery Rate (mdFDR) in situations where two-sided hypothesis testing is appropriate, such as vegetation monitoring in remote sensing NDVI data. The Two- and Three-Stage procedures have provable FDR/mdFDR control under certain dependence situations. We also present adaptive versions, which are examined in simulation studies. The 'Stages' refer to testing hypotheses both group-wise and individually, motivated by the belief that the dependencies among the p-values associated with the spatially oriented hypotheses occur more locally than globally. Thus, these 'Staged' procedures test hypotheses in groups that incorporate the local, unknown dependencies of neighboring p-values. If a group is found significant, the individual p-values within that group are investigated further. For the vegetation monitoring data, we extend the investigation by providing spatio-temporal models and forecasts for regions where significant change was detected through the multiple testing procedure.
Temple University--Theses
APA, Harvard, Vancouver, ISO, and other styles
20

SALA, SARA. "Statistical analysis of brain network." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2013. http://hdl.handle.net/10281/43723.

Full text
Abstract:
Recent developments in complex network analysis, based largely on graph theory, have been used to study brain network organization. The brain is a complex system that can be represented by a graph. A graph is a mathematical representation which can be useful to study the connectivity of the brain. Nodes in the brain can be identified by dividing its volume into regions of interest, and links can be identified by calculating a measure of dependence between pairs of regions whose activation signal, measured by functional magnetic resonance imaging (fMRI) techniques, represents the strength of the connection between regions. A graph can be synthesized by the so-called adjacency matrix, which, in its simplest form, is an undirected, binary, and symmetric matrix whose entries are set to one if a link exists between a pair of brain areas and zero otherwise. The adjacency matrix is particularly useful because it allows the calculation of several measures which summarize global and local characteristics of functional brain connectivity, such as centrality, efficiency, density and the small-worldness property. In this work, we consider global measures, such as the clustering coefficient, the characteristic path length and the global efficiency, and local measures, such as centrality measures and local efficiency, in order to represent global and local dynamics and changes between networks. This is achieved by studying resting state (rs) fMRI data of healthy subjects and patients with neurodegenerative diseases. Furthermore, we illustrate an original methodology to construct the adjacency matrix. Its entries, containing the information about the existence of links, are identified by testing the correlation between the time series that characterize the dynamic behavior of the nodes. This involves the problem of multiple comparisons in order to control the error rates. The method based on the estimation of the positive false discovery rate (pFDR) has been used. A similar measure involving false negatives (type II errors), called the positive false nondiscovery rate (pFNR), is then considered, proposing new point and interval estimators for pFNR and a method for balancing the two types of error. This approach is demonstrated using both simulations and fMRI data, providing finite sample as well as large sample results for the pFDR and pFNR estimators. In addition, a ranking of the most central nodes in the networks is proposed using q-values, the pFDR analog of p-values. The differences in inter-regional connectivity between cases and controls are studied. Finally, network models are discussed. In order to gain deeper insights into the complex neurobiological interactions, exponential random graph models (ERGMs) are applied to assess several network properties simultaneously and to compare case/control brain networks.
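To make the adjacency-matrix construction concrete, the sketch below builds a binary adjacency matrix from node time series by testing all pairwise correlations and thresholding the p-values with an FDR procedure. It uses the plain Benjamini-Hochberg rule from statsmodels as a simple stand-in for the pFDR/q-value machinery developed in the thesis; the function name and the alpha level are choices made here.

    import numpy as np
    from itertools import combinations
    from scipy.stats import pearsonr
    from statsmodels.stats.multitest import multipletests

    def fdr_adjacency(ts, alpha=0.05):
        """Binary adjacency matrix from node time series (T x N array):
        test each pairwise correlation, then keep only the links whose
        p-values survive an FDR threshold (BH here)."""
        n = ts.shape[1]
        pairs = list(combinations(range(n), 2))
        pvals = np.array([pearsonr(ts[:, i], ts[:, j])[1] for i, j in pairs])
        reject = multipletests(pvals, alpha=alpha, method='fdr_bh')[0]
        adj = np.zeros((n, n), dtype=int)
        for (i, j), keep in zip(pairs, reject):
            adj[i, j] = adj[j, i] = int(keep)
        return adj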
APA, Harvard, Vancouver, ISO, and other styles
21

Heesen, Philipp [Verfasser], Arnold [Akademischer Betreuer] Janssen, and Helmut [Akademischer Betreuer] Finner. "Adaptive step up tests for the false discovery rate (FDR) under independence and dependence / Philipp Heesen. Gutachter: Arnold Janssen ; Helmut Finner." Düsseldorf : Universitäts- und Landesbibliothek der Heinrich-Heine-Universität Düsseldorf, 2015. http://d-nb.info/1064694039/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Yi, Hui. "Assessment of Penalized Regression for Genome-wide Association Studies." Diss., Virginia Tech, 2014. http://hdl.handle.net/10919/64845.

Full text
Abstract:
The data from genome-wide association studies (GWAS) in humans are still predominantly analyzed using single marker association methods. As an alternative to Single Marker Analysis (SMA), all or subsets of markers can be tested simultaneously. This approach requires a form of Penalized Regression (PR), as the number of SNPs is much larger than the sample size. Here we review PR methods in the context of GWAS, extend them to perform penalty parameter and SNP selection by False Discovery Rate (FDR) control, and assess their performance (including penalties incorporating linkage disequilibrium) in comparison with SMA. PR methods were compared with SMA on realistically simulated GWAS data consisting of genotype data from single and multiple chromosomes and a continuous phenotype, and on real data. Based on our comparisons, our analytic FDR criterion may currently be the best approach to SNP selection using PR for GWAS. We found that PR with FDR control provides substantially more power than SMA with genome-wide type-I error control but somewhat less power than SMA with Benjamini-Hochberg FDR control. PR controlled the FDR conservatively while SMA-BH may not achieve FDR control in all situations. Differences among PR methods seem quite small when the focus is on variable selection with FDR control. Incorporating LD into PR by adapting penalties developed for covariates measured on graphs can improve power but also generate more false positives or wider regions for follow-up. We recommend using the Elastic Net with a mixing weight for the Lasso penalty near 0.5 as the best method.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
23

Breheny, Patrick John. "Regularized methods for high-dimensional and bi-level variable selection." Diss., University of Iowa, 2009. https://ir.uiowa.edu/etd/325.

Full text
Abstract:
Many traditional approaches cease to be useful when the number of variables is large in comparison with the sample size. Penalized regression methods have proved to be an attractive approach, both theoretically and empirically, for dealing with these problems. This thesis focuses on the development of penalized regression methods for high-dimensional variable selection. The first part of this thesis deals with problems in which the covariates possess a grouping structure that can be incorporated into the analysis to select important groups as well as important members of those groups. I introduce a framework for grouped penalization that encompasses the previously proposed group lasso and group bridge methods, sheds light on the behavior of grouped penalties, and motivates the proposal of a new method, group MCP. The second part of this thesis develops fast algorithms for fitting models with complicated penalty functions such as grouped penalization methods. These algorithms combine the idea of local approximation of penalty functions with recent research into coordinate descent algorithms to produce highly efficient numerical methods for fitting models with complicated penalties. Importantly, I show these algorithms to be both stable and linear in the dimension of the feature space, allowing them to be efficiently scaled up to very large problems. In the third part of this thesis, I extend the idea of false discovery rates to penalized regression. The Karush-Kuhn-Tucker conditions describing penalized regression estimates provide testable hypotheses involving partial residuals. I use these hypotheses to connect the previously disparate fields of multiple comparisons and penalized regression, develop estimators for the false discovery rates of methods such as the lasso and elastic net, and establish theoretical results. Finally, the methods from all three sections are studied in a number of simulations and applied to real data from gene expression and genetic association studies.
APA, Harvard, Vancouver, ISO, and other styles
24

Labare, Mathieu. "Search for cosmic sources of high energy neutrinos with the AMANDA-II detector." Doctoral thesis, Universite Libre de Bruxelles, 2010. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/210183.

Full text
Abstract:
AMANDA-II is a neutrino telescope which comprises a three-dimensional array of optical sensors deployed in the South Pole glacier.

Its principle rests on the detection of the Cherenkov radiation emitted by charged secondary particles produced by the interaction of a high energy neutrino (> 100 GeV) with the matter surrounding the detector.

This work is based on data recorded by the AMANDA-II detector between 2000 and 2006 in order to search for cosmic sources of neutrinos. A potential signal must be extracted from the overwhelming background of muons and neutrinos originating from the interaction of primary cosmic rays within the atmosphere.

The observation is limited to the northern hemisphere in order to be free of the atmospheric muon background, which is stopped by the Earth. However, atmospheric neutrinos constitute an irreducible background composing the main part of the 6100 events selected for this analysis.

It is nevertheless possible to identify a point source of cosmic neutrinos by looking for a local excess breaking away from the isotropic background of atmospheric neutrinos.

This search is coupled with a selection based on energy, since the energy spectrum differs between the two categories of neutrinos.

An original statistical approach has been developed in order to optimize the detection of point sources, whilst controlling the false discovery rate -- hence the confidence level -- of an observation. This method is based solely on the knowledge of the background hypothesis, without any assumption on the production model of neutrinos in the sought sources. Moreover, the method naturally accounts for the trial factor inherent in multiple testing. The procedure was applied to the final sample of events collected by AMANDA-II.
Doctorate in Sciences

APA, Harvard, Vancouver, ISO, and other styles
25

Shen, Shihao. "Statistical methods for deep sequencing data." Diss., University of Iowa, 2012. https://ir.uiowa.edu/etd/5059.

Full text
Abstract:
Ultra-deep RNA sequencing has become a powerful approach for genome-wide analysis of pre-mRNA alternative splicing. We develop MATS (Multivariate Analysis of Transcript Splicing), a Bayesian statistical framework for flexible hypothesis testing of differential alternative splicing patterns on RNA-Seq data. MATS uses a multivariate uniform prior to model the between-sample correlation in exon splicing patterns, and a Markov chain Monte Carlo (MCMC) method coupled with a simulation-based adaptive sampling procedure to calculate the P value and false discovery rate (FDR) of differential alternative splicing. Importantly, the MATS approach is applicable to almost any type of null hypotheses of interest, providing the flexibility to identify differential alternative splicing events that match a given user-defined pattern. We evaluated the performance of MATS using simulated and real RNA-Seq data sets. In the RNA-Seq analysis of alternative splicing events regulated by the epithelial-specific splicing factor ESRP1, we obtained a high RT-PCR validation rate of 86% for differential alternative splicing events with a MATS FDR of < 10%. Additionally, over the full list of RT-PCR tested exons, the MATS FDR estimates matched well with the experimental validation rate. Our results demonstrate that MATS is an effective and flexible approach for detecting differential alternative splicing from RNA-Seq data.
APA, Harvard, Vancouver, ISO, and other styles
26

Scott, Nigel A. "An Application of Armitage Trend Test to Genome-wide Association Studies." Digital Archive @ GSU, 2009. http://digitalarchive.gsu.edu/math_theses/74.

Full text
Abstract:
Genome-wide Association (GWA) studies have become a widely used method for analyzing genetic data. They are useful in detecting associations that may exist between particular alleles and diseases of interest. This thesis investigates the dataset provided in Problem 1 of the Genetic Analysis Workshop 16 (GAW 16). The dataset consists of GWA data from the North American Rheumatoid Arthritis Consortium (NARAC). The thesis attempts to determine a set of single nucleotide polymorphisms (SNPs) that are significantly associated with rheumatoid arthritis. Moreover, it addresses the question of which alternative hypothesis is appropriate: the one-sided alternative that the minor allele is positively associated with the disease, or the two-sided alternative that the genotypes at a locus are associated with the disease; put another way, whether examining both alternative hypotheses yields more information.
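For readers unfamiliar with the test in the title, the sketch below gives a minimal implementation of the Cochran-Armitage trend test for case/control genotype counts, returning both the one-sided and two-sided p-values discussed in the abstract. It uses the common large-sample score-test form (finite-sample corrections vary) and is not code from the thesis; the genotype scores (0, 1, 2) are the usual additive choice.

    import numpy as np
    from scipy.stats import norm

    def armitage_trend_test(cases, controls, scores=(0, 1, 2)):
        """Cochran-Armitage trend test for a 2 x 3 genotype table.
        cases/controls: counts per genotype (aa, Aa, AA).
        Returns the Z statistic plus one- and two-sided p-values."""
        r = np.asarray(cases, dtype=float)
        c = np.asarray(controls, dtype=float)
        s = np.asarray(scores, dtype=float)
        n = r + c                            # genotype column totals
        N, R = n.sum(), r.sum()
        p_hat = R / N                        # overall case proportion
        t = np.sum(s * (r - n * p_hat))      # score-weighted deviation from H0
        var = p_hat * (1 - p_hat) * (np.sum(s**2 * n) - np.sum(s * n)**2 / N)
        z = t / np.sqrt(var)
        return z, norm.sf(z), 2 * norm.sf(abs(z))

    # e.g. armitage_trend_test(cases=[100, 200, 150], controls=[180, 220, 90])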
APA, Harvard, Vancouver, ISO, and other styles
27

Xu, Yihuan. "ROBUST ESTIMATION OF THE PARAMETERS OF g - and - h DISTRIBUTIONS, WITH APPLICATIONS TO OUTLIER DETECTION." Diss., Temple University Libraries, 2014. http://cdm16002.contentdm.oclc.org/cdm/ref/collection/p245801coll10/id/294733.

Full text
Abstract:
Statistics
Ph.D.
The g - and - h distributional family is generated from a relatively simple transformation of the standard normal. By changing the skewness and elongation parameters g and h, this distributional family can approximate a broad spectrum of commonly used distributional shapes, such as normal, lognormal, Weibull and exponential. Consequently, it is easy to use in simulation studies and has been applied in multiple areas, including risk management, stock return analysis and missing data imputation studies. The currently available methods to estimate the g - and - h distributional family include the letter value based method (LV), the numerical maximum likelihood method (NMLE), and moment methods. Although these methods work well when no outliers or contaminations exist, they are not resistant to a moderate amount of contaminated observations or outliers. Meanwhile, NMLE is a computationally time-consuming method when the data sample size is large. In this dissertation a quantile based least squares (QLS) estimation method is proposed to fit the g - and - h distributional family parameters, and its basic properties are derived. The QLS method is then extended to a robust version (rQLS). Simulation studies are performed to compare the performance of the QLS and rQLS methods with the LV and NMLE methods in estimating the g - and - h parameters from random samples with or without outliers. In random samples without outliers, QLS and rQLS estimates are comparable to LV and NMLE in terms of bias and standard error. On the other hand, rQLS performs better than the other, non-robust methods in estimating the g - and - h parameters when a moderate amount of contaminated observations or outliers exists. The flexibility of the g - and - h distribution and the robustness of the rQLS method make it a useful tool in various fields. The boxplot (BP) method has been used in multiple outlier detection by controlling the some-outside rate, which is the probability of one or more observations, in an outlier-free sample, falling into the outlier region. The BP method is distribution dependent. Usually the random sample is assumed normally distributed; however, this assumption may not be valid in many applications. The robustly estimated g - and - h distribution provides an alternative approach without distributional assumptions. Simulation studies indicate that the BP method based on the robustly estimated g - and - h distribution identified a reasonable number of true outliers while controlling the number of false outliers and the some-outside rate, compared to the normal distributional assumption when it is not valid. Another application of the robust g - and - h distribution is as an empirical null distribution in the false discovery rate method (denoted as the BH method hereafter). The performance of the BH method depends on the accuracy of the null distribution. It has been found that theoretical null distributions are often not valid when simultaneously performing many thousands, even millions, of hypothesis tests. Therefore, an empirical null distribution approach is introduced that uses a distribution estimated from the data. This is recommended as a substitute for the currently used empirical null methods of fitting a normal distribution or another member of the exponential family. Similar to the BP outlier detection method, the robustly estimated g - and - h distribution can be used as an empirical null distribution without any distributional assumptions. Several real microarray data examples are used as illustrations.
The QLS and rQLS methods are useful tools to estimate g - and - h parameters, especially rQLS, because it noticeably reduces the effect of outliers on the estimates. The robustly estimated g - and - h distributions have multiple applications where distributional assumptions are required, such as boxplot outlier detection or BH methods.
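As a small illustration of the "relatively simple transformation of the standard normal" mentioned above, the sketch below simulates draws from Tukey's g - and - h family. The loc/scale arguments stand in for the usual location and scale constants, and none of the thesis's estimation methods (LV, NMLE, QLS, rQLS) are reproduced here.

    import numpy as np

    def g_and_h_sample(n, g=0.2, h=0.1, loc=0.0, scale=1.0, rng=None):
        """Draw from Tukey's g-and-h family by transforming standard normals:
        X = loc + scale * ((exp(g*Z) - 1) / g) * exp(h * Z**2 / 2),
        with the g -> 0 limit X = loc + scale * Z * exp(h * Z**2 / 2)."""
        rng = np.random.default_rng(rng)
        z = rng.standard_normal(n)
        core = z if g == 0 else (np.exp(g * z) - 1.0) / g   # skewness part
        return loc + scale * core * np.exp(h * z**2 / 2.0)  # elongation part

Setting g = h = 0 recovers the normal, which is one reason the family is convenient for simulation studies like those described in the abstract.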
Temple University--Theses
APA, Harvard, Vancouver, ISO, and other styles
28

Nascimento, Guilherme Batista do [UNESP]. "Estratégias de imputação e associação genômica com dados de sequenciamento para características de produção de leite na raça Gir." Universidade Estadual Paulista (UNESP), 2018. http://hdl.handle.net/11449/153060.

Full text
Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
Implementing next-generation sequencing (NGS) data in animal breeding programs represents the latest tool in the use of genotypic data in genomic association models, since all polymorphisms are considered in the associations between phenotypic records and sequence data. As with any new technology, variant prospecting still represents a computational and cost-effectiveness challenge for large-scale implementation. Facing these challenges, this work sought ways to explore the benefits of using NGS in genomic predictions and to overcome the inherent limitations of this process. Phenotypic and genotypic (Illumina Bovine HD BeadChip) records of 2,279 Gir animals (Bos taurus indicus) were made available by Embrapa Gado de Leite (MG) and used for genomic association analysis. In addition, sequence data of 53 animals from the 1000 Bulls Project gave rise to the imputation reference population. In order to verify the imputation efficiency, different scenarios were tested for imputation accuracy through leave-one-out analysis using only the sequencing data, with efficiencies of up to 84% in the scenario with all 51 animals available after quality control. The influence of low-frequency variants on imputation accuracy in different regions of the genome was also verified. After identifying the best imputation reference population structure and applying quality control to the NGS and genomic data, it was possible to impute the 2,237 genotyped animals that passed quality control up to sequence level and to perform genomic association analyses for milk yield (PL305), fat content (PG305), protein content (PP305) and total solids (PS305), measured at 305 days in dairy Gir animals. For this, deregressed estimated breeding values (dEBV) were used as the response variable in a multiple regression model. Regions of 1 Mb containing 100 or more variants with a False Discovery Rate (FDR) lower than 0.05 were considered statistically significant and submitted to pathway enrichment analysis using MeSH (Medical Subject Headings) terms. The three significant regions (FDR < 0.05) for PS305 were observed on chromosomes 11, 12 and 28, and the only significant region for PG305 was on chromosome 6. These regions contained variants associated with metabolic pathways of milk production that are absent from commercial genotyping panels and may represent candidate genes for selection.
CAPES/Embrapa agreement (call 15/2014)
APA, Harvard, Vancouver, ISO, and other styles
29

Manandhr-Shrestha, Nabin K. "Statistical Learning and Behrens Fisher Distribution Methods for Heteroscedastic Data in Microarray Analysis." Scholar Commons, 2010. http://scholarcommons.usf.edu/etd/3513.

Full text
Abstract:
The aim of the present study is to identify the differentially expressed genes between two different conditions and to apply them in predicting the class of new samples using microarray data. Microarray data analysis poses many challenges to statisticians because of its high dimensionality and small sample size, the so-called "small n, large p" problem. Microarray data have been extensively studied by statisticians and geneticists. They are generally assumed to follow a normal distribution with equal variances in the two conditions, but this does not hold in general. Since the number of replications is very small, the sample estimates of the variances are not appropriate for testing; we therefore adopt a Bayesian approach to approximate the variances in the two conditions. Because the number of genes to be tested is usually large and the test is repeated thousands of times, there is a multiplicity problem. Applying the hypothesis test repeatedly, gene by gene, for several thousands of genes creates a great chance of selecting false genes as differentially expressed, even though the significance level is set very small; for the test to be reliable, the probability of selecting true positives should be high. To control the false positive rate, we apply the False Discovery Rate (FDR) correction, in which the p-value of each gene is compared with its corresponding threshold; a gene is then declared differentially expressed if its p-value is less than the threshold. We have developed a new method of selecting informative genes based on a Bayesian version of the Behrens-Fisher distribution, which assumes unequal variances in the two conditions. Since the assumption of equal variances fails in most situations and equal variance is a special case of unequal variance, we address the problem of finding differentially expressed genes in the unequal-variance case. We found that the developed method selects the truly expressed genes in simulated data, and we compared it with recent methods such as Fox and Dimmic's t-test and Tusher and Tibshirani's SAM method, among others. The next step of this research is to check whether the genes selected by the proposed Behrens-Fisher method are useful for the classification of samples. Combining the Behrens-Fisher gene selection method with other statistical learning methods, we obtained better classification results, owing to its ability to select genes based on both prior knowledge and the data. For microarray data, because of the small sample size and the large number of variables, the sample variance estimate is not reliable in the sense that it is not positive definite and not invertible; we derived the Bayesian version of the Behrens-Fisher distribution to remove that insufficiency. The efficiency of the established method is demonstrated by applying it to three real microarray data sets and calculating the misclassification error rates on the corresponding test sets. Moreover, we compared our results with other popular methods found in the literature, such as the Nearest Shrunken Centroid and Support Vector Machine methods. We studied the classification performance of different classifiers before and after taking the correlation between the genes into account.
The classification performance of the classifiers improved significantly once the correlation was accounted for; performance was measured by misclassification rates and confusion matrices. Another problem in the multiple testing of a large number of hypotheses is the correlation among the test statistics, which we also take into account. If there were no correlation, it would not affect the shape of the normalized histogram of the test statistics; as shown by Efron, the degree of correlation among the test statistics either widens or narrows the tails of this histogram. Thus the usual rejection region obtained from the significance level is not sufficient: it should be redefined according to the degree of correlation. The effect of the correlation on selecting the appropriate rejection region has also been studied.
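The per-gene threshold comparison described above is the kind of step-up rule typified by Benjamini-Hochberg. A minimal sketch on simulated p-values (not the author's code; the gene counts and effect sizes below are arbitrary choices for illustration):

    import numpy as np

    def bh_fdr(pvals, q=0.05):
        # Benjamini-Hochberg step-up rule: compare the i-th smallest p-value
        # with its threshold i*q/m and reject every hypothesis up to the
        # largest p-value that still falls below its threshold.
        pvals = np.asarray(pvals)
        m = len(pvals)
        order = np.argsort(pvals)
        below = np.nonzero(pvals[order] <= q * np.arange(1, m + 1) / m)[0]
        rejected = np.zeros(m, dtype=bool)
        if below.size:
            rejected[order[:below.max() + 1]] = True
        return rejected

    # toy data: 950 null genes (uniform p-values) and 50 differentially expressed genes
    rng = np.random.default_rng(0)
    pvals = np.concatenate([rng.uniform(size=950), rng.beta(0.1, 10.0, size=50)])
    print(bh_fdr(pvals).sum(), "genes declared differentially expressed at FDR 0.05")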
APA, Harvard, Vancouver, ISO, and other styles
30

Mogrovejo, Carrasco Daniel Estuardo. "Enhancing Pavement Surface Macrotexture Characterization." Diss., Virginia Tech, 2015. http://hdl.handle.net/10919/51957.

Full text
Abstract:
One of the most important objectives for transportation engineers is to understand pavement surface properties and their positive and negative effects on the user. This can improve the design of the infrastructure, the adequacy of tools, and the consistency of methodologies that are essential for transportation practitioners regarding macrotexture characterization. Important pavement surface characteristics, or tire-pavement interactions, such as friction, tire-pavement noise, splash and spray, and rolling resistance, are significantly influenced by pavement macrotexture. This dissertation compares static and dynamic macrotexture measurements and proposes an enhanced method to quantify macrotexture. Dynamic measurements performed with vehicle-mounted lasers have the advantage of measuring macrotexture at traffic speed. One drawback of these laser devices is the presence of 'spikes' in the collected data, which impact the texture measurements. The dissertation proposes two robust and innovative methods to overcome this limitation. The first method is a data-driven adaptive method that detects and removes the spikes from high-speed laser texture measurements. The method first calculates the discrete wavelet transform of the texture measurements. It then detects (at all levels) and removes the spikes from the obtained wavelet coefficients (or differences). Finally, it calculates the inverse discrete wavelet transform with the processed wavelet coefficients (without outliers) to obtain the Mean Profile Depth (MPD) from the measurements with the spikes removed. The method was validated by comparing the results with MPD measurements obtained with a Circular Texture Meter (CTMeter), chosen as the control device. Although this first method was able to successfully remove the spikes, it has the drawback of depending on manual modeling of the distribution of the wavelet coefficients to correctly define an appropriate threshold. The next step of this dissertation proposes an enhancement of the spike removal methodology for macrotexture measurements taken with high-speed laser devices. This denoising methodology uses an algorithm that models the distribution of texture measurements with the family of Generalized Gaussian Distributions (GGD), along with the False Discovery Rate (FDR) method, which controls the proportion of wrongly identified spikes among all identified spikes. The FDR control allows an adaptive threshold selection that differentiates between valid measurements and spikes. The validation of the method showed that the MPD results obtained with denoised dynamic measurements are comparable to MPD results from the control devices. This second method is a crucial step in the last stage of the dissertation, as explained below. The last part of the dissertation presents an enhanced macrotexture characterization index based on the Effective Area for Water Evacuation (EAWE), which (1) estimates the potential of the pavement to drain water and (2) correlates better with two pavement surface properties affected by macrotexture (friction and noise) than the current MPD method. The proposed index is defined by a three-step process that (1) removes the spikes, ensuring the reliability of the texture profile data, (2) finds the enveloping profile needed to delimit the area between the tire and the pavement when contact occurs, and (3) computes the EAWE.
Comparisons of the current (MPD) and proposed (EAWE) macrotexture characterization indices showed that the MPD overestimates the ability of the pavement to drain surface water under a tire.
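As a rough illustration of the DWT-based spike removal pipeline described above (transform, clean the detail coefficients, invert), here is a minimal sketch using the PyWavelets package. The simple median/MAD cut-off stands in for the GGD- and FDR-based threshold developed in the dissertation, and the texture profile is synthetic:

    import numpy as np
    import pywt

    def despike_profile(profile, wavelet="db4", level=4, k=5.0):
        # DWT-based spike removal: transform the profile, zero out detail
        # coefficients that look like outliers (a simple k*MAD cut-off standing
        # in for the GGD/FDR threshold), then invert the transform.
        coeffs = pywt.wavedec(profile, wavelet, level=level)
        cleaned = [coeffs[0]]                                # keep approximation coefficients
        for d in coeffs[1:]:                                 # detail coefficients, level by level
            centre = np.median(d)
            mad = np.median(np.abs(d - centre)) + 1e-12
            cleaned.append(np.where(np.abs(d - centre) > k * mad, 0.0, d))
        return pywt.waverec(cleaned, wavelet)[:len(profile)]

    rng = np.random.default_rng(1)
    profile = np.sin(np.linspace(0, 20, 1024)) + 0.1 * rng.normal(size=1024)
    profile[[100, 500, 900]] += 15.0                         # inject three spikes
    print(np.round(despike_profile(profile)[[100, 500, 900]], 2))   # spike amplitudes strongly reduced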
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
31

Elmi, Mohamed Abdillahi. "Détection des changements de points multiples et inférence du modèle autorégressif à seuil." Thesis, Bourgogne Franche-Comté, 2018. http://www.theses.fr/2018UBFCD005/document.

Full text
Abstract:
This thesis has two parts: the first part deals with the change-point problem, and the second concerns the weak threshold autoregressive (TAR) model, in which the innovations are uncorrelated but not independent. In the first part, we treat change-point analysis. Several detection methods exist in the literature, the main ones being the Penalized Least Squares (PLS) method and the Filtered Derivative (FD) method introduced by Basseville and Nikirov; other approaches include Bayesian change-point methods. We validate a new procedure combining the filtered derivative with false discovery rate control (FDqV) on real data (wind-turbine and heartbeat series), and we give an extension of the FDqV method to weakly dependent random variables. In the second part, we study the threshold autoregressive (TAR) model, which has been investigated by many authors, such as Tong (1983), Petrucelli (1984, 1986), and Chan (1993), and which has numerous applications, for example in economics, biology, and the environment. Until now, the TAR model had been studied under independent innovations; here we treat the case where the innovations are uncorrelated. We establish the asymptotic behaviour of the estimators of the model, including almost sure convergence, convergence in distribution, and uniform convergence of the parameter estimates.
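A minimal sketch of the filtered-derivative statistic at the heart of the FDqV approach (the difference of the means of two adjacent sliding windows); the false-discovery-rate step of the actual method is replaced here by simply locating the maximum, and the series is simulated:

    import numpy as np

    def filtered_derivative(x, window):
        # difference of the means of two adjacent sliding windows; large
        # absolute values point to a candidate change in the mean
        x = np.asarray(x, dtype=float)
        stats = np.zeros(len(x))
        for k in range(window, len(x) - window):
            stats[k] = x[k:k + window].mean() - x[k - window:k].mean()
        return stats

    rng = np.random.default_rng(2)
    series = np.concatenate([rng.normal(0, 1, 300), rng.normal(2, 1, 300)])  # mean shift at index 300
    fd = filtered_derivative(series, window=50)
    print("estimated change point near index", int(np.argmax(np.abs(fd))))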
APA, Harvard, Vancouver, ISO, and other styles
32

Buschmann, Tilo. "The Systematic Design and Application of Robust DNA Barcodes." Doctoral thesis, Universitätsbibliothek Leipzig, 2016. http://nbn-resolving.de/urn:nbn:de:bsz:15-qucosa-209812.

Full text
Abstract:
High-throughput sequencing technologies are improving in quality, capacity, and costs, providing versatile applications in DNA and RNA research. For small genomes or fraction of larger genomes, DNA samples can be mixed and loaded together on the same sequencing track. This so-called multiplexing approach relies on a specific DNA tag, index, or barcode that is attached to the sequencing or amplification primer and hence accompanies every read. After sequencing, each sample read is identified on the basis of the respective barcode sequence. Alterations of DNA barcodes during synthesis, primer ligation, DNA amplification, or sequencing may lead to incorrect sample identification unless the error is revealed and corrected. This can be accomplished by implementing error correcting algorithms and codes. This barcoding strategy increases the total number of correctly identified samples, thus improving overall sequencing efficiency. Two popular sets of error-correcting codes are Hamming codes and codes based on the Levenshtein distance. Levenshtein-based codes operate only on words of known length. Since a DNA sequence with an embedded barcode is essentially one continuous long word, application of the classical Levenshtein algorithm is problematic. In this thesis we demonstrate the decreased error correction capability of Levenshtein-based codes in a DNA context and suggest an adaptation of Levenshtein-based codes that is proven of efficiently correcting nucleotide errors in DNA sequences. In our adaptation, we take any DNA context into account and impose more strict rules for the selection of barcode sets. In simulations we show the superior error correction capability of the new method compared to traditional Levenshtein and Hamming based codes in the presence of multiple errors. We present an adaptation of Levenshtein-based codes to DNA contexts capable of guaranteed correction of a pre-defined number of insertion, deletion, and substitution mutations. Our improved method is additionally capable of correcting on average more random mutations than traditional Levenshtein-based or Hamming codes. As part of this work we prepared software for the flexible generation of DNA codes based on our new approach. To adapt codes to specific experimental conditions, the user can customize sequence filtering, the number of correctable mutations and barcode length for highest performance. However, not every platform is susceptible to a large number of both indel and substitution errors. The Illumina “Sequencing by Synthesis” platform shows a very large number of substitution errors as well as a very specific shift of the read that results in inserted and deleted bases at the 5’-end and the 3’-end (which we call phaseshifts). We argue in this scenario that the application of Sequence-Levenshtein-based codes is not efficient because it aims for a category of errors that barely occurs on this platform, which reduces the code size needlessly. As a solution, we propose the “Phaseshift distance” that exclusively supports the correction of substitutions and phaseshifts. Additionally, we enable the correction of arbitrary combinations of substitution and phaseshift errors. Thus, we address the lopsided number of substitutions compared to phaseshifts on the Illumina platform. To compare codes based on the Phaseshift distance to Hamming Codes as well as codes based on the Sequence-Levenshtein distance, we simulated an experimental scenario based on the error pattern we identified on the Illumina platform. 
Furthermore, we generated a large number of different sets of DNA barcodes using the Phaseshift distance and compared codes of different lengths and error correction capabilities. We found that codes based on the Phaseshift distance can correct a number of errors comparable to codes based on the Sequence-Levenshtein distance while offering the number of DNA barcodes comparable to Hamming codes. Thus, codes based on the Phaseshift distance show a higher efficiency in the targeted scenario. In some cases (e.g., with PacBio SMRT in Continuous Long Read mode), the position of the barcode and DNA context is not well defined. Many reads start inside the genomic insert so that adjacent primers might be missed. The matter is further complicated by coincidental similarities between barcode sequences and reference DNA. Therefore, a robust strategy is required in order to detect barcoded reads and avoid a large number of false positives or negatives. For mass inference problems such as this one, false discovery rate (FDR) methods are powerful and balanced solutions. Since existing FDR methods cannot be applied to this particular problem, we present an adapted FDR method that is suitable for the detection of barcoded reads as well as suggest possible improvements.
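For readers unfamiliar with the distance-based codes discussed above, the following toy sketch computes the classical Levenshtein distance and greedily assembles a barcode set with a guaranteed minimum pairwise distance. It illustrates the general idea only, not the Sequence-Levenshtein or Phaseshift constructions proposed in the thesis; the barcode length and distance below are arbitrary:

    from itertools import product

    def levenshtein(a, b):
        # classical edit distance (substitutions, insertions, deletions)
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution / match
            prev = cur
        return prev[-1]

    def greedy_barcodes(length, min_dist):
        # greedily collect DNA words whose pairwise distance is at least min_dist
        chosen = []
        for word in ("".join(w) for w in product("ACGT", repeat=length)):
            if all(levenshtein(word, c) >= min_dist for c in chosen):
                chosen.append(word)
        return chosen

    codes = greedy_barcodes(length=5, min_dist=3)
    print(len(codes), "barcodes, e.g.", codes[:5])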
APA, Harvard, Vancouver, ISO, and other styles
33

Dewaele, Benoît. "On the performance of hedge funds." Doctoral thesis, Universite Libre de Bruxelles, 2013. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/209487.

Full text
Abstract:
This thesis investigates the performance of hedge funds, funds of hedge funds and alternative Ucits together with the determinants of this performance by using new or well-suited econometric techniques. As such, it lies at the frontier of finance and financial econometrics and contributes to both fields. For the sake of clarity, we summarize the main contributions to each field separately.

The contribution of this thesis to the field of financial econometrics is the time-varying style analysis developed in the second chapter. This statistical tool combines the Sharpe analysis with a time-varying coefficient method, thereby taking the best of both worlds.

Sharpe (1992) has developed the idea of “style analysis”, building on the conclusion that a regression taking into account the constraints faced by mutual funds should give a better picture of their holdings. To get an estimate of their holdings, he incorporates, in a standard regression, typical constraints related to the regulation of mutual funds, such as no short-selling and value preservation. He argues that this gives a more realistic picture of their investments and consequently better estimations of their future expected returns.

Unfortunately, in the style analysis the weights are constrained to be constant. Although the weights of funds of hedge funds should also sum up to 1, their dynamic nature makes the constant-weight assumption more restrictive than for mutual funds. Hence, the econometric literature was lacking a method that incorporates the constraints while allowing the weights to vary. Motivated by this gap, we develop a method that allows the weights to vary while constraining them to sum up to 1, by combining the Sharpe analysis with a time-varying coefficient model. As the style analysis has proven to be a valuable tool for mutual fund analysis, we believe our approach offers many potential fields of application, both for funds of hedge funds and for mutual funds.
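A minimal sketch of the static Sharpe style analysis that this chapter extends (non-negative weights summing to one, fitted by constrained least squares); the time-varying coefficient part of the proposed method is not reproduced here, and the return series are simulated:

    import numpy as np
    from scipy.optimize import minimize

    def style_weights(fund, indices):
        # Static Sharpe style analysis: non-negative weights summing to one
        # that best replicate the fund's returns with the style indices.
        k = indices.shape[1]
        objective = lambda w: float(np.sum((fund - indices @ w) ** 2))
        constraints = ({"type": "eq", "fun": lambda w: np.sum(w) - 1.0},)
        bounds = [(0.0, 1.0)] * k
        start = np.full(k, 1.0 / k)
        return minimize(objective, start, method="SLSQP",
                        bounds=bounds, constraints=constraints).x

    rng = np.random.default_rng(3)
    indices = rng.normal(0.0, 0.02, size=(120, 3))             # 120 months, 3 style indices
    fund = indices @ np.array([0.5, 0.3, 0.2]) + rng.normal(0.0, 0.002, 120)
    print(np.round(style_weights(fund, indices), 2))           # should recover roughly [0.5, 0.3, 0.2]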

The contributions of our thesis to the field of finance are numerous.

Firstly, we are the first to offer a comprehensive and exhaustive assessment of the world of FoHFs. Using both a bootstrap analysis and a method that allows dealing with multiple hypothesis tests straightforwardly, we show that after fees, the majority of FoHFs do not channel alpha from single-manager hedge funds and that only very few FoHFs deliver after-fee alpha per se, i.e. on top of the alpha of the hedge fund indices. We conclude that the added value of the vast majority of FoHFs should thus not be expected to come from the selection of the best HFs but from the risk management-monitoring skills and the easy access they provide to the HF universe.

Secondly, although leverage is one of the key features of funds of hedge funds, there was a gap in the understanding of the impact it might have on the investor’s alpha. This was likely due to the quasi-absence of data about leverage and to the fact that the literature lacked a proper tool to estimate this leverage implicitly.

We fill this gap by proposing a theoretical model of fund of hedge fund leverage and alpha where the cost of borrowing is increasing with leverage. In the literature, this is the first model which integrates the rising cost of borrowing in the leverage decision of FoHFs. We use this model to determine the conditions under which the leverage has a negative or a positive impact on investor’s alpha and show that the manager has an incentive to take a leverage that hurts the investor’s alpha. Next, using estimates of the leverages of a sample of FoHFs obtained through the time-varying style analysis, we show that leverage has indeed a negative impact on alphas and appraisal ratios. We argue that this effect may be an explanation for the disappointing alphas delivered by funds of hedge funds and can be interpreted as a potential explanation for the “capacity constraints” effect. To the best of our knowledge, we are the first to report and explain this negative relationship between alpha and leverage in the industry.

Thirdly, we show the interest of the time-varying coefficient model in hedge fund performance assessment and selection. Since the literature underlines that manager skills are varying with macro-economic conditions, the alpha should be dynamic. Unfortunately, using ordinary least-squares regressions forces the estimate of the alpha to be constant over the estimation period. The alpha of an OLS regression is thus static whereas the alpha generation process is by nature varying. On the other hand, we argue that the time-varying alpha captures this dynamic behaviour.

As the literature shows that abnormal-return persistence is essentially short-term, we claim that using the quasi-instantaneous detection ability of the time-varying model to determine the abnormal-return should lead to outperforming portfolios. Using a persistence analysis, we check this conjecture and show that contrary to top performers in terms of OLS alpha, the top performers in terms of past time-varying alpha generate superior and significant ex-post performance. Additionally, we contribute to the literature on the topic by showing that persistence exists and can be as long as 3 years. Finally, we use the time-varying analysis to obtain estimates of the expected returns of hedge funds and show that using those estimates in a mean-variance framework leads to better ex-post performance. Therefore, we conclude that in terms of hedge fund performance detection, the time-varying model is superior to the OLS analysis.

Lastly, we investigate the funds that have chosen to adopt the “Alternative UCITS” framework. Contrary to the previous frameworks that were designed for mutual fund managers, this new set of European Union directives can be suited to hedge fund-like strategies. We show that for Ucits funds there is some evidence, although weak, of the added value of offshore experience. On the other hand, we find no evidence of added value in the case of non-offshore experienced managers. Motivated to further refine our results, we separate Ucits with offshore experienced managers into two groups: those with equivalent offshore hedge funds (replicas) and those without (new funds). This time, Ucits with no offshore equivalents show low volatility and a strongly positive alpha. Ucits with offshore equivalents on the other hand bring no added value and, not surprisingly, bear no substantial differences in their risk profile with their paired funds offshore. Therefore, we conclude that offshore experience plays a significant role in creating positive alpha, as long as it translates into real innovations. If the fund is a pure replica, the additional costs brought by the Ucits structure represent a handicap that is hardly compensated. As “Alternative Ucits” have only been scarcely investigated, this paper represents a contribution to the better understanding of those funds.

In summary, this thesis improves the knowledge of the distribution, detection and determinants of the performance in the industry of hedge funds. It also shows that a specific field such as the hedge fund industry can still tell us more about the sources of its performance as long as we can use methodologies in adequacy with their behaviour, uses, constraints and habits. We believe that both our results and the methods we use pave the way for future research questions in this field, and are of the greatest interest for professionals of the industry as well.


Doctorat en Sciences économiques et de gestion
info:eu-repo/semantics/nonPublished

APA, Harvard, Vancouver, ISO, and other styles
34

Stephens, Nathan Wallace. "A Comparison of Microarray Analyses: A Mixed Models Approach Versus the Significance Analysis of Microarrays." BYU ScholarsArchive, 2006. https://scholarsarchive.byu.edu/etd/1115.

Full text
Abstract:
DNA microarrays are a relatively new technology for assessing the expression levels of thousands of genes simultaneously. Researchers hope to find genes that are differentially expressed by hybridizing cDNA from known treatment sources with various genes spotted on the microarrays. The large number of tests involved in analyzing microarrays has raised new questions in multiple testing. Several approaches for identifying differentially expressed genes have been proposed. This paper considers two: (1) a mixed models approach, and (2) the Significance Analysis of Microarrays.
APA, Harvard, Vancouver, ISO, and other styles
35

Bécu, Jean-Michel. "Contrôle des fausses découvertes lors de la sélection de variables en grande dimension." Thesis, Compiègne, 2016. http://www.theses.fr/2016COMP2264/document.

Full text
Abstract:
In the regression framework, many studies focus on the high-dimensional problem, where the number of explanatory variables measured on each sample is much larger than the number of samples. Although variable selection is a classical question, the usual methods do not apply in the high-dimensional case. In this manuscript, we therefore develop the transposition of classical statistical tests to high dimension. These tests are built on estimators of the regression coefficients produced by penalized linear regression approaches that are applicable in high dimension, and their main objective is to control the false discovery rate. The first contribution of this manuscript addresses the quantification of the uncertainty of regression coefficients estimated by ridge regression, which penalizes the coefficients by their l2 norm, in the high-dimensional setting; for this, we devise a statistical test based on permutations. The second contribution is based on a two-step selection approach: a first screening step, based on the sparse Lasso regression, precedes the selection step proper, in which the relevance of the pre-selected variables is tested. These tests are built on the adaptive-ridge estimator, whose penalty is constructed from the Lasso coefficients learned during the screening step. A last contribution transposes this approach to the selection of groups of variables.
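A schematic of the screen-then-test idea on simulated data: a Lasso screening step on one half of the sample followed by classical tests for the screened variables on the other half. This is only a simplified stand-in; the thesis's own procedure uses adaptive-ridge estimates and permutation-based tests rather than the data splitting and OLS t-tests shown here:

    import numpy as np
    from scipy import stats
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(4)
    n, p = 200, 500
    X = rng.normal(size=(n, p))
    beta = np.zeros(p)
    beta[:5] = 1.0                                  # five truly relevant variables
    y = X @ beta + rng.normal(size=n)

    # step 1: screening on the first half of the sample with the Lasso
    half = n // 2
    lasso = LassoCV(cv=5).fit(X[:half], y[:half])
    kept = np.flatnonzero(lasso.coef_ != 0)
    kept = kept[np.argsort(-np.abs(lasso.coef_[kept]))[:20]]   # cap the screened set for this toy example

    # step 2: classical OLS tests for the screened variables on the held-out half
    Xk = np.column_stack([np.ones(n - half), X[half:, kept]])
    coef, *_ = np.linalg.lstsq(Xk, y[half:], rcond=None)
    resid = y[half:] - Xk @ coef
    df = Xk.shape[0] - Xk.shape[1]
    se = np.sqrt(resid @ resid / df * np.diag(np.linalg.inv(Xk.T @ Xk)))
    pvals = 2 * stats.t.sf(np.abs(coef / se), df=df)
    print({int(j): round(float(pv), 4) for j, pv in zip(kept, pvals[1:])})     # skip the intercept

The cleaned p-values from the second step would then be passed to an FDR-controlling procedure.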
APA, Harvard, Vancouver, ISO, and other styles
36

Nguyen, Van Hanh. "Modèles de mélange semi-paramétriques et applications aux tests multiples." Phd thesis, Université Paris Sud - Paris XI, 2013. http://tel.archives-ouvertes.fr/tel-00987035.

Full text
Abstract:
In a multiple testing context, we consider a semi-parametric mixture model with two components. One component is assumed known and corresponds to the distribution of the p-values under the null hypothesis, with prior probability p. The other component f is nonparametric and represents the distribution of the p-values under the alternative hypothesis. The problem of estimating the parameters p and f of the model arises in procedures that control the false discovery rate (FDR). In the first part of this dissertation, we study the estimation of the proportion p. We discuss asymptotic efficiency results and establish that two different cases arise depending on whether f vanishes on a non-empty interval or not. In the first case (f vanishes on a whole interval), we present estimators that converge at the parametric rate, compute the optimal asymptotic variance, and conjecture that no estimator is asymptotically efficient (i.e., attains the optimal asymptotic variance). In the second case, we prove that the quadratic risk of any estimator does not converge at the parametric rate. In the second part of the dissertation, we focus on the estimation of the unknown nonparametric component f of the mixture, relying on a preliminary estimator of p. We propose and study the asymptotic properties of two different estimators of this unknown component. The first is a kernel estimator with random weights; we establish an upper bound on its pointwise quadratic risk, showing a classical nonparametric rate of convergence over a Hölder class. The second is a regularized maximum likelihood estimator, computed by an iterative algorithm for which we establish a descent property of a criterion. Moreover, these estimators are used in a multiple testing procedure to estimate the local false discovery rate (lfdr).
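A crude sketch of the two-group model on p-values that motivates these estimators: the null proportion is estimated Storey-style, the marginal density by a histogram, and the local false discovery rate as their ratio. Neither the random-weight kernel estimator nor the regularized maximum likelihood estimator of the thesis is reproduced here, and the data are simulated:

    import numpy as np

    def lfdr_pvalues(pvals, lam=0.5, bins=50):
        # Two-group model on p-values: g(p) = pi0 * 1 + (1 - pi0) * f(p).
        # pi0 is estimated Storey-style from p-values above lam, the marginal
        # density g by a plain histogram on [0, 1], and the local false
        # discovery rate as lfdr(p) = pi0 / g(p), capped at 1.
        pvals = np.asarray(pvals)
        pi0 = min(1.0, float(np.mean(pvals > lam)) / (1.0 - lam))
        dens, edges = np.histogram(pvals, bins=bins, range=(0.0, 1.0), density=True)
        idx = np.minimum(np.searchsorted(edges, pvals, side="right") - 1, bins - 1)
        return np.minimum(1.0, pi0 / dens[idx]), pi0

    rng = np.random.default_rng(5)
    p = np.concatenate([rng.uniform(size=800), rng.beta(0.1, 10.0, size=200)])  # 20% alternatives
    lfdr, pi0 = lfdr_pvalues(p)
    print("estimated pi0:", round(pi0, 2), "| p-values with lfdr < 0.2:", int((lfdr < 0.2).sum()))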
APA, Harvard, Vancouver, ISO, and other styles
37

Rosahl, Agnes Lioba. "How tissues tell time." Doctoral thesis, Humboldt-Universität zu Berlin, Mathematisch-Naturwissenschaftliche Fakultät I, 2015. http://dx.doi.org/10.18452/17113.

Full text
Abstract:
A circadian clock in peripheral tissues regulates physiological functions through the timing of gene expression. However, despite the common and well-studied core clock mechanism, tissue-specific regulation of circadian genes is only marginally understood. Overrepresentation analysis is a tool for detecting transcription factor binding sites that might play a role in the regulation of co-expressed genes. To apply it to circadian genes, which share a period of about 24 hours but differ in peak phase timing and in the tissue-specificity of their oscillation, a clear definition of co-expressed gene subgroups and an appropriate choice of background genes are important prerequisites. In this setting of multiple subgroup comparisons, a hierarchical method for false discovery rate control reveals the significant findings. Based on two microarray time series in mouse macrophages and liver cells, tissue-specific regulation of circadian genes in these cell types is investigated by promoter analysis. Binding sites for the CLOCK:BMAL1, NF-Y, and CREB transcription factors are among the common top candidates of overrepresented motifs. Related transcription factors of the BHLH and BZIP families with specific complexation domains bind to motif variants with differing strengths, thereby arranging interactions with more tissue-specific regulators (e.g. HOX, GATA, FORKHEAD, REL, IRF, and ETS regulators and nuclear receptors). Presumably, this influences the timing of pre-initiation complex formation and hence tissue-specific transcription patterns. In this respect, the content of guanine (G) and cytosine (C) bases as well as of CpG dinucleotides are important promoter properties directing the interaction probability of regulators, because the affinities with which transcription factors are attracted to promoters depend on these sequence characteristics.
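The overrepresentation step can be pictured as a hypergeometric test per motif followed by a multiplicity adjustment; the counts below are made up for illustration, and the flat Benjamini-Hochberg adjustment shown here stands in for the hierarchical FDR procedure used in the thesis:

    import numpy as np
    from scipy.stats import hypergeom

    def enrichment_pvalue(hits_fg, n_fg, hits_bg, n_bg):
        # One-sided hypergeometric test: does the motif hit the foreground
        # promoters more often than expected from foreground plus background?
        M = n_fg + n_bg              # total number of promoters
        K = hits_fg + hits_bg        # promoters containing the motif
        return float(hypergeom.sf(hits_fg - 1, M, K, n_fg))

    # hypothetical hit counts in 200 foreground vs 2000 background promoters
    counts = {"CLOCK:BMAL1": (60, 200, 240, 2000),
              "NF-Y":        (45, 200, 260, 2000),
              "CREB":        (30, 200, 280, 2000)}
    pvals = {motif: enrichment_pvalue(*c) for motif, c in counts.items()}

    # Benjamini-Hochberg adjusted p-values across the tested motifs
    order = sorted(pvals, key=pvals.get)
    adj, running = {}, 1.0
    for rank, motif in reversed(list(enumerate(order, start=1))):
        running = min(running, pvals[motif] * len(order) / rank)
        adj[motif] = running
    print({motif: "%.2e" % adj[motif] for motif in order})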
APA, Harvard, Vancouver, ISO, and other styles
38

Yang-YuCheng and 鄭暘諭. "Estimation of False Discovery Rate Using Empirical Bayes Method." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/78t3ye.

Full text
Abstract:
Master's thesis
National Cheng Kung University
Department of Statistics
104
In multiple testing problems, if the individual type I error rate is not adjusted and each test is still carried out at significance level α, the overall type I error rate of m hypotheses can inflate to as much as mα. This study assumes that the genes follow a normal mixture distribution and that the parameters have prior distributions. We use the Bayesian posterior distribution and the EM algorithm to estimate the proportion of true null hypotheses, and from it the number of true null hypotheses and the FDR. We compare the performance of these estimators for different parameter settings through Monte Carlo simulation. The estimator using the McNemar test proposed by Ma & Chao (2011) can yield excessively large estimation errors when the significance level is set to α = 0.05. The estimator proposed by Benjamini & Hochberg (2000) is unstable when the gene mutation ratio is set to be random, and the estimator using the Friedman test proposed by Ma & Tsai (2011) shows the same behaviour. When the number of genes and the number of patients are both large and the proportion of true null hypotheses is high, the proposed EBay estimator has the smaller RMSE and is therefore more accurate.
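A stripped-down version of the estimation idea: an EM algorithm for a two-component normal mixture of test statistics yields an estimate of the proportion of true null hypotheses and posterior null probabilities, from which a Bayesian FDR estimate for any rejection region follows. The prior distributions placed on the parameters in the thesis are omitted, and the data are simulated:

    import numpy as np
    from scipy.stats import norm

    def em_two_groups(z, n_iter=200):
        # EM for z ~ pi0 * N(0, 1) + (1 - pi0) * N(mu, sigma^2): returns the
        # estimated null proportion and the posterior null probability of each z.
        pi0, mu, sigma = 0.9, float(np.mean(z)), float(np.std(z)) + 1.0
        for _ in range(n_iter):
            f0 = pi0 * norm.pdf(z, 0.0, 1.0)
            f1 = (1.0 - pi0) * norm.pdf(z, mu, sigma)
            w = f0 / (f0 + f1)                            # E-step: posterior null probabilities
            pi0 = w.mean()                                # M-step: update mixture parameters
            mu = np.sum((1 - w) * z) / np.sum(1 - w)
            sigma = np.sqrt(np.sum((1 - w) * (z - mu) ** 2) / np.sum(1 - w))
        return pi0, w

    rng = np.random.default_rng(6)
    z = np.concatenate([rng.normal(0, 1, 1800), rng.normal(3, 1, 200)])   # 10% non-null genes
    pi0_hat, post_null = em_two_groups(z)
    rejected = post_null < 0.2
    print("estimated pi0:", round(float(pi0_hat), 2),
          "| estimated Bayesian FDR of the rejection region:", round(float(post_null[rejected].mean()), 3))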
APA, Harvard, Vancouver, ISO, and other styles
39

Lin, Jian-Ping, and 林建平. "A Note on False Discovery Rate." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/03432627778278009898.

Full text
Abstract:
Master's thesis
National Dong Hwa University
Department of Applied Mathematics
97
Recent applications, particularly in genomics and imaging, call for performing a large number of hypothesis tests at the same time. The False Discovery Rate (FDR) has been proposed and recognized as a powerful criterion in these contexts. Approximations of the FDR, such as the local false discovery rate (lfdr), have been proposed and justified using two-group models, for example in Efron (2007a). A generalization of two-group models is proposed. Under this framework, we study various approximations of the false discovery rate and their validity. The connection with skew normality is also addressed.
APA, Harvard, Vancouver, ISO, and other styles
40

Dickhaus, Thorsten-Ingo [Verfasser]. "False discovery rate and asymptotics / vorgelegt von Thorsten-Ingo Dickhaus." 2008. http://d-nb.info/987358731/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

Han, Bing. "A Bayesian approach to false discovery rate for large scale simultaneous inference." 2007. http://etda.libraries.psu.edu/theses/approved/WorldWideIndex/ETD-2014/index.html.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Jiao, Shuo. "Detecting differentially expressed genes while controlling the false discovery rate for microarray data." 2009. http://proquest.umi.com/pqdweb?did=1921650101&sid=12&Fmt=2&clientId=14215&RQT=309&VName=PQD.

Full text
Abstract:
Thesis (Ph.D.)--University of Nebraska-Lincoln, 2009.
Title from title screen (site viewed March 2, 2010). PDF text: 100 p. : col. ill. ; 953 K. UMI publication number: AAT 3379821. Includes bibliographical references. Also available in microfilm and microfiche formats.
APA, Harvard, Vancouver, ISO, and other styles
43

"Regaining control of false findings in feature selection, classification, and prediction on neuroimaging and genomics data." Tulane University, 2018.

Find full text
Abstract:
The technological advances of past decades have led to the accumulation of large amounts of genomic and neuroimaging data, enabling novel strategies in precision medicine. These largely rely on machine learning algorithms and modern statistical methods for big biological datasets, which are data-driven rather than hypothesis-driven. Such methods often lack guarantees on the validity of the research findings. Because it can be a matter of life and death when computational methods are deployed in clinical practice, establishing guarantees on the validity of the results is essential for the advancement of precision medicine. This thesis proposes several novel sparse regression and sparse canonical correlation analysis techniques, which by design include guarantees on the false discovery rate in variable selection. Variable selection on biomedical data is essential for many areas of healthcare, including precision medicine, population stratification, drug development, and predictive modeling of disease phenotypes. Predictive machine learning models can directly affect the patient when used to aid diagnosis, and therefore they need to be thoroughly evaluated before deployment. We present a novel approach to validly reuse the test data for performance evaluation of predictive models. The proposed methods are validated in applications to large genomic and neuroimaging datasets, where they confirm results from previous studies and also lead to new biological insights. In addition, this work puts a focus on making the proposed methods widely available to the scientific community through the release of free and open-source scientific software.
Alexej Gossmann
APA, Harvard, Vancouver, ISO, and other styles
44

Clarke, Sandra Jane. "The performance of multiple hypothesis testing procedures in the presence of dependence." 2010. http://repository.unimelb.edu.au/10187/7284.

Full text
Abstract:
Hypothesis testing is foundational to the discipline of statistics. Procedures exist which control for individual Type I error rates and more global or family-wise error rates for a series of hypothesis tests. However, the ability of scientists to produce very large data sets with increasing ease has led to a rapid rise in the number of statistical tests performed, often with small sample sizes. This is seen particularly in the area of biotechnology and the analysis of microarray data. This thesis considers this high-dimensional context with particular focus on the effects of dependence on existing multiple hypothesis testing procedures.
While dependence is often ignored, there are many existing techniques employed currently to deal with this context but these are typically highly conservative or require difficult estimation of large correlation matrices. This thesis demonstrates that, in this high-dimensional context when the distribution of the test statistics is light-tailed, dependence is not as much of a concern as in the classical contexts. This is achieved with the use of a moving average model. One important implication of this is that, when this is satisfied, procedures designed for independent test statistics can be used confidently on dependent test statistics.
This is not the case however for heavy-tailed distributions, where we expect an asymptotic Poisson cluster process of false discoveries. In these cases, we estimate the parameters of this process along with the tail-weight from the observed exceedances and attempt to adjust the procedures accordingly. We consider both conservative error rates such as the family-wise error rate and more popular methods such as the false discovery rate. We are able to demonstrate that, in the context of DNA microarrays, it is rare to find heavy-tailed distributions because most test statistics are averages.
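The moving-average setting can be mimicked in a few lines: generate dependent, light-tailed null statistics from an MA(5) model, apply the Benjamini-Hochberg procedure, and track the false discovery proportion over repetitions (a small simulation in the spirit of the argument, with arbitrary parameter choices):

    import numpy as np
    from scipy.stats import norm

    def bh_reject(p, q=0.1):
        # Benjamini-Hochberg rejections at nominal FDR level q
        order = np.argsort(p)
        below = np.nonzero(p[order] <= q * np.arange(1, len(p) + 1) / len(p))[0]
        rej = np.zeros(len(p), dtype=bool)
        if below.size:
            rej[order[:below.max() + 1]] = True
        return rej

    rng = np.random.default_rng(7)
    m, n_alt, reps, fdp = 2000, 100, 200, []
    for _ in range(reps):
        eps = rng.normal(size=m + 5)
        z = np.convolve(eps, np.ones(6) / np.sqrt(6), mode="valid")   # MA(5) statistics: dependent, light-tailed
        z[:n_alt] += 3.0                                              # shift the first 100 (non-null) statistics
        rej = bh_reject(2 * norm.sf(np.abs(z)))
        fdp.append(rej[n_alt:].sum() / max(1, rej.sum()))             # false discovery proportion for this run
    print("mean FDP under MA(5) dependence:", round(float(np.mean(fdp)), 3))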
APA, Harvard, Vancouver, ISO, and other styles
45

Leap, Katie. "Multiple Testing Correction with Repeated Correlated Outcomes: Applications to Epigenetics." 2017. https://scholarworks.umass.edu/masters_theses_2/559.

Full text
Abstract:
Epigenetic changes (specifically DNA methylation) have been associated with adverse health outcomes; however, unlike genetic markers that are fixed over the lifetime of an individual, methylation can change. Given that there are a large number of methylation sites, measuring them repeatedly introduces multiple testing problems beyond those that exist in a static genetic context. Using simulations of epigenetic data, we considered different methods of controlling the false discovery rate. We considered several underlying associations between an exposure and methylation over time. We found that testing each site with a linear mixed effects model and then controlling the false discovery rate (FDR) had the highest positive predictive value (PPV), a low number of false positives, and was able to differentiate between differential methylation that was present at only one time point vs. a persistent relationship. In contrast, methods that controlled FDR at a single time point and ad hoc methods tended to have lower PPV, more false positives, and/or were unable to differentiate these conditions. Validation in data obtained from Project Viva found a difference between fitting longitudinal models only to sites significant at one time point and fitting all sites longitudinally.
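A small sketch of the recommended strategy (one linear mixed effects model per site, then FDR control across sites), using statsmodels and simulated methylation-like data; the variable names, effect sizes, and Wald-type p-values are illustrative choices, not the study's actual analysis:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from scipy.stats import norm
    from statsmodels.stats.multitest import multipletests

    rng = np.random.default_rng(8)
    n_subj, n_times, n_sites = 60, 3, 20
    subj = np.repeat(np.arange(n_subj), n_times)          # subject id for each observation
    time = np.tile(np.arange(n_times), n_subj)            # repeated measurement occasions
    exposure = np.repeat(rng.binomial(1, 0.5, n_subj), n_times)

    pvals = []
    for site in range(n_sites):
        effect = 0.8 if site < 5 else 0.0                 # the first 5 sites are truly associated
        y = (effect * exposure
             + 0.5 * rng.normal(size=n_subj)[subj]        # subject-level random intercept
             + 0.3 * rng.normal(size=n_subj * n_times))   # residual noise
        d = pd.DataFrame({"y": y, "exposure": exposure, "time": time, "subject": subj})
        fit = smf.mixedlm("y ~ exposure + time", d, groups=d["subject"]).fit()
        z = fit.params["exposure"] / fit.bse["exposure"]  # Wald statistic for the exposure effect
        pvals.append(2 * norm.sf(abs(z)))

    reject, _, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
    print("sites declared associated after FDR control:", np.flatnonzero(reject).tolist())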
APA, Harvard, Vancouver, ISO, and other styles
46

Buschmann, Tilo. "The Systematic Design and Application of Robust DNA Barcodes." Doctoral thesis, 2015. https://ul.qucosa.de/id/qucosa%3A14951.

Full text
Abstract:
High-throughput sequencing technologies are improving in quality, capacity, and costs, providing versatile applications in DNA and RNA research. For small genomes or fraction of larger genomes, DNA samples can be mixed and loaded together on the same sequencing track. This so-called multiplexing approach relies on a specific DNA tag, index, or barcode that is attached to the sequencing or amplification primer and hence accompanies every read. After sequencing, each sample read is identified on the basis of the respective barcode sequence. Alterations of DNA barcodes during synthesis, primer ligation, DNA amplification, or sequencing may lead to incorrect sample identification unless the error is revealed and corrected. This can be accomplished by implementing error correcting algorithms and codes. This barcoding strategy increases the total number of correctly identified samples, thus improving overall sequencing efficiency. Two popular sets of error-correcting codes are Hamming codes and codes based on the Levenshtein distance. Levenshtein-based codes operate only on words of known length. Since a DNA sequence with an embedded barcode is essentially one continuous long word, application of the classical Levenshtein algorithm is problematic. In this thesis we demonstrate the decreased error correction capability of Levenshtein-based codes in a DNA context and suggest an adaptation of Levenshtein-based codes that is proven of efficiently correcting nucleotide errors in DNA sequences. In our adaptation, we take any DNA context into account and impose more strict rules for the selection of barcode sets. In simulations we show the superior error correction capability of the new method compared to traditional Levenshtein and Hamming based codes in the presence of multiple errors. We present an adaptation of Levenshtein-based codes to DNA contexts capable of guaranteed correction of a pre-defined number of insertion, deletion, and substitution mutations. Our improved method is additionally capable of correcting on average more random mutations than traditional Levenshtein-based or Hamming codes. As part of this work we prepared software for the flexible generation of DNA codes based on our new approach. To adapt codes to specific experimental conditions, the user can customize sequence filtering, the number of correctable mutations and barcode length for highest performance. However, not every platform is susceptible to a large number of both indel and substitution errors. The Illumina “Sequencing by Synthesis” platform shows a very large number of substitution errors as well as a very specific shift of the read that results in inserted and deleted bases at the 5’-end and the 3’-end (which we call phaseshifts). We argue in this scenario that the application of Sequence-Levenshtein-based codes is not efficient because it aims for a category of errors that barely occurs on this platform, which reduces the code size needlessly. As a solution, we propose the “Phaseshift distance” that exclusively supports the correction of substitutions and phaseshifts. Additionally, we enable the correction of arbitrary combinations of substitution and phaseshift errors. Thus, we address the lopsided number of substitutions compared to phaseshifts on the Illumina platform. To compare codes based on the Phaseshift distance to Hamming Codes as well as codes based on the Sequence-Levenshtein distance, we simulated an experimental scenario based on the error pattern we identified on the Illumina platform. 
Furthermore, we generated a large number of different sets of DNA barcodes using the Phaseshift distance and compared codes of different lengths and error correction capabilities. We found that codes based on the Phaseshift distance can correct a number of errors comparable to codes based on the Sequence-Levenshtein distance while offering the number of DNA barcodes comparable to Hamming codes. Thus, codes based on the Phaseshift distance show a higher efficiency in the targeted scenario. In some cases (e.g., with PacBio SMRT in Continuous Long Read mode), the position of the barcode and DNA context is not well defined. Many reads start inside the genomic insert so that adjacent primers might be missed. The matter is further complicated by coincidental similarities between barcode sequences and reference DNA. Therefore, a robust strategy is required in order to detect barcoded reads and avoid a large number of false positives or negatives. For mass inference problems such as this one, false discovery rate (FDR) methods are powerful and balanced solutions. Since existing FDR methods cannot be applied to this particular problem, we present an adapted FDR method that is suitable for the detection of barcoded reads as well as suggest possible improvements.
APA, Harvard, Vancouver, ISO, and other styles
47

Gültas, Mehmet. "Development of novel Classical and Quantum Information Theory Based Methods for the Detection of Compensatory Mutations in MSAs." Doctoral thesis, 2013. http://hdl.handle.net/11858/00-1735-0000-0022-5EB0-1.

Full text
Abstract:
Multiple sequence alignments (MSAs) of homologous proteins are useful tools for characterizing compensatory mutations between non-conserved residues. Identifying these residues in MSAs is an important task for better understanding the structural basis and molecular mechanisms of protein functions. Despite the large body of literature on compensatory mutations and on sequence conservation analysis for the detection of important residues, previous methods have mostly not taken into account the biochemical properties of amino acids, which can, however, be decisive for detecting compensatory mutation signals. Moreover, compensatory mutation signals in MSAs are often distorted by noise. A further problem in bioinformatics is therefore the separation of significant signals from phylogenetic noise and unrelated pair signals. The aim of this work is to develop methods that integrate biochemical properties such as similarities and dissimilarities of amino acids into the identification of compensatory mutations and that deal with the noise. To this end, we develop different methods based on classical and quantum information theory as well as multiple testing procedures. Our first method is based on classical information theory; it mainly considers BLOSUM62-dissimilar pairs of amino acids as a model of compensatory mutations and integrates them into the identification of important residues. To complement this method, we develop a second method using the foundations of quantum information theory, which differs from the first by simultaneously modelling similar and dissimilar signals in the compensatory mutation analysis. Furthermore, in order to separate significant signals from the noise, we develop an MSA-specific statistical model in the context of multiple testing. We apply our methods to two human proteins, namely the epidermal growth factor receptor (EGFR) and glucokinase (GCK). The results show that the MSA-specific statistical model can separate the significant signals from phylogenetic noise and unrelated pair signals. Considering only BLOSUM62-dissimilar pairs of amino acids, the first method successfully identifies the disease-associated important residues of both proteins. In contrast, through the simultaneous modelling of similar and dissimilar amino-acid pair signals, the second method is more sensitive for identifying catalytic and allosteric residues.
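As a point of reference for the information-theoretic machinery described above, the following toy sketch computes the mutual information between two alignment columns, the classical starting point for detecting co-evolving (compensatory) positions; the BLOSUM62-based similarity weighting and the quantum-information extension of the thesis are not reproduced, and the columns are invented:

    import numpy as np
    from collections import Counter

    def column_mi(col_i, col_j):
        # mutual information (in bits) between two alignment columns
        n = len(col_i)
        pi, pj = Counter(col_i), Counter(col_j)
        pij = Counter(zip(col_i, col_j))
        mi = 0.0
        for (a, b), nab in pij.items():
            p_ab = nab / n
            mi += p_ab * np.log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
        return mi

    # invented columns for eight sequences
    col1 = list("AADDAADD")
    col2 = list("LLKKLLKK")   # covaries perfectly with col1: MI = 1 bit
    col3 = list("GGGGGGGG")   # fully conserved: MI = 0
    print(round(column_mi(col1, col2), 3), round(column_mi(col1, col3), 3))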
APA, Harvard, Vancouver, ISO, and other styles