Dissertations / Theses on the topic 'Biometry; Breeding – Statistical methods'


Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 42 dissertations / theses for your research on the topic 'Biometry; Breeding – Statistical methods.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Thompson, Robin. "Statistical methods and applications to animal breeding." Thesis, University of Edinburgh, 1987. http://hdl.handle.net/1842/30836.

Abstract:
This thesis comprises a collection of 39 research papers divided into three groups. The first group is entitled 'Statistical Methods, including variance component estimation with general application'. The second group reports on 'Application of statistical methods to animal breeding studies'. The last group, 'Experimental Studies', reports on studies of animal breeding data in beef and dairy cattle. The major theme of Group I is variance component estimation and the introduction of a method, now known as REML (Residual Maximum Likelihood), that unifies the area. The method was introduced for the analysis of incomplete block designs with unequal block size but was found to have important applications in the analysis of groups of trials, time series, multivariate data and the detection of outliers. The work on variance components has applications to animal breeding and is discussed in Group II. Papers discuss efficient designs for the estimation of genetic parameters, including heritability, maternal and multivariate genetic parameters. These designs can lead to substantial reductions in the variances of the parameters over classical designs. It is shown that REML can be applied in certain circumstances when there is selection of animals. Links between variance estimation and best linear unbiased prediction are explored. Methods of prediction, estimation of genetic parameters and optimal designs are given for non-normal data. The last group includes reports on the comparison of breeds and cross-breeding in beef cattle in Zambia. Other studies include estimating the genetic relationship between beef and dairy characters in British Friesian cattle. The validity of models used in dairy sire evaluation is investigated, including the heterogeneity of heritability of milk yield at different levels of production and a novel method for taking account of environmental variation within herds.
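REML's core idea, maximizing the likelihood of error contrasts so that variance-component estimates are not biased by the estimation of fixed effects, can be illustrated numerically. Below is a minimal sketch (not Thompson's implementation) for a one-way random-effects model, with simulated data and illustrative parameter values:

```python
# Numerical REML for a one-way random-effects model: y = mu + Z u + e.
# A minimal sketch for illustration; real software exploits sparse structure.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n_groups, n_per = 20, 5
Z = np.kron(np.eye(n_groups), np.ones((n_per, 1)))   # group incidence matrix
X = np.ones((n_groups * n_per, 1))                   # intercept only
u = rng.normal(0, np.sqrt(2.0), n_groups)            # true sigma2_u = 2
y = 10.0 + Z @ u + rng.normal(0, 1.0, n_groups * n_per)  # true sigma2_e = 1

def neg_reml(log_var):
    s2u, s2e = np.exp(log_var)
    V = s2u * Z @ Z.T + s2e * np.eye(len(y))
    Vinv = np.linalg.inv(V)
    XtVX = X.T @ Vinv @ X
    beta = np.linalg.solve(XtVX, X.T @ Vinv @ y)
    r = y - X @ beta
    # REML log-likelihood: the log|X'V^{-1}X| term accounts for fixed effects
    ll = -0.5 * (np.linalg.slogdet(V)[1] + np.linalg.slogdet(XtVX)[1] + r @ Vinv @ r)
    return -ll

fit = minimize(neg_reml, x0=np.log([1.0, 1.0]), method="Nelder-Mead")
print("REML estimates (sigma2_u, sigma2_e):", np.exp(fit.x))
```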
2

Batidzirai, Jesca Mercy. "Randomization in a two armed clinical trial: an overview of different randomization techniques." Thesis, University of Fort Hare, 2011. http://hdl.handle.net/10353/395.

Abstract:
Randomization is the key element of any sensible clinical trial. It is the only way we can be sure that patients have been allocated to the treatment groups without bias and that the treatment groups are closely similar before the start of the trial. The randomization scheme used to allocate patients to the treatment groups plays a central role in achieving this goal. This study uses SAS simulations to perform categorical data analysis and to compare two main randomization schemes in dental studies with small samples: unrestricted randomization (simple randomization) and restricted randomization (the minimization method). Results show that minimization produces almost equally sized treatment groups, whereas simple randomization is weak at balancing prognostic factors. Nevertheless, simple randomization can also produce balanced groups, even in small samples, by chance. Statistical power is also higher under minimization than under simple randomization, but bigger samples might be needed to boost power.
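For readers unfamiliar with the two schemes compared here, the following sketch contrasts simple randomization with a Pocock-Simon-style minimization rule. The prognostic factors, equal factor weights, and random tie-breaking are illustrative assumptions, not the thesis's SAS code:

```python
# Simple randomization vs. minimization for a two-armed trial (illustrative).
import random

def simple_randomization(n):
    # Each patient assigned by an independent fair coin flip.
    return [random.choice("AB") for _ in range(n)]

def minimization(patients):
    """Assign each patient (a dict of prognostic factor levels) to the arm
    that minimizes the total imbalance over that patient's factor levels."""
    counts = {}   # (factor, level, arm) -> current count
    arms = []
    for pat in patients:
        imb = {}
        for arm in "AB":
            other = "B" if arm == "A" else "A"
            imb[arm] = sum(counts.get((f, lvl, arm), 0) + 1
                           - counts.get((f, lvl, other), 0)
                           for f, lvl in pat.items())
        best = random.choice("AB") if imb["A"] == imb["B"] else min("AB", key=imb.get)
        for f, lvl in pat.items():
            counts[(f, lvl, best)] = counts.get((f, lvl, best), 0) + 1
        arms.append(best)
    return arms

patients = [{"sex": random.choice("MF"), "age": random.choice(["<40", ">=40"])}
            for _ in range(30)]
print("simple:   ", "".join(simple_randomization(30)))
print("minimized:", "".join(minimization(patients)))
```

Running this repeatedly shows the pattern reported above: minimization keeps both arm sizes and factor-level counts close, while simple randomization balances them only on average.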
3

Wong, Chun-mei May, and 王春美. "Multilevel models for survival analysis in dental research." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2005. http://hub.hku.hk/bib/B3637216X.

4

李友榮 and Yau-wing Lee. "Modelling multivariate survival data using semiparametric models." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2000. http://hub.hku.hk/bib/B4257528X.

5

Long, Yongxian, and 龙泳先. "Semiparametric analysis of interval censored survival data." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2010. http://hub.hku.hk/bib/B45541152.

6

Yeatts, Sharon Dziuba. "Statistical Methods and Experimental Design for Inference Regarding Dose and/or Interaction Thresholds Along a Fixed-Ratio Ray." VCU Scholars Compass, 2006. http://hdl.handle.net/10156/1956.

7

Buntaran, Harimurti [Verfasser], and Hans-Peter [Akademischer Betreuer] Piepho. "Statistical methods for analysis of multienvironment trials in plant breeding : accuracy and precision / Harimurti Buntaran ; Betreuer: Hans-Peter Piepho." Hohenheim : Kommunikations-, Informations- und Medienzentrum der Universität Hohenheim, 2021. http://nbn-resolving.de/urn:nbn:de:bsz:100-opus-19265.

8

Forster, Jeri E. "Varying-coefficient models for longitudinal data : piecewise-continuous, flexible, mixed-effects models and methods for analyzing data with nonignorable dropout /." Connect to full text via ProQuest. Limited to UCD Anschutz Medical Campus, 2006.

Thesis (Ph.D. in Biostatistics), University of Colorado at Denver and Health Sciences Center, 2006. Typescript. Includes bibliographical references (leaves 72-75). Free to the UCD Anschutz Medical Campus; online version available via ProQuest Digital Dissertations.
9

Gates, Peter J. "Analyzing categorical traits in domestic animal data collected in the field /." Uppsala : Swedish Univ. of Agricultural Sciences (Sveriges lantbruksuniv.), 1999. http://epsilon.slu.se/avh/1999/91-576-5473-5.pdf.

10

Pook, Torsten [Verfasser], Henner [Akademischer Betreuer] Simianer, Henner [Gutachter] Simianer, Timothy Mathes [Gutachter] Beissinger, and Hans-Peter [Gutachter] Piepho. "Methods and software to enhance statistical analysis in large scale problems in breeding and quantitative genetics / Torsten Pook ; Gutachter: Henner Simianer, Timothy Mathes Beissinger, Hans-Peter Piepho ; Betreuer: Henner Simianer." Göttingen : Niedersächsische Staats- und Universitätsbibliothek Göttingen, 2019. http://d-nb.info/1199608254/34.

11

Girabent i Farrés, Montserrat. "Aplicació dels models de Thurstone i de Bradley-Terry a l'anàlisi de dades "ranking" obtingudes de mesures de preferència en escala ipsativa." Doctoral thesis, Universitat de Barcelona, 2013. http://hdl.handle.net/10803/277566.

Abstract:
The research arose from an interest in measuring individuals' preferences when they are asked to rank a list of options, whether behaviours or objects, thereby obtaining ranking data. The individual is forced to establish an order among his or her preferences, giving rise to what is known as an ordinal ipsative measurement scale. Compared with a normative scale such as a Likert scale, this type of measurement has the advantage of reducing the probability of the well-known acquiescence bias and of eliminating the halo-and-horn effect. On the other hand, the main characteristic of the response vector is that the sum of its components is always the same constant, which complicates the analysis of the data. The first objective was to review the statistical models for analysing ranking data measured on an ipsative scale that provide information about the discrimination process. The second was to extend these models to repeated measures of individuals' choices and/or to covariates describing characteristics of the individuals themselves or of the alternatives to be ranked. The first theory establishing the use of an ipsative scale is Thurstone's law of comparative judgement (1927), which postulates that when an individual is asked to make a judgement, a discriminal process takes place on a psychological continuum. It is this unobserved continuous scale that is of interest, in order to obtain the preference profile in terms of the order of, and distances between, the options. The methodology evaluated for finding solutions on a continuous interval scale rests on two approaches. The first, developed by Böckenholt's group (1991-2006), is based on the classical models developed by Thurstone in 1931, in which the ordered observations are expressed as differences of the latent variables underlying each of the items being compared. Imposing the restrictions proposed by Maydeu-Olivares (2005) on the covariance matrix yields a particular case of a structural equation model (SEM), which allows estimation of the means of the latent variables corresponding to the position of each option on the continuous interval scale. However, the solution depends on the latent variables satisfying a normality condition, the algorithm fails to find a solution beyond a certain number of options, and the model cannot accommodate repeated measures. The second approach comprises the work of Dittrich's group (1998-2012), based on the Bradley-Terry models (BTM) of 1952. The BTM assume that each paired judgement follows a binomial distribution, so that, working directly with the contingency table, the likelihood function can be expressed as a general log-linear model (LLBTM). It is from this second model, and its extensions for covariates, that we propose the extension to the case of repeated measures. The various methodological proposals were tested on simulated data and on two real examples from health sciences education: one studying the learning-style preferences (Canfield test) of medical students, and one assessing whether physiotherapy students' opinions of self-directed learning activities differ before and after carrying them out. Conclusions:
• The difference between the Thurstone and Bradley-Terry approaches lies in the distribution underlying the likelihood function.
• The LLBTM allows modifications of its conditions of application that give rise to the extensions of the model incorporating covariates.
• The LLBTM admits an extension in which the comparisons between options are not independent, giving rise to the models for repeated measures.
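As an illustration of the Bradley-Terry model underlying the LLBTM, the sketch below fits worth parameters to simulated paired comparisons (the pairs into which a ranking is decomposed) via logistic regression with difference coding. This is one standard way to fit the basic model, not the log-linear extension developed in the thesis:

```python
# Bradley-Terry via logistic regression: P(i beats j) = logit^{-1}(w_i - w_j).
import numpy as np
from sklearn.linear_model import LogisticRegression

items = ["opt1", "opt2", "opt3", "opt4"]           # illustrative option names
rng = np.random.default_rng(0)
true_worth = np.array([1.5, 0.5, 0.0, -1.0])       # latent preference scale

X, wins = [], []
for _ in range(200):                               # 200 judges, all 6 pairs each
    for i in range(4):
        for j in range(i + 1, 4):
            p = 1 / (1 + np.exp(-(true_worth[i] - true_worth[j])))
            row = np.zeros(4); row[i], row[j] = 1, -1   # difference coding
            X.append(row); wins.append(int(rng.random() < p))

fit = LogisticRegression(fit_intercept=False, C=100.0).fit(np.array(X), wins)
est = fit.coef_[0] - fit.coef_[0].mean()           # center for identifiability
print(dict(zip(items, np.round(est, 2))))
```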
12

Studeny, Angelika C. "Quantifying biodiversity trends in time and space." Thesis, University of St Andrews, 2012. http://hdl.handle.net/10023/3414.

Abstract:
The global loss of biodiversity calls for robust large-scale diversity assessment. Biological diversity is a multi-faceted concept; defined as the “variety of life”, answering questions such as “How much is there?” or more precisely “Have we succeeded in reducing the rate of its decline?” is not straightforward. While various aspects of biodiversity give rise to numerous ways of quantification, we focus on temporal (and spatial) trends and their changes in species diversity. Traditional diversity indices summarise information contained in the species abundance distribution, i.e. each species' proportional contribution to total abundance. Estimated from data, these indices can be biased if variation in detection probability is ignored. We discuss differences between diversity indices and demonstrate possible adjustments for detectability. Additionally, most indices focus on the most abundant species in ecological communities. We introduce a new set of diversity measures, based on a family of goodness-of-fit statistics. A function of a free parameter, this family allows us to vary the sensitivity of these measures to dominance and rarity of species. Their performance is studied by assessing temporal trends in diversity for five communities of British breeding birds based on 14 years of survey data, where they are applied alongside the current headline index, a geometric mean of relative abundances. Revealing the contributions of both rare and common species to biodiversity trends, these "goodness-of-fit" measures provide novel insights into how ecological communities change over time. Biodiversity is not only subject to temporal changes, but it also varies across space. We take first steps towards estimating spatial diversity trends. Finally, processes maintaining biodiversity act locally, at specific spatial scales. Contrary to abundance-based summary statistics, spatial characteristics of ecological communities may distinguish these processes. We suggest a generalisation to a spatial summary, the cross-pair overlap distribution, to render it more flexible to spatial scale.
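To make the contrast between index families concrete, here is a minimal sketch of abundance-based diversity summaries for a years-by-species count matrix. The power-divergence form below is the general shape of goodness-of-fit statistics; the thesis's exact measures may differ in detail, and the counts are invented:

```python
# Diversity summaries over time from a years-by-species count matrix.
import numpy as np

counts = np.array([[120, 80, 40, 10, 5],    # year 1 counts for 5 species
                   [100, 90, 30,  6, 2],    # year 2
                   [ 90, 95, 25,  4, 1]])   # year 3

p = counts / counts.sum(axis=1, keepdims=True)   # relative abundances per year
shannon = -(p * np.log(p)).sum(axis=1)
# Headline-style index: geometric mean of abundances relative to the base year.
geo_mean = np.exp(np.log(counts / counts[0]).mean(axis=1))

def power_divergence(p_row, lam):
    """Goodness-of-fit statistic against the even community p0 = 1/S.
    The free parameter lam tunes sensitivity: large lam stresses dominant
    species, lam near -1 stresses rare ones."""
    p0 = 1.0 / len(p_row)
    return 2 / (lam * (lam + 1)) * (p_row * ((p_row / p0) ** lam - 1)).sum()

for lam in (-0.5, 1.0, 2.0):
    print("lam =", lam, [round(power_divergence(row, lam), 3) for row in p])
print("Shannon:", shannon.round(3), "| geometric-mean index:", geo_mean.round(3))
```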
13

Pierce, Brian Thomas. "The influence of the environment on the volume growth, stem form and disease tolerance of Eucalyptus grandis clones in the summer rainfall areas of South Africa." Thesis, Stellenbosch : Stellenbosch University, 2000. http://hdl.handle.net/10019.1/51988.

Abstract:
Thesis (MPhil)--Stellenbosch University, 2000.
A thesis undertaken to quantify genotype-by-environment interaction (GEI) within Eucalyptus grandis clones growing in the eastern portion of South Africa. Thirty-one sites were selected to represent the "traditional" E. grandis growing areas of South Africa. Eleven common macro-site variables and twelve common micro-site soil variables were recorded at each site. Twenty-seven E. grandis clones and four E. grandis hybrid clones were then evaluated over these 31 sites. An incomplete Latin square design was used to evaluate the 31 test clones, and five E. grandis controls were incorporated into the trial design to link the 31 sites. Volume production, stem form, stem defects and survival were assessed at two and five years, as well as the infestation of three stem canker diseases at five years. The analytical methods used to evaluate and quantify the GEI portion of the study were the analysis of variance (ANOVA), correlation analysis, and joint regression analysis (JRA) together with the analysis of covariance (ACOVAR). The growth-site associations for volume production, stem form and Endothia disease infestation were investigated using factor analysis (FA), and equations were derived for the species and for the individual clones using a stepwise multiple regression approach. GEI, as evaluated through JRA, revealed that an increase in site productivity led to a positive linear response in productivity at the clonal level, and that there was a diverging or fanning pattern among the regression lines of the clones. This tendency was also observed for both stem form and Endothia infestation. Hence, no significant changes in the rankings of the clones were found, and only the relative differences between the clones were found to change significantly. Juvenile-mature genetic correlations for volume production and stem form showed moderate correlations (rg = 0.66 and rg = 0.70) between the two- and five-year assessments. At the species level, rainfall was the main environmental factor responsible for volume production, while latitude was the main influence on stem form and Endothia infestation. On an individual clone basis, some micro-site soil factor interaction within the clones was found in the growth-site response models. Keywords: Eucalyptus grandis, genotype-environment interaction, clones, site factors, growth-site response, ANOVA, ACOVAR, GEI, FA, JRA
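The joint regression analysis referred to above regresses each genotype's performance on an environmental index, typically the site mean, so a slope near 1 indicates average stability and a fanning pattern of slopes indicates GEI. A minimal sketch on simulated data (not the thesis's analysis; clone and site counts are only loosely matched to the study):

```python
# Finlay-Wilkinson-style joint regression for genotype-by-environment interaction.
import numpy as np

rng = np.random.default_rng(7)
n_clones, n_sites = 6, 31
site_effect = rng.normal(0, 4, n_sites)
clone_effect = rng.normal(0, 2, n_clones)
sensitivity = rng.uniform(0.6, 1.4, n_clones)        # true GEI: fanning slopes

# volume[i, j]: performance of clone i at site j
volume = (20 + clone_effect[:, None]
          + sensitivity[:, None] * site_effect[None, :]
          + rng.normal(0, 1, (n_clones, n_sites)))

site_index = volume.mean(axis=0)                      # environmental index
for i in range(n_clones):
    b, a = np.polyfit(site_index, volume[i], 1)       # slope, intercept
    label = "responsive" if b > 1.05 else "average" if b > 0.95 else "stable"
    print(f"clone {i}: slope = {b:.2f} ({label})")
```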
14

Wang, Ya. "Statistical Methods for Epigenetic Data." Thesis, 2019. https://doi.org/10.7916/d8-9z87-ts07.

Abstract:
DNA methylation plays a crucial role in human health, especially cancer. Traditional DNA methylation analysis aims to identify CpGs/genes with differential methylation (DM) between experimental groups. Differential variability (DV) has more recently been observed to contribute to cancer heterogeneity and has also been shown to be essential in detecting early DNA methylation alterations, notably epigenetic field defects. Moreover, studies have demonstrated that environmental factors may modify the effect of DNA methylation on health outcomes, or vice versa. Therefore, this dissertation seeks to develop new statistical methods for epigenetic data, focusing on DV and interactions, where efficient analytical tools are lacking. First, as neighboring CpG sites are usually highly correlated, we introduce a new method to detect differentially methylated regions (DMRs) that uses combined DM and DV signals between diseased and non-diseased groups. Next, using both DM and DV signals, we consider the problem of identifying epigenetic field defects when CpG-site-level DM and DV signals are minimal and hard to detect with existing methods. We propose a weighted epigenetic distance-based method that accumulates CpG-site-level DM and DV signals in a gene. Here DV signals are captured by a pseudo-data matrix constructed from centered quadratic methylation measures. CpG-site-level association signal annotations are introduced as weights in the distance calculations to up-weight signal CpGs and down-weight noise CpGs, further boosting the study power. Lastly, we extend the weighted epigenetic distance-based method to incorporate DNA methylation by environment interactions in the detection of overall association between DNA methylation and health outcomes. A pseudo-data matrix is constructed with cross-product terms between DNA methylation and environmental factors that is able to capture their interactions. The superior performance of the proposed methods is shown through intensive simulation studies and real data applications to multiple DNA methylation data sets.
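A minimal sketch of the weighted-distance idea described above: DV signals are captured by a pseudo-matrix of centered squared methylation values, and CpG-level weights up-weight signal sites in the group distance. The weights and data are simulated; this is an illustration, not the author's implementation:

```python
# Weighted DM and DV distances between case and control methylation profiles.
import numpy as np

rng = np.random.default_rng(3)
n_case, n_ctrl, n_cpg = 40, 40, 12

ctrl = rng.normal(0.3, 0.05, (n_ctrl, n_cpg))
case = rng.normal(0.3, 0.05, (n_case, n_cpg))
case[:, :3] += rng.normal(0.05, 0.10, (n_case, 3))   # CpGs 0-2 carry DM + DV signal

X = np.vstack([case, ctrl])
dv = (X - X.mean(axis=0)) ** 2          # pseudo-matrix of centered quadratic values

w = np.ones(n_cpg); w[:3] = 3.0         # annotation weights (illustrative)
w /= w.sum()

def group_dist(A, B, w):
    """Weighted Euclidean distance between group centroids."""
    return np.sqrt((w * (A.mean(axis=0) - B.mean(axis=0)) ** 2).sum())

print(f"weighted DM distance = {group_dist(X[:n_case], X[n_case:], w):.4f}, "
      f"DV distance = {group_dist(dv[:n_case], dv[n_case:], w):.4f}")
```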
15

Qiu, Xin. "Statistical Learning Methods for Personalized Medicine." Thesis, 2018. https://doi.org/10.7916/D85X3SCG.

Abstract:
The theme of this dissertation is to develop simple and interpretable individualized treatment rules (ITRs) using statistical learning methods to assist personalized decision making in clinical practice. Considerable heterogeneity in treatment response is observed among individuals with mental disorders. Administering an individualized treatment rule according to patient-specific characteristics offers an opportunity to tailor treatment strategies to improve response. Black-box machine learning methods for estimating ITRs may produce treatment rules that have optimal benefit but lack transparency and interpretability. Barriers to implementing personalized treatments in clinical psychiatry include a lack of evidence-based, clinically interpretable, individualized treatment rules; a lack of diagnostic measures to evaluate candidate ITRs; a lack of power to detect treatment modifiers from a single study; and a lack of reproducibility of treatment rules estimated from single studies. This dissertation contains three parts to tackle these barriers: (1) methods to estimate the best linear ITR with guaranteed performance among the class of linear rules; (2) a tree-based method to improve the performance of a linear ITR fitted from the overall sample and to identify subgroups with a large benefit; and (3) an integrative learning method combining information across trials to provide an integrative ITR with improved efficiency and reproducibility. In the first part of the dissertation, we propose a machine learning method to estimate optimal linear individualized treatment rules for data collected from single-stage randomized controlled trials (RCTs). In clinical practice, an informative and practically useful treatment rule should be simple and transparent. However, because simple rules are likely to be far from optimal, effective methods to construct such rules must guarantee performance, in terms of yielding the best clinical outcome (highest reward), among the class of simple rules under consideration. Furthermore, it is important to evaluate the benefit of the derived rules on the whole sample and in pre-specified subgroups (e.g., vulnerable patients). To achieve both goals, we propose a robust machine learning algorithm that replaces the zero-one loss with an authentic approximation loss (ramp loss) for value maximization, referred to as asymptotically best linear O-learning (ABLO), which estimates a linear treatment rule that is guaranteed to achieve optimal reward among the class of all linear rules. We then develop a diagnostic measure and inference procedure to evaluate the benefit of the obtained rule and compare it with the rules estimated by other methods. We provide theoretical justification for the proposed method and its inference procedure, and we demonstrate via simulations its superior performance when compared to existing methods. Lastly, we apply the proposed method to the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) trial on major depressive disorder (MDD) and show that the estimated optimal linear rule provides a large benefit for mildly depressed and severely depressed patients but manifests a lack of fit for moderately depressed patients. The second part of the dissertation is motivated by the results of the real data analysis in the first part, where the global linear rule estimated by ABLO from the overall sample performs inadequately on the subgroup of moderately depressed patients.
Therefore, we aim to derive a simple and interpretable piecewise linear ITR that maintains a certain optimality and leads to improved benefit in subgroups of patients, as well as in the overall sample. In this work, we propose a tree-based robust learning method to estimate optimal piecewise linear ITRs and identify subgroups of patients with a large benefit. We achieve these goals by simultaneously identifying qualitative and quantitative interactions through a tree model, referred to as the composite interaction tree (CITree). We show that it has improved performance compared to existing methods on both the overall sample and subgroups via extensive simulation studies. Lastly, we fit CITree to the Research Evaluating the Value of Augmenting Medication with Psychotherapy (REVAMP) trial for treating major depressive disorder, where we identified both qualitative and quantitative interactions and subgroups of patients with a large benefit. The third part addresses the low power to identify ITRs, and the difficulty of replicating them, that result from the small sample sizes of single randomized controlled trials. In this work, a novel integrative learning method is developed to synthesize evidence across trials and provide an integrative ITR that improves efficiency and reproducibility. Our method does not require all studies to collect a common set of variables and thus allows information to be combined from ITRs identified in randomized controlled trials with heterogeneous sets of baseline covariates collected from different domains at different resolutions. Depending on the research goal, the integrative learning can be used to enhance a high-resolution ITR by borrowing information from coarsened ITRs, or to improve a coarsened ITR from a high-resolution ITR. With a simple modification, the proposed integrative learning can also be applied to improve the estimation of ITRs for studies with blockwise missing feature variables. We conduct extensive simulation studies to show that our method has improved performance compared to existing methods where only single-trial ITRs are used to learn personalized treatment rules. Lastly, we apply the proposed method to RCTs of major depressive disorder and other comorbid mental disorders. We found that by combining information from two studies, the integrated ITR has a greater benefit and improved efficiency compared to single-trial rules or a universal, non-personalized treatment rule.
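The following sketch illustrates the general idea of value maximization with a ramp surrogate for a linear rule, on simulated data with a known randomization propensity of 1/2. It is an illustration of outcome-weighted learning with a ramp loss, not the ABLO algorithm or its inference procedure:

```python
# Outcome-weighted learning for a linear ITR with a ramp (truncated hinge) loss.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
n, p = 500, 3
X = rng.normal(size=(n, p))
A = rng.choice([-1, 1], n)                      # randomized treatment, P = 1/2
R = 1 + A * np.sign(X[:, 0] + X[:, 1]) + rng.normal(0, 0.5, n)
w = (R - R.min()) / 0.5                         # nonnegative weights / propensity

def ramp(u):                                    # min(1, max(0, 1 - u))
    return np.clip(1 - u, 0, 1)

def obj(beta):                                  # weighted surrogate risk + ridge
    f = X @ beta[:p] + beta[p]
    return np.mean(w * ramp(A * f)) + 0.01 * beta[:p] @ beta[:p]

# The ramp loss is nonconvex, so use a crude multistart for illustration.
best = min((minimize(obj, rng.normal(size=p + 1), method="Nelder-Mead")
            for _ in range(10)), key=lambda r: r.fun)
rule = np.sign(X @ best.x[:p] + best.x[p])
print("IPW value of learned rule:", np.mean(R * (A == rule) / 0.5).round(2))
```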
16

Xie, Shanghong. "Statistical Methods for Constructing Heterogeneous Biomarker Networks." Thesis, 2019. https://doi.org/10.7916/d8-5tzf-0747.

Abstract:
The theme of this dissertation is to construct heterogeneous biomarker networks using graphical models for understanding disease progression and prognosis. Biomarkers may organize into networks of connected regions. Substantial heterogeneity in networks between individuals and subgroups of individuals is observed. The strengths of network connections may vary across subjects depending on subject-specific covariates (e.g., genetic variants, age). In addition, the connectivities between biomarkers, as subject-specific network features, have been found to predict disease clinical outcomes. Thus, it is important to accurately identify biomarker network structure and estimate the strength of connections. Graphical models have been extensively used to construct complex networks. However, the estimated networks are at the population level, not accounting for subjects' covariates. More flexible covariate-dependent graphical models are needed to capture the heterogeneity in subjects and further create new network features to improve prediction of disease clinical outcomes and stratify subjects into clinically meaningful groups. A large number of parameters are required in covariate-dependent graphical models, so regularization must be imposed to handle the high-dimensional parameter space. Furthermore, personalized clinical symptom networks can be constructed to investigate the co-occurrence of clinical symptoms. When there are multiple biomarker modalities, the estimation of a target biomarker network can be improved by incorporating prior network information from the external modality. This dissertation contains four parts to achieve these goals: (1) an efficient l0-norm feature selection method based on augmented and penalized minimization to tackle the high-dimensional parameter space involved in covariate-dependent graphical models; (2) a two-stage approach to identify disease-associated biomarker network features; (3) an application to construct personalized symptom networks; (4) a node-wise biomarker graphical model to leverage the shared mechanism between multi-modality data when external modality data are available. In the first part of the dissertation, we propose a two-stage procedure that approximates l0-norm regularization as closely as possible and solves it with a highly efficient and simple computational algorithm. Advances in high-throughput technologies in genomics and imaging yield unprecedentedly large numbers of prognostic biomarkers. To accommodate the scale of biomarkers and study their association with disease outcomes, penalized regression is often used to identify important biomarkers. The ideal variable selection procedure would search for the best subset of predictors, which is equivalent to imposing an l0-penalty on the regression coefficients. Since this optimization is a non-deterministic polynomial-time hard (NP-hard) problem that does not scale with the number of biomarkers, alternative methods mostly place smooth penalties on the regression parameters, which lead to computationally feasible optimization problems. However, empirical studies and theoretical analyses show that convex approximations of the l0-norm (e.g., l1) do not outperform their l0 counterpart. Progress on l0-norm feature selection has been relatively slow; the main methods are greedy algorithms such as stepwise regression or orthogonal matching pursuit. Penalized regression based on regularizing the l0-norm remains much less explored in the literature.
In this work, inspired by the recently popular augmenting and data-splitting algorithms, including the alternating direction method of multipliers, we propose a two-stage procedure for l0-penalty variable selection, referred to as augmented penalized minimization-L0 (APM-L0). APM-L0 targets the l0-norm as closely as possible while keeping computation tractable, efficient, and simple, which is achieved by iterating between a convex regularized regression and a simple hard-thresholding estimation. The procedure can be viewed as arising from regularized optimization with a truncated l1 norm. Thus, we propose to treat the regularization parameter and the thresholding parameter as tuning parameters selected by cross-validation. A one-step coordinate descent algorithm is used in the first stage to significantly improve computational efficiency. Through extensive simulation studies and a real data application, we demonstrate the superior performance of the proposed method in terms of selection accuracy and computational speed as compared to existing methods. The proposed APM-L0 procedure is implemented in the R package APML0. In the second part of the dissertation, we develop a two-stage method to estimate biomarker networks that account for heterogeneity among subjects and to evaluate the networks' association with disease clinical outcomes. In the first stage, we propose a conditional Gaussian graphical model with mean and precision matrix depending on covariates to obtain subject- or subgroup-specific networks. In the second stage, we evaluate the clinical utility of network measures (connection strengths) estimated from the first stage. The second-stage analysis provides the relative predictive power of between-region network measures on clinical impairment in the context of regional biomarkers and existing disease risk factors. We assess the performance of the proposed method by extensive simulation studies and an application to a Huntington's disease (HD) study to investigate the effect of the HD causal gene on the rate of change in motor symptoms through its effect on brain subcortical and cortical grey matter atrophy connections. We show that cortical network connections and subcortical volumes, but not subcortical connections, are predictive of clinical motor function deterioration. We validate these findings in an independent HD study. Lastly, highly similar patterns seen in the grey matter connections and in a previous white matter connectivity study suggest a shared biological mechanism for HD and support the hypothesis that white matter loss is a direct result of neuronal loss as opposed to the loss of myelin or dysmyelination. In the third part of the dissertation, we apply the methodology to construct heterogeneous cross-sectional symptom networks. The co-occurrence of symptoms may result from direct interactions between these symptoms, and the symptoms can be treated as a system. In addition, subject-specific risk factors (e.g., genetic variants, age) can also exert external influence on the system. In this work, we develop a covariate-dependent conditional Gaussian graphical model to obtain personalized symptom networks. The strengths of network connections are modeled as a function of covariates to capture the heterogeneity among individuals and subgroups of individuals.
We assess the performance of the proposed method by simulation studies and an application to a Huntington's disease study to investigate the networks of symptoms in different domains (motor, cognitive, psychiatric) and to identify the important brain imaging biomarkers associated with the connections. We show that symptoms in the same domain interact more often with each other than across domains. We validate the findings using subjects' measurements from follow-up visits. In the fourth part of the dissertation, we propose an integrative learning approach to improve the estimation of subject-specific networks of a target modality when external modality data are available. The biomarker networks measured by different modalities of data (e.g., structural magnetic resonance imaging (sMRI), diffusion tensor imaging (DTI)) may share the same true underlying biological mechanism. In this work, we propose a node-wise biomarker graphical model that leverages the shared mechanism between multi-modality data to provide a more reliable estimation of the target modality network and to account for the heterogeneity in networks due to differences between subjects and networks of the external modality. Latent variables are introduced to represent the shared unobserved biological network, and the information from the external modality is incorporated to model the distribution of the underlying biological network. An approximation approach is used to calculate the posterior expectations of the latent variables to reduce computation time. The performance of the proposed method is demonstrated by extensive simulation studies and an application to construct the gray matter brain atrophy network of Huntington's disease using sMRI and DTI data. The estimated network measures are shown to be meaningful for predicting follow-up clinical outcomes in terms of patient stratification and prediction. Lastly, we conclude the dissertation with comments on limitations and extensions.
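A minimal sketch of the two-stage idea behind APM-L0: a convex l1 fit followed by hard thresholding and a refit on the retained support. In the actual procedure the penalty and threshold are tuned by cross-validation, and the reference implementation is the APML0 R package; here both are fixed for illustration:

```python
# Two-stage l0-style selection: convex l1 relaxation, then hard thresholding.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(11)
n, p, k = 120, 200, 5                       # k plays the role of the threshold
X = rng.normal(size=(n, p))
beta_true = np.zeros(p); beta_true[:k] = [3, -2, 2, -1.5, 1]
y = X @ beta_true + rng.normal(0, 1, n)

beta1 = Lasso(alpha=0.1).fit(X, y).coef_    # stage 1: convex regularized fit
keep = np.argsort(-np.abs(beta1))[:k]       # stage 2: hard threshold (top-k support)
beta = np.zeros(p)
beta[keep] = LinearRegression().fit(X[:, keep], y).coef_   # refit on support

print("true support:", list(range(k)), "| selected:", sorted(keep.tolist()))
```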
17

Sun, Ming. "Statistical Methods for Modeling Biomarkers of Neuropsychiatric Diseases." Thesis, 2018. https://doi.org/10.7916/D8GJ11VW.

Abstract:
Due to the lack of a gold-standard objective marker, the current practice for diagnosing neuropsychiatric disorders is mostly based on clinical symptoms, which may occur in the late stage of the disease. Clinical diagnosis is also subject to high variance due to between- and within-subject variability of patient symptomatology and between-clinician variability. Effectively modeling disease course and making early predictions using biomarkers and subtle clinical signs are critical and challenging, both for improving diagnostic accuracy and for designing preventive clinical trials for neurological disorders. Leveraging the domain knowledge that certain biological characteristics (i.e., causal genetic mutation, cognitive reserve) are part of the disease mechanism, we first propose a nonlinear model with random inflection points depending on subject-specific characteristics to jointly estimate the trajectories of the biomarkers. The model scales different biomarkers into comparable progression curves with a temporal order based on the mean inflection point. Meanwhile, it assesses how subject-specific characteristics affect the dynamic trajectories of different markers, which offers information for designing preventive therapeutics and personalized disease management strategies. We use the EM algorithm for estimation. Extensive simulation studies are conducted. The method is applied to biomarkers in the neuroimaging, cognitive, and motor domains of Huntington's disease. Under the same nonlinear random-effects model framework, we propose a second model inspired by neural mass models. Biomarkers are modeled as the average manifestation of the functioning status of neuronal ensembles, and a latent liability score is shared across biomarkers to pool information. We use the EM algorithm for maximum likelihood estimation, and a normal approximation is used to facilitate numerical integration. The results show that some neuroimaging biomarkers are early signs of the onset of Huntington's disease. Finally, we develop an online tool that provides personalized predictions of biomarker trajectories given medical history and baseline measurements. The third model uses a dynamical system based on differential equations to model the evolution of biomarkers. The dynamical system is not only useful for characterizing the temporal patterns of the biomarkers but is also informative about the interactions among them. We propose a semiparametric dynamical system based on multi-index models. For estimation and inference, we consider a two-step procedure based on the integral equations from the proposed model. The algorithm iterates between the estimation of the link function through splines and the estimation of the index parameters, allowing for regularization to achieve sparsity. We prove the model identifiability and derive the asymptotic properties of the model parameters. A benefit of the model and the estimation approach is the ability to pool information from multiple subjects to construct the network of biomarkers and provide inference. We demonstrate the empirical improvement over competing approaches with simulated gene expression data from the third DREAM challenge. Applied to electroencephalogram (EEG) data, the method reveals different effective connectivity of brain networks for patients with alcohol dependence under different cognitive tasks.
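To fix ideas, the sketch below simulates a sigmoidal biomarker trajectory whose inflection point shifts with a subject characteristic (a hypothetical CAG-like covariate) plus a random subject-level shift, and fits the fixed-effects part by least squares. The thesis's full random-effects estimation via EM is not reproduced here; all parameter values are illustrative:

```python
# Sigmoid trajectory with a covariate-dependent (and random) inflection point.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(2)

def traj(inputs, lo, hi, rate, t0, gamma):
    age, cag = inputs
    infl = t0 + gamma * (cag - 43)               # inflection shifts with covariate
    return lo + (hi - lo) / (1 + np.exp(-rate * (age - infl)))

ages = np.tile(np.linspace(30, 70, 9), 30)       # 30 subjects, 9 visits each
cags = np.repeat(rng.integers(40, 50, 30), 9)
shift = np.repeat(rng.normal(0, 2, 30), 9)       # random inflection-point shifts
y = traj((ages + shift, cags), 0, 1, 0.3, 50, -1.5) + rng.normal(0, 0.05, len(ages))

est, _ = curve_fit(traj, (ages, cags), y, p0=[0, 1, 0.2, 48, -1])
print("lo, hi, rate, t0, gamma:", np.round(est, 2))
```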
18

Lee, Annie Jehe. "Statistical Methods for Genetic Studies with Family History of Diseases." Thesis, 2019. https://doi.org/10.7916/d8-bshz-wf55.

Abstract:
The theme of this dissertation is to develop statistical methods for genetic studies with family history of diseases. Family history of disease is a major risk factor for many health outcomes. To study diseases that aggregate in the families of patients, genetic epidemiological studies recruit independent study participants, often referred to as probands. Probands also provide information on their relatives through a family health history interview. However, due to the high cost of in-person collection of blood samples or the death of a relative, dense genotypes are often collected only in probands and not in their family members. In these designs, estimating the genetic risk of a disease or identifying genetic risk factors for a complex disease is challenging due to the unavailable genotypes in relatives as well as the correlation among family members' phenotypes. This dissertation contains three parts to tackle these barriers in family studies: (1) develop methods to estimate the genetic risk of a disease more precisely; (2) develop methods to test for association between genetic markers and correlated phenotypes; and (3) develop methods to control for population substructure and familial relatedness in genome-wide association studies (GWAS). In the first part of the dissertation, we propose a method to estimate the age-specific disease risk of a genetic mutation in family studies that permits adjustment for multiple covariates and interaction effects in the presence of unobserved genotypes in relatives. Compared to our previous nonparametric approaches that do not control for covariates, our semiparametric estimation method allows controlling for individual characteristics such as sex, ethnicity, environmental risk factors, and genotypes at other loci. Moreover, gene-gene and gene-environment interactions can also be handled within the framework of a semiparametric model. The analyses may provide insights into whether demographic or environmental variables play a role in modifying the genetic risk of a disease. We examine the performance of the proposed methods by simulations and apply them to estimate the age-specific cumulative risk of Parkinson's disease (PD) in relatives predicted to carry the LRRK2 G2019S mutation. The utility of the estimated carrier risk is demonstrated by designing a future clinical trial under various assumptions. The second part of the dissertation extends the single-genetic-variant setup of the first part to genome-wide genotype data, focusing on genetic association tests. Here, we propose a computationally efficient multilevel model to analyze the association of a genetic marker with correlated binary phenotypes in family studies. Our method accounts for random polygenic effects as well as shared non-genetic familial effects while handling unavailable genotypes in relatives. To discover genetic variants of a complex disorder that aggregates in the families of patients, we consider the combined data of probands with genome-wide genotypes and family history of disease in relatives (GWAS+FH). To allow large-scale genetic testing in GWAS+FH, we handle the unobserved genotypes and estimate the random effects at reduced computational cost through a fast and stable EM-type algorithm and a score test.
Through simulations, we demonstrate that incorporating family history of disease improves efficiency and the power to detect disease-associated genetic variants over methods using proband data alone, which emphasizes the importance of family studies. Lastly, we apply these methods to discover genetic variants associated with the risk of Alzheimer's disease (AD) using GWAS+FH data collected in the Washington Heights-Inwood Columbia Aging Project (WHICAP) Caribbean Hispanic cohort. We identified several genetic variants that would not have been discovered by GWAS using proband data alone. In the third part of the dissertation, we build on the previously introduced random effects to propose a method for genetic association tests that controls confounding due to familial relatedness in GWAS. It is critical to correct for such confounding in order to minimize spurious associations and maximize the power to detect true association signals. With available pedigree data, our method uses the polygenic effects as well as the shared non-genetic familial effects to control confounding due to familial relatedness in GWAS. Through application to the WHICAP Caribbean Hispanic probands, we show that using the polygenic effects together with the shared familial effects achieves similar or better control of familial relatedness than using principal components in GWAS. Notably, our method controls the confounding that arises from using family history data without requiring dense genotypes in the relatives. We conclude this dissertation by discussing future extensions of this work.
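A key ingredient in such family-based methods is the posterior carrier probability of a relative, combining the Mendelian prior with genotype-specific risk. The sketch below illustrates this Bayes step with invented penetrance values; it is not the dissertation's model:

```python
# Posterior probability that a first-degree relative of a carrier proband
# carries the mutation, given affection status and age (illustrative numbers).
def carrier_posterior(affected, age, prior=0.5):
    F1 = min(0.9, 0.02 * max(age - 40, 0))   # carrier cumulative risk by `age`
    F0 = 0.001 * max(age - 40, 0)            # phenocopy (non-carrier) risk
    like1 = F1 if affected else (1 - F1)     # likelihood under carrier
    like0 = F0 if affected else (1 - F0)     # likelihood under non-carrier
    return prior * like1 / (prior * like1 + (1 - prior) * like0)

# An unaffected 70-year-old is down-weighted; an affected 55-year-old up-weighted:
print(round(carrier_posterior(False, 70), 3), round(carrier_posterior(True, 55), 3))
```

Probabilities of this form serve as the weights over unobserved genotypes in EM-type algorithms such as the one described above.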
19

Liu, Ying. "Statistical Learning Methods for Personalized Medical Decision Making." Thesis, 2016. https://doi.org/10.7916/D8HH6K22.

Abstract:
The theme of my dissertation is merging statistical modeling with medical domain knowledge and machine learning algorithms to assist in making personalized medical decisions. In its simplest form, making personalized medical decisions for treatment choices and disease diagnosis modality choices can be transformed into classification or prediction problems in machine learning, where the optimal decision for an individual is a decision rule that yields the best future clinical outcome or maximizes diagnostic accuracy. However, challenges emerge when analyzing complex medical data. On one hand, statistical modeling is needed to deal with inherent practical complications such as missing data, patients' loss to follow-up, and ethical and resource constraints in randomized controlled clinical trials. On the other hand, new data types and the larger scale of data call for innovations combining statistical modeling, domain knowledge and information technologies. This dissertation contains three parts addressing the estimation of optimal personalized rules for choosing treatment, the estimation of optimal individualized rules for choosing a disease diagnosis modality, and methods for variable selection in the presence of missing data. In the first part of this dissertation, we propose a method to find optimal dynamic treatment regimens (DTRs) in Sequential Multiple Assignment Randomized Trial (SMART) data. DTRs are sequential decision rules tailored at each stage of treatment by potentially time-varying patient features and intermediate outcomes observed in previous stages. The complexity, patient heterogeneity, and chronicity of many diseases and disorders call for learning optimal DTRs that best dynamically tailor treatment to each individual's response over time. We propose a robust and efficient approach referred to as Augmented Multistage Outcome-Weighted Learning (AMOL) to identify optimal DTRs from sequential multiple assignment randomized trials. We improve outcome-weighted learning (Zhao et al., 2012) to allow for negative outcomes; we propose methods to reduce the variability of weights to achieve numerical stability and higher efficiency; and finally, for multiple-stage trials, we introduce robust augmentation to improve efficiency by drawing information from Q-function regression models at each stage. The proposed AMOL remains valid even if the regression model is misspecified. We formally justify that a proper choice of augmentation guarantees smaller stochastic errors in value function estimation for AMOL, and we then establish the convergence rates for AMOL. The comparative advantage of AMOL over existing methods is demonstrated in extensive simulation studies and in applications to two SMART data sets: a two-stage trial for attention deficit hyperactivity disorder and the STAR*D trial for major depressive disorder. The second part of the dissertation introduces a machine learning algorithm to estimate personalized decision rules for medical diagnosis/screening that maximize a weighted combination of sensitivity and specificity. Using subject-specific risk factors and feature variables, such rules administer screening tests with balanced sensitivity and specificity, and thus protect low-risk subjects from unnecessary pain and stress caused by false-positive tests, while achieving high sensitivity for subjects at high risk.
We conducted a simulation study mimicking a real breast cancer study and found significant improvements in sensitivity and specificity when comparing our personalized screening strategy (assigning mammography+MRI to high-risk patients and mammography alone to low-risk subjects based on a composite score of their risk factors) to a one-size-fits-all strategy (assigning mammography+MRI or mammography alone to all subjects). When applied to Parkinson's disease (PD) FDG-PET and fMRI data, we showed that the method provides individualized modality selection that can improve the AUC, and that it yields interpretable decision rules for choosing a brain imaging modality for early detection of PD. To the best of our knowledge, this is the first automatic, data-driven method and learning algorithm for personalized diagnosis/screening strategies proposed in the literature. In the last part of the dissertation, we propose a method, Multiple Imputation Random Lasso (MIRL), to select important variables and to predict the outcome in the presence of missing data, applied to an epidemiological study of Eating and Activity in Teens. In this study, 80% of individuals have at least one variable missing. Therefore, using variable selection methods developed for complete data after list-wise deletion substantially reduces prediction power. Recent work on prediction models in the presence of incomplete data cannot adequately account for large numbers of variables with arbitrary missing patterns. We propose MIRL to combine penalized regression techniques with multiple imputation and stability selection. Extensive simulation studies are conducted to compare MIRL with several alternatives. MIRL outperforms other methods in high-dimensional scenarios in terms of both reduced prediction error and improved variable selection performance, and it has a greater advantage when the correlation among variables is high and the missing proportion is high. MIRL shows improved performance compared with other applicable methods when applied to the study of Eating and Activity in Teens for boys and girls separately, and to a subgroup of low socioeconomic status (SES) Asian boys who are at high risk of developing obesity.
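A minimal sketch of the MIRL recipe described above: multiple imputation, then lasso on resampled halves, then ranking variables by selection frequency. The crude mean-plus-noise imputation below stands in for a proper multiple-imputation engine, and all tuning values are illustrative:

```python
# Multiple imputation + lasso + stability selection (MIRL-style sketch).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 200, 30
X_full = rng.normal(size=(n, p))
y = 2 * X_full[:, 0] - X_full[:, 1] + rng.normal(0, 1, n)
X = X_full.copy()
X[rng.random((n, p)) < 0.2] = np.nan            # 20% of values missing

sel_freq = np.zeros(p)
M, B = 5, 20                                    # imputations x subsamples
for _ in range(M):
    Xi = X.copy()
    miss = np.isnan(Xi)
    mu = np.nanmean(Xi, axis=0)
    Xi[miss] = (mu + rng.normal(0, 0.5, (n, p)))[miss]   # stochastic imputation
    for _ in range(B):                          # stability selection on halves
        idx = rng.choice(n, n // 2, replace=False)
        sel_freq += Lasso(alpha=0.1).fit(Xi[idx], y[idx]).coef_ != 0

sel_freq /= M * B
print("top variables by selection frequency:", np.argsort(-sel_freq)[:5])
```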
20

Zabor, Emily Craig. "Statistical methods for the study of etiologic heterogeneity." Thesis, 2019. https://doi.org/10.7916/d8-22xy-kf52.

Abstract:
Traditionally, cancer epidemiologists have investigated the causes of disease under the premise that patients with a certain site of disease can be treated as a single entity. Then risk factors associated with the disease are identified through case-control or cohort studies for the disease as a whole. However, with the rise of molecular and genomic profiling, in recent years biologic subtypes have increasingly been identified. Once subtypes are known, it is natural to ask the question of whether they share a common etiology, or in fact arise from distinct sets of risk factors, a concept known as etiologic heterogeneity. This dissertation seeks to evaluate methods for the study of etiologic heterogeneity in the context of cancer research and with a focus on methods for case-control studies. First, a number of existing regression-based methods for the study of etiologic heterogeneity in the context of pre-defined subtypes are compared using a data example and simulation studies. This work found that a standard polytomous logistic regression approach performs at least as well as more complex methods, and is easy to implement in standard software. Next, simulation studies investigate the statistical properties of an approach that combines the search for the most etiologically distinct subtype solution from high dimensional tumor marker data with estimation of risk factor effects. The method performs well when appropriate up-front selection of tumor markers is performed, even when there is confounding structure or high-dimensional noise. And finally, an application to a breast cancer case-control study demonstrates the usefulness of the novel clustering approach to identify a more risk heterogeneous class solution in breast cancer based on a panel of gene expression data and known risk factors.
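The standard polytomous logistic approach mentioned above can be sketched in a few lines: controls and two case subtypes are modeled jointly, and etiologic heterogeneity corresponds to a difference between the subtype-specific log odds ratios for a risk factor. A simulated illustration (not the dissertation's analysis):

```python
# Polytomous (multinomial) logistic regression for etiologic heterogeneity.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(9)
n = 3000
x = rng.normal(size=(n, 1))                         # risk factor
# 0 = control, 1 = subtype A (x raises risk), 2 = subtype B (no effect)
lin1, lin2 = -2 + 1.0 * x[:, 0], -2 + 0.0 * x[:, 0]
den = 1 + np.exp(lin1) + np.exp(lin2)
p1, p2 = np.exp(lin1) / den, np.exp(lin2) / den
u = rng.random(n)
yc = np.where(u < p1, 1, np.where(u < p1 + p2, 2, 0))

fit = LogisticRegression(C=1e6, max_iter=1000).fit(x, yc)  # large C ~ plain MLE
b = fit.coef_ - fit.coef_[0]          # contrast each subtype against controls
print("log-OR, subtype A vs control:", b[1, 0].round(2),
      "| subtype B vs control:", b[2, 0].round(2))
# Etiologic heterogeneity: test whether the two subtype-specific log-ORs differ.
```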
21

Chen, Tianle. "Statistical modeling and statistical learning for disease prediction and classification." Thesis, 2014. https://doi.org/10.7916/D8222RX9.

Abstract:
This dissertation studies prediction and classification models for disease risk through semiparametric modeling and statistical learning. It consists of three parts. In the first part, we propose several survival models to analyze data from the Cooperative Huntington's Observational Research Trial (COHORT), accounting for the missing mutation status of relative participants (Kieburtz and Huntington Study Group, 1996a). Huntington's disease (HD) is a progressive neurodegenerative disorder caused by an expansion of cytosine-adenine-guanine (CAG) repeats at the IT15 gene. A CAG repeat number greater than or equal to 36 is defined as carrying the mutation, and carriers will eventually show symptoms if not censored by other events. There is an inverse relationship between the age-at-onset of HD and the CAG repeat length: the greater the CAG expansion, the earlier the age-at-onset. Accurate estimation of age-at-onset based on CAG repeat length is important for genetic counseling and the design of clinical trials for HD. Participants in COHORT (denoted as probands) undergo a genetic test and their CAG repeat number is determined. Family members of the probands do not undergo the genetic test, and their HD onset information is provided by the probands. Several methods have been proposed in the literature to model the age-specific cumulative distribution function (CDF) of HD onset as a function of the CAG repeat length. However, none of the existing methods can be directly used to analyze the COHORT proband and family data because family members' mutation status is not always known. In this work, we treat the presence or absence of an expanded CAG repeat in first-degree family members as missing data and use the expectation-maximization (EM) algorithm to carry out maximum likelihood estimation from the COHORT proband and family data jointly. We perform simulation studies to examine the finite-sample performance of the proposed methods and apply these methods to estimate the CDF of HD age-at-onset from the combined COHORT proband and family data. Our results show a slightly lower estimated cumulative risk of HD with the combined data compared to using proband data alone. We then extend the approach to predict the cumulative risk of disease, accommodating predictors with time-varying effects and outcomes subject to censoring. We model the time-specific effects through a nonparametric varying-coefficient function and handle censoring through self-consistency equations that redistribute the probability mass of censored outcomes to the right. The computational procedure is extremely convenient and can be implemented with standard software. We prove large-sample properties of the proposed estimator and evaluate its finite-sample performance through simulation studies. We apply the method to estimate the cumulative risk of developing HD for the mutation carriers in the COHORT data and illustrate an inverse relationship between the cumulative risk of HD and the length of CAG repeats at the IT15 gene. In the second part of the dissertation, we develop methods to accurately predict whether pre-symptomatic individuals are at risk of a disease based on their various marker profiles, which offers an opportunity for early intervention well before definitive clinical diagnosis. For many diseases, the existing clinical literature may suggest that the risk of disease varies with some markers of biological and etiological importance, for example age.
To identify effective prediction rules using nonparametric decision functions, standard statistical learning approaches treat markers with clear biological importance (e.g., age) and other markers without prior knowledge on disease etiology interchangeably as input variables. Therefore, these approaches may be inadequate in singling out and preserving the effects from the biologically important variables, especially in the presence of potential noise markers. Using age as an example of a salient marker to receive special care in the analysis, we propose a local smoothing large margin classifier implemented with support vector machine to construct effective age-dependent classification rules. The method adaptively adjusts age effect and separately tunes age and other markers to achieve optimal performance. We derive the asymptotic risk bound of the local smoothing support vector machine, and perform extensive simulation studies to compare with standard approaches. We apply the proposed method to two studies of premanifest HD subjects and controls to construct age-sensitive predictive scores for the risk of HD and risk of receiving HD diagnosis during the study period. In the third part of the dissertation, we develop a novel statistical learning method for longitudinal data. Predicting disease risk and progression is one of the main goals in many clinical studies. Cohort studies on the natural history and etiology of chronic diseases span years and data are collected at multiple visits. Although kernel-based statistical learning methods are proven to be powerful for a wide range of disease prediction problems, these methods are only well studied for independent data but not for longitudinal data. It is thus important to develop time-sensitive prediction rules that make use of the longitudinal nature of the data. We develop a statistical learning method for longitudinal data by introducing subject-specific long-term and short-term latent effects through designed kernels to account for within-subject correlation of longitudinal measurements. Since the presence of multiple sources of data is increasingly common, we embed our method in a multiple kernel learning framework and propose a regularized multiple kernel statistical learning with random effects to construct effective nonparametric prediction rules. Our method allows easy integration of various heterogeneous data sources and takes advantage of correlation among longitudinal measures to increase prediction power. We use different kernels for each data source, taking advantage of the distinctive features of each data modality, and then optimally combine data across modalities. We apply the developed methods to two large epidemiological studies, one on Huntington's disease and the other on Alzheimer's Disease (Alzheimer's Disease Neuroimaging Initiative, ADNI), where we explore a unique opportunity to combine imaging and genetic data to predict the conversion from mild cognitive impairment to dementia, and show a substantial gain in performance while accounting for the longitudinal feature of data.
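The EM handling of unknown carrier status in the first project lends itself to a compact illustration. The Python sketch below is a deliberately simplified toy, assuming a parametric normal onset-age distribution, a Mendelian carrier prior of 0.5 for first-degree relatives, and simple right-censoring; all names and values are invented, and this is not the dissertation's actual nonparametric estimator.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# Toy setup: relatives of a carrier proband carry the expansion with prior 0.5.
# Carriers' onset age ~ Normal(mu, sigma); non-carriers never develop HD.
# We observe an onset age (delta=1) or a censoring age with no onset (delta=0);
# the carrier indicator itself is the missing data handled by EM.
rng = np.random.default_rng(1)
n = 2000
carrier = rng.random(n) < 0.5
onset = rng.normal(46.0, 9.0, n)              # latent onset ages
cens = rng.uniform(30.0, 80.0, n)             # ages at last follow-up
t = np.where(carrier & (onset <= cens), onset, cens)
delta = (carrier & (onset <= cens)).astype(float)

def e_step(mu, sigma):
    # Posterior carrier probability; an observed onset implies carrier status.
    surv = norm.sf(t, mu, sigma)              # P(no onset by age t | carrier)
    w_cens = 0.5 * surv / (0.5 * surv + 0.5)
    return np.where(delta == 1.0, 1.0, w_cens)

def m_step(w):
    # Maximize the weighted censored-normal log-likelihood numerically.
    def negloglik(par):
        mu, sigma = par[0], np.exp(par[1])
        ll = np.where(delta == 1.0, norm.logpdf(t, mu, sigma),
                      norm.logsf(t, mu, sigma))
        return -np.sum(w * ll)
    res = minimize(negloglik, x0=[50.0, np.log(10.0)], method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])

mu, sigma = 50.0, 10.0
for _ in range(30):
    mu, sigma = m_step(e_step(mu, sigma))
print(f"estimated onset distribution: mu={mu:.1f}, sigma={sigma:.1f}")
```

The E-step is just Bayes' rule on carrier status; everything specific to the dissertation (the nonparametric CDF, family structure, and proband likelihood) would replace the normal working model used here.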
APA, Harvard, Vancouver, ISO, and other styles
22

Chen, Yuan. "Statistical and Machine Learning Methods for Precision Medicine." Thesis, 2021. https://doi.org/10.7916/d8-j4tw-wa07.

Full text
Abstract:
Heterogeneous treatment responses are commonly observed in patients with mental disorders. Thus, a universal treatment strategy may not be adequate, and tailored treatments adapted to individual characteristics could improve treatment responses. The theme of the dissertation is to develop statistical and machine learning methods to address patient heterogeneity and derive robust and generalizable individualized treatment strategies by integrating evidence from multi-domain data and multiple studies to achieve precision medicine. Unique challenges arising from the research of mental disorders need to be addressed in order to facilitate personalized medical decision-making in clinical practice. This dissertation contains four projects to achieve these goals while addressing the challenges: (i) a statistical method to learn dynamic treatment regimes (DTRs) by synthesizing independent trials over different stages when sequential randomization data are not available; (ii) a statistical method to learn optimal individualized treatment rules (ITRs) for mental disorders by modeling patients' latent mental states using probabilistic generative models; (iii) an integrative learning algorithm to incorporate multi-domain and multi-treatment-phase measures for optimizing individualized treatments; (iv) a statistical machine learning method to optimize ITRs that can benefit subjects in a target population for mental disorders with improved learning efficiency and generalizability. DTRs adaptively prescribe treatments based on patients' intermediate responses and evolving health status over multiple treatment stages. Data from sequential multiple assignment randomized trials (SMARTs) are recommended for learning DTRs. However, due to the re-randomization of the same patients over multiple treatment stages and a prolonged follow-up period, SMARTs are often difficult to implement and costly to manage, and patient adherence is always a concern in practice. To lessen such practical challenges, in the first part of the dissertation, we propose an alternative approach to learn optimal DTRs by synthesizing independent trials over different stages without using data from SMARTs. Specifically, at each stage, data from a single randomized trial along with patients' natural medical history and health status in previous stages are used. We use a backward learning method to estimate optimal treatment decisions at a particular stage, where patients' future optimal outcome increment is estimated using data observed from independent trials with future stages' information. Under some conditions, we show that the proposed method yields consistent estimation of the optimal DTRs, and we obtain the same learning rates as those from SMARTs. We conduct simulation studies to demonstrate the advantage of the proposed method. Finally, we learn DTRs for treating major depressive disorder (MDD) by stage-wise synthesis of two randomized trials. We perform a validation study on independent subjects and show that the synthesized DTRs lead to the greatest MDD symptom reduction compared to alternative methods. The second part of the dissertation focuses on optimizing individualized treatments for mental disorders. Due to disease complexity, substantial diversity in patients' symptomatology within the same diagnostic category is widely observed.
Leveraging the measurement model theory in psychiatry and psychology, we learn a patient's intrinsic latent mental status from psychological or clinical symptoms under a probabilistic generative model, the restricted Boltzmann machine (RBM), through which patients' heterogeneous symptoms are represented using a parsimonious number of latent variables while remaining flexible. These latent mental states serve as a better characterization of the underlying disorder status than a simple summary score of the symptoms. They also serve as more reliable and representative features to differentiate treatment responses. We then optimize a value function defined by the latent states after treatment by exploiting a transformation of the observed symptoms based on the RBM, without modeling the relationship between the latent mental states before and after treatment. The optimal treatment rules are derived using a weighted large margin classifier. We derive the convergence rate of the proposed estimator under the latent models. Simulation studies are conducted to test the performance of the proposed method. Finally, we apply the developed method to real-world studies. We demonstrate the utility and advantage of our method in tailoring treatments for patients with major depression and identify patient subgroups informative for treatment recommendations. In the third part of the dissertation, based on the general framework introduced in the previous part, we propose an integrated learning algorithm that can simultaneously learn patients' underlying mental states and recommend optimal treatments for each individual with improved learning efficiency. It allows incorporation of both the pre- and post-treatment outcomes in learning the invariant latent structure and allows integration of outcome measures from different domains to characterize patients' mental health more comprehensively. A multi-layer neural network is used to allow complex treatment effect heterogeneity. Optimal treatment policy can be inferred for future patients by comparing their potential mental states under different treatments given the observed multi-domain pre-treatment measurements. Experiments on simulated data and real-world clinical trial data show that the learned treatment policies compare favorably to alternative methods in the presence of heterogeneous treatment effects and have broad utility, leading to better patient outcomes across multiple domains. The fourth part of the dissertation aims to infer optimal treatments of mental disorders for a target population, considering the potential distribution disparities between the patient data collected in a study and the target population of interest. To achieve that, we propose a learning approach that connects measurement theory, an efficient weighting procedure, and a flexible neural network architecture through latent variables. In our method, patients' underlying mental states are represented by a reduced number of latent state variables, allowing for incorporating domain knowledge, and the invariant latent structure is preserved for interpretability and validity. Subject-specific weights to balance population differences are constructed using these compact latent variables, which capture the major variations and facilitate the weighting procedure due to the reduced dimensionality. Data from multiple studies can be integrated to learn the latent structure to improve learning efficiency and generalizability.
Extensive simulation studies demonstrate consistent superiority of the proposed method and the weighting scheme over alternative methods when applied to the target population. We also apply the method to real-world studies to recommend treatments to patients with major depressive disorder, demonstrating the broader utility of the ITRs learned from the proposed method in improving the mental states of patients in the target population.
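The first project's backward synthesis of independent single-stage trials can be caricatured with plain least-squares Q-functions. The sketch below is a hypothetical two-stage toy with linear working models and invented data-generating values, not the dissertation's estimator; it only shows the plug-in pseudo-outcome idea (stage-1 outcome plus the estimated optimal stage-2 value).

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_q(X, a, y):
    # Least-squares Q-function with treatment interactions:
    # Q(x, a) = [1, x, a, a*x] @ beta
    D = np.column_stack([np.ones(len(a)), X, a, a[:, None] * X])
    beta, *_ = np.linalg.lstsq(D, y, rcond=None)
    return beta

def q_value(beta, X, a):
    D = np.column_stack([np.ones(len(X)), X, a, a[:, None] * X])
    return D @ beta

# --- Stage-2 trial (independent sample): outcome depends on state and A2.
n2 = 1000
X2 = rng.normal(size=(n2, 1))
A2 = rng.integers(0, 2, n2)
Y2 = 1.0 + 0.5 * X2[:, 0] + A2 * (1.0 - 2.0 * X2[:, 0]) + rng.normal(size=n2)
beta2 = fit_q(X2, A2, Y2)

# --- Stage-1 trial: its end-of-stage state feeds the stage-2 value.
n1 = 1000
X1 = rng.normal(size=(n1, 1))
A1 = rng.integers(0, 2, n1)
state_after = X1[:, 0] + 0.3 * A1 + rng.normal(scale=0.5, size=n1)
Y1 = 0.5 + A1 * (0.5 + X1[:, 0]) + rng.normal(size=n1)

# Plug-in optimal stage-2 value for each stage-1 subject (pseudo-outcome).
S = state_after[:, None]
v2 = np.maximum(q_value(beta2, S, np.zeros(n1)), q_value(beta2, S, np.ones(n1)))
beta1 = fit_q(X1, A1, Y1 + v2)

# Recommended stage-1 treatment for a new patient with x = 0.8:
x_new = np.array([[0.8]])
q1 = q_value(beta1, x_new, np.ones(1))[0]
q0 = q_value(beta1, x_new, np.zeros(1))[0]
print("recommended stage-1 treatment:", int(q1 > q0))
```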
APA, Harvard, Vancouver, ISO, and other styles
23

Gabaitiri, Lesego. "Likelihood based statistical methods for estimating HIV incidence rate." Thesis, 2013. http://hdl.handle.net/10413/9212.

Full text
Abstract:
Estimation of current levels of human immunodeficiency virus (HIV) incidence is essential for monitoring the impact of an epidemic, determining public health priorities, assessing the impact of interventions and for planning purposes. However, there is often insufficient data on incidence as compared to prevalence. A direct approach is to estimate incidence from longitudinal cohort studies. Although this approach can provide a direct and unbiased measure of incidence for settings where the study is conducted, it is often too expensive and time consuming. An alternative approach is to estimate incidence from a cross-sectional survey using biomarkers that distinguish between recent and non-recent/longstanding infections. The original biomarker-based approach proposes the detection of HIV-1 p24 antigen in the pre-seroconversion period to identify persons with acute infection for estimating HIV incidence. However, this approach requires large sample sizes in order to obtain reliable estimates of HIV incidence because the duration of antigenemia before antibody detection is short, about 22.5 days. Subsequently, another method that involves a dual antibody testing system was developed. In stage one, a sensitive test is used to diagnose HIV infection, and a less sensitive test is used in the second stage to distinguish between longstanding infections and recent infections among those who tested positive for HIV in stage one. The question is: how do we combine these data with other relevant information, such as the period an individual takes from being undetectable by a less sensitive test to being detectable, to estimate incidence? The main objective of this thesis is therefore to develop likelihood-based methods that can be used to estimate HIV incidence when data are derived from cross-sectional surveys and the disease classification is achieved by combining two biomarker or assay tests. The thesis builds on the dual antibody testing approach and extends the statistical framework that uses the multinomial distribution to derive the maximum likelihood estimators of HIV incidence for different settings. In order to improve incidence estimation, we develop a model for estimating HIV incidence that incorporates information on the previous or past prevalence and derive maximum likelihood estimators of incidence assuming incidence density is constant over a specified period. Later, we extend the method to settings where a proportion of subjects remain non-reactive to a less sensitive test long after seroconversion. Diagnostic tests used to determine recent infections are prone to errors. To address this problem, we considered a method that simultaneously makes adjustments for sensitivity and specificity. In addition, we also showed that sensitivity is similar to the proportion of subjects who eventually transit the “recent infection” state. We also relax the assumption of constant incidence density by proposing a linear incidence density to accommodate settings where incidence might be declining or increasing. We extend the standard adjusted model for estimating incidence to settings where some subjects who tested positive for HIV antibodies were not tested by a less sensitive test, resulting in missing outcome data. Models for the risk factors (covariates) of HIV incidence are considered in the penultimate chapter. We used data from the Botswana AIDS Impact Survey (BAIS) III of 2008 to illustrate the proposed methods. The general conclusion and recommendations for future work are provided in the final chapter.
Thesis (Ph.D.)-University of KwaZulu-Natal, Pietermaritzburg, 2013.
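For orientation, the basic cross-sectional "recent infection" calculation that the thesis's multinomial likelihood framework refines can be sketched as follows. The counts and the mean window period omega below are invented, and the full method additionally adjusts for sensitivity, specificity, and subjects who never react to the less sensitive assay.

```python
import numpy as np

# With N tested, H HIV-positive, R of the positives classified "recent" by the
# less sensitive assay, and omega the mean window period (years) spent in the
# recent state, a snapshot incidence estimate among susceptibles is:
#   lambda = R / (omega * (N - H))
N, H, R = 5000, 900, 60
omega = 170 / 365.25                      # assumed mean window period (~170 days)
suscept = N - H
lam = R / (omega * suscept)
se = np.sqrt(R) / (omega * suscept)       # Poisson approximation for the count R
print(f"incidence: {lam:.4f} per person-year "
      f"(95% CI {lam - 1.96 * se:.4f} to {lam + 1.96 * se:.4f})")
```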
APA, Harvard, Vancouver, ISO, and other styles
24

Drill, Esther. "Statistical Methods for Integrated Cancer Genomic Data Using a Joint Latent Variable Model." Thesis, 2018. https://doi.org/10.7916/D85M7P7V.

Full text
Abstract:
Inspired by the TCGA (The Cancer Genome Atlas), we explore multimodal genomic datasets with integrative methods using a joint latent variable approach. We use iCluster+, an existing clustering method for integrative data, to identify potential subtypes within TCGA sarcoma and mesothelioma tumors, and across a large cohort of 33 different TCGA cancer datasets. For classification, motivated by the goal of improving the prediction of platinum resistance in high grade serous ovarian cancer (HGSOC) treatment, we propose a novel integrative method, iClassify, to perform classification using a joint latent variable model. iClassify provides effective data integration and classification while handling heterogeneous data types, and provides a natural framework to incorporate covariate risk factors and examine genomic driver by covariate risk factor interaction. Feature selection is performed through a thresholding parameter that combines both latent variable and feature coefficients. We demonstrate increased accuracy in classification over methods that assume a homogeneous data type, such as linear discriminant analysis and penalized logistic regression, and improved feature selection. We apply iClassify to a TCGA cohort of HGSOC patients with three types of genomic data and platinum response data. This methodology has broad applications beyond predicting treatment outcomes and disease progression in cancer, including predicting prognosis and diagnosis in other diseases with major public health implications.
APA, Harvard, Vancouver, ISO, and other styles
25

Wang, Qinxia. "Statistical Methods for Modeling Progression and Learning Mechanisms of Neuropsychiatric Disorders." Thesis, 2021. https://doi.org/10.7916/d8-cngh-ty69.

Full text
Abstract:
The theme of this dissertation focuses on developing statistical models to learn progression dynamics and mechanisms of neuropsychiatric disorders using data from various domains. Due to limited knowledge about the underlying pathological processes in neurological disorders, it remains a challenge to establish reliable diagnostic criteria and predict disease prognosis in the presence of substantial phenotypic heterogeneity. As a result, current diagnosis and treatment of neurological disorders often rely on late-stage clinical symptoms, which poses barriers for developing effective interventions at the premanifest stage. It is crucial to characterize the temporal disease progression course and study the underlying mechanisms using clinical assessments, blood biomarkers, and neuroimaging biomarkers to evaluate disease stages, identify markers that are useful for early clinical diagnosis, compare or monitor treatment effects and accelerate drug discovery. We propose three projects to tackle challenges in leveraging multi-domain biomarkers and clinical symptoms to learn disease dynamics and progression of neurological disorders: (1) A nonlinear mixture model with subject-specific random inflection points to jointly fit multiple longitudinal markers and estimate marker progression trajectories in a single modality; (2) A multi-layer exponential family factor model integrating multi-domain data to learn a lower-dimensional latent space of disease impairment and fully map disease risk and progression; (3) A latent state space model that jointly analyzes multi-channel EEG signals and learns dynamics of different sources corresponding to brain cortical activities. In addition, motivated by the ongoing COVID-19 pandemic, we propose a parsimonious survival-convolution model to predict daily new cases and estimate the time-varying reproduction numbers to evaluate effects of mitigation strategies. In the first project, we propose a nonlinear mixture model with random time shifts to jointly estimate long-term progression trajectories using multivariate discrete longitudinal outcomes. The model can identify early disease markers, their orders of occurrence, and the rates of impairment. Specifically, a latent binary variable representing disease susceptibility status incorporates subject covariates (e.g., biological measures) in the mixture model to capture between-subject heterogeneity. Measures of disease impairment for susceptible patients are modeled jointly under the exponential family framework. Our model allows for subject-specific and marker-specific inflection points associated with patients' characteristics (e.g., genetic mutation) to indicate a critical time when the fastest degeneration occurs. Furthermore, it uses subject-specific latent scores shared among markers to improve efficiency. The model is estimated using an EM algorithm. Extensive simulation studies are conducted to demonstrate validity of the proposed method and algorithm. Lastly, we apply our method to the Parkinson's Progression Markers Initiative (PPMI) and show its utility in identifying early disease signs and comparing clinical symptomatology between the genetic form of Parkinson's Disease (PD) and idiopathic PD. In the second project, we tackle challenges to leverage multi-domain markers to learn early disease progression of neurological disorders.
We propose to integrate heterogeneous types of measures from multiple domains (e.g., discrete clinical symptoms, ordinal cognitive markers, continuous neuroimaging and blood biomarkers) using a hierarchical Multi-layer Exponential Family Factor (MEFF) model, where the observations follow exponential family distributions with lower-dimensional latent factors. The latent factors are decomposed into shared factors across multiple domains and domain-specific factors, where the shared factors provide robust information to perform behavioral phenotyping and partition patients into clinically meaningful and biologically homogeneous subgroups. Domain-specific factors capture the remaining unique variations for each domain. The MEFF model also captures the nonlinear trajectory of disease progression and orders critical events of neurodegeneration measured by each marker. To overcome computational challenges, we fit our model by approximate inference techniques for large-scale data. We apply the developed method to Parkinson's Progression Markers Initiative (PPMI) data to integrate biological, clinical and cognitive markers arising from heterogeneous distributions. The model learns lower-dimensional representations of Parkinson's disease and the temporal ordering of the neurodegeneration of PD. In the third project, we propose methods that can be used to analyze multi-channel electroencephalogram (EEG) signals intensively measured at a high temporal resolution. Modern neuroimaging technologies have substantially advanced the measurement of brain activities. EEG, as a non-invasive neuroimaging technique, measures changes in electrical voltage on the scalp induced by cortical activities. With its high temporal resolution, EEG has emerged as an increasingly useful tool to study brain connectivity. Challenges with modeling EEG signals of complex brain activities include interactions among unknown sources, low signal-to-noise ratio and substantial between-subject heterogeneity. In this work, we propose a state space model that jointly analyzes multi-channel EEG signals and learns dynamics of different sources corresponding to brain cortical activities. Our model borrows strength from spatially correlated measurements and uses low-dimensional latent sources to explain all observed channels. The model can account for patient heterogeneity and quantify the effect of a subject's covariates on the latent space. The EM algorithm, Kalman filtering, and bootstrap resampling are used to fit the state space model and provide comparisons between patient diagnostic groups. We apply the developed approach to a case-control study of alcoholism and reveal significant attenuation of brain activities in response to visual stimuli in alcoholic subjects compared to healthy controls. Lastly, motivated by the ongoing COVID-19 pandemic, we propose a robust and parsimonious survival-convolution model aiming to predict the COVID-19 disease course and compare the effectiveness of mitigation measures across countries to inform policy decision making. We account for transmission during a pre-symptomatic incubation period and use a time-varying effective reproduction number to reflect the temporal trend of transmission and change in response to a public health intervention. We estimate the intervention effect on reducing the infection rate using a natural experiment design and quantify uncertainty by permutation.
In China and South Korea, we predicted the entire disease epidemic using only early phase data (two to three weeks after the outbreak). A fast rate of decline in the reproduction number was observed, and adopting mitigation strategies early in the epidemic was effective in reducing the infection rate in these two countries. The nationwide lockdown in Italy did not accelerate the decline in the infection rate. In the United States, the reproduction number significantly decreased during a 2-week period after the declaration of national emergency, but declined at a much slower rate afterwards. If the trend continues after May 1, COVID-19 may be controlled by late July. However, a loss of temporal effect (e.g., due to relaxing mitigation measures after May 1) could lead to a long delay in controlling the epidemic.
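The convolution idea behind the COVID-19 model can be caricatured with a renewal-type recursion in which today's expected cases are a reproduction-number-scaled convolution of past cases with an infectiousness profile. The sketch below uses an assumed gamma serial-interval profile and a step-change R_t; it illustrates the general mechanism, not the paper's exact survival-convolution parameterization.

```python
import numpy as np
from scipy.stats import gamma

days = 120
# Assumed serial-interval weights over lags 1..14 days (illustrative shape).
w = gamma(a=2.5, scale=2.6).pdf(np.arange(1, 15))
w /= w.sum()
# Time-varying reproduction number dropping after an intervention on day 30.
R = np.where(np.arange(days) < 30, 2.5, 0.8)

cases = np.zeros(days)
cases[0] = 10.0
for t in range(1, days):
    past = cases[max(0, t - len(w)):t][::-1]    # most recent infections first
    cases[t] = R[t] * np.sum(past * w[:len(past)])
print(f"peak daily incidence: day {cases.argmax()}, about {cases.max():.0f} cases")
```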
APA, Harvard, Vancouver, ISO, and other styles
26

"Clustering Algorithm for Zero-Inflated Data." Thesis, 2020. https://doi.org/10.7916/d8-kze1-fr94.

Full text
Abstract:
Zero-inflated data are common in biomedical research. In cluster analysis, the heuristic approach fails to provide inferential properties for the outcome, while the existing model-based approach only works in the case of a mixture of multivariate normal distributions. In this dissertation, I developed two new model-based clustering algorithms: the multivariate zero-inflated log-normal and the multivariate zero-inflated Poisson clustering algorithms. I then applied these methods to questionnaire data and compared the resulting clusters to the ones derived from assuming a multivariate normal distribution. Associations between clustering results and clinical outcomes were also investigated.
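A minimal sketch of the model-based idea follows, showing only the cluster-responsibility (E-step) computation for a two-cluster multivariate zero-inflated Poisson model with coordinates independent given the cluster; all parameter values are illustrative rather than estimated.

```python
import numpy as np
from scipy.stats import poisson

def zip_logpmf(x, pi0, lam):
    # log P(x) for a zero-inflated Poisson: zero with prob pi0, else Poisson(lam).
    base = np.log1p(-pi0) + poisson.logpmf(x, lam)
    zero = np.logaddexp(np.log(pi0), base)
    return np.where(x == 0, zero, base)

X = np.array([[0, 3, 0], [5, 0, 2], [0, 0, 1]])        # subjects x count items
pi0 = np.array([[0.6, 0.2, 0.7], [0.1, 0.5, 0.2]])     # per-cluster zero-inflation
lam = np.array([[1.0, 4.0, 0.5], [6.0, 1.5, 3.0]])     # per-cluster Poisson means
weights = np.array([0.5, 0.5])                         # mixing proportions

loglik = np.stack(
    [zip_logpmf(X, pi0[k], lam[k]).sum(axis=1) for k in range(2)], axis=1)
logpost = loglik + np.log(weights)
logpost -= logpost.max(axis=1, keepdims=True)          # numerical stabilization
post = np.exp(logpost)
post /= post.sum(axis=1, keepdims=True)
print(np.round(post, 3))   # responsibility of each cluster for each subject
```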
APA, Harvard, Vancouver, ISO, and other styles
27

Gibson, Elizabeth Atkeson. "Statistical and Machine Learning Methods for Pattern Identification in Environmental Mixtures." Thesis, 2021. https://doi.org/10.7916/d8-tnfc-et36.

Full text
Abstract:
Background: Statistical and machine learning techniques are now being incorporated into high-dimensional mixture research to overcome issues with traditional methods. Though some methods perform well on specific tasks, no method consistently outperforms all others in complex mixture analyses, largely because different methods were developed to answer different research questions. The research presented here concentrates on answering a single mixtures question: Are there exposure patterns within a mixture corresponding with sources or behaviors that give rise to exposure? Objective: This dissertation details work to design, adapt, and apply pattern recognition methods to environmental mixtures and introduces two methods adapted to specific challenges of environmental health data, (1) Principal Component Pursuit (PCP) and (2) Bayesian non-parametric non-negative matrix factorization (BN²MF). We build on this work to characterize the relationship between identified patterns of in utero endocrine disrupting chemical (EDC) exposure and child neurodevelopment. Methods: PCP---a dimensionality reduction technique in computer vision---decomposes the exposure mixture into a low-rank matrix of consistent patterns and a sparse matrix of unique or extreme exposure events. We incorporated two existing PCP extensions that suit environmental data, (1) a non-convex rank penalty, and (2) a formulation that removes the need for parameter tuning. We further adapted PCP to accommodate environmental mixtures by including (1) a non-negativity constraint, (2) a modified algorithm to allow for missing values, and (3) a separate penalty for measurements below the limit of detection (PCP-LOD). BN²MF decomposes the exposure mixture into three parts, (1) a matrix of chemical loadings on identified patterns, (2) a matrix of individual scores on identified patterns, and (3) a diagonal matrix of pattern weights. It places non-negative continuous priors on pattern loadings, weights, and individual scores and uses a non-parametric sparse prior on the pattern weights to estimate the optimal number of patterns. We extended BN²MF to explicitly account for uncertainty in identified patterns by estimating the full distribution of scores and loadings. To test both methods, we simulated data to represent environmental mixtures with various structures, altering the level of complexity in the patterns, the noise level, the number of patterns, the size of the mixture, and the sample size. We evaluated PCP-LOD's performance against principal component analysis (PCA), and we evaluated BN²MF's performance against PCA, factor analysis, and frequentist non-negative matrix factorization (NMF). For all methods, we compared their solutions with true simulated values to measure performance. We further assessed BN²MF's coverage of true simulated scores. We applied PCP-LOD to an exposure mixture of 21 persistent organic pollutants (POPs) measured in 1,000 U.S. adults from the 2001--2002 National Health and Nutrition Examination Survey (NHANES). We applied BN²MF to an exposure mixture of 17 EDCs measured in 343 pregnant women in the Columbia Center for Children’s Environmental Health's Mothers and Newborns Cohort. Finally, we designed a two-stage Bayesian hierarchical model to estimate health effects of environmental exposure patterns while incorporating the uncertainty of pattern identification. In the first stage, we identified EDC exposure patterns using BN²MF.
In the second stage, we included individual pattern scores and their distributions as exposures of interest in a hierarchical regression model, with child IQ as the outcome, adjusting for potential confounders. We present sex-specific results. Results: PCP-LOD recovered the true number of patterns through cross-validation for all simulations; based on an a priori specified criterion, PCA recovered the true number of patterns in 32% of simulations. PCP-LOD achieved lower relative predictive error than PCA for all simulated datasets with up to 50% of the data < LOD. When 75% of values were < LOD, PCP-LOD outperformed PCA only when noise was low. In the POP mixture, PCP-LOD identified a rank-three underlying structure. One pattern represented comprehensive exposure to all POPs. The other two patterns grouped chemicals based on known properties such as structure and toxicity. PCP-LOD also separated 6% of values as extreme events. Most participants had no extreme exposures (44%) or only extremely low exposures (18%). BN²MF estimated the true number of patterns for 99% of simulated datasets. BN²MF's variational confidence intervals achieved 95% coverage across all levels of structural complexity with up to 40% added noise. BN²MF performed comparably with frequentist methods in terms of overall prediction and estimation of underlying loadings and scores. We identified two patterns of EDC exposure in pregnant women, corresponding with diet and personal care product use as potentially separate sources or behaviors leading to exposure. The diet pattern expressed exposure to phthalates and BPA. A one-standard-deviation increase in this pattern was associated with a decrease of 3.5 IQ points (95% credible interval: -6.7, -0.3), on average, in female children but not in males. The personal care product pattern represented exposure to phenols, including parabens, and diethyl phthalate. We found no associations between this pattern and child cognition. Conclusion: PCP-LOD and BN²MF address limitations of existing pattern recognition methods employed in this field such as user-specified pattern number, lack of interpretability of patterns in terms of human understanding, influence of outlying values, and lack of uncertainty quantification. Both methods identified patterns that grouped chemicals based on known sources (e.g., diet), behaviors (e.g., personal care product use), or properties (e.g., structure and toxicity). Phthalates and BPA found in food packaging and can linings formed a BN²MF-identified pattern of EDC exposure negatively associated with female child intelligence in the Mothers and Newborns cohort. Results may be used to inform interventions designed to target modifiable behavior or regulations to act on dietary exposure sources.
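The low-rank-plus-sparse split at the heart of PCP can be sketched with simple alternating singular-value and entrywise soft-thresholding. This toy (with hand-tuned thresholds and a simulated exposure matrix) omits the dissertation's non-negativity constraint, missing-data handling, and LOD penalty, and is a heuristic rather than a convergence-guaranteed solver.

```python
import numpy as np

def svt(A, tau):
    # Singular-value soft-thresholding: shrink singular values toward zero.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(A, tau):
    # Entrywise soft-thresholding: small residuals become exactly zero.
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

rng = np.random.default_rng(3)
n, p, r = 200, 20, 3
low_rank = rng.normal(size=(n, r)) @ rng.normal(size=(r, p))          # 3 patterns
spikes = (rng.random((n, p)) < 0.02) * rng.normal(8.0, 1.0, (n, p))   # rare events
M = low_rank + spikes + 0.1 * rng.normal(size=(n, p))

tau_L, tau_S = 5.0, 1.0          # hand-tuned thresholds for this toy example
L, S = np.zeros_like(M), np.zeros_like(M)
for _ in range(50):
    L = svt(M - S, tau_L)        # consistent exposure patterns
    S = soft(M - L, tau_S)       # unique or extreme exposure events

sv = np.linalg.svd(L, compute_uv=False)
print("recovered rank:", int((sv > 1e-8).sum()))
print("fraction flagged as extreme events:", round(float((S != 0).mean()), 3))
```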
APA, Harvard, Vancouver, ISO, and other styles
28

Pook, Torsten. "Methods and software to enhance statistical analysis in large scale problems in breeding and quantitative genetics." Doctoral thesis, 2019. http://hdl.handle.net/21.11130/00-1735-0000-0005-129C-7.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Han, Baoguang. "Statistical analysis of clinical trial data using Monte Carlo methods." Thesis, 2014. http://hdl.handle.net/1805/4650.

Full text
Abstract:
Indiana University-Purdue University Indianapolis (IUPUI)
In medical research, data analysis often requires complex statistical methods where no closed-form solutions are available. Under such circumstances, Monte Carlo (MC) methods have found many applications. In this dissertation, we proposed several novel statistical models where MC methods are utilized. For the first part, we focused on semicompeting risks data in which a non-terminal event was subject to dependent censoring by a terminal event. Based on an illness-death multistate survival model, we proposed flexible random effects models. Further, we extended our model to the setting of joint modeling where both semicompeting risks data and repeated marker data are simultaneously analyzed. Since the proposed methods involve high-dimensional integrations, Bayesian Markov chain Monte Carlo (MCMC) methods were utilized for estimation. The use of Bayesian methods also facilitates the prediction of individual patient outcomes. The proposed methods were demonstrated in both simulation and case studies. For the second part, we focused on the re-randomization test, which is a nonparametric method that makes inferences solely based on the randomization procedure used in clinical trials. With this type of inference, the Monte Carlo method is often used for generating null distributions of the treatment difference. However, an issue was recently discovered when subjects in a clinical trial were randomized with unbalanced treatment allocation to two treatments according to the minimization algorithm, a randomization procedure frequently used in practice. The null distribution of the re-randomization test statistics was found not to be centered at zero, which compromised the power of the test. In this dissertation, we investigated the property of the re-randomization test and proposed a weighted re-randomization method to overcome this issue. The proposed method was demonstrated through extensive simulation studies.
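The Monte Carlo re-randomization logic is easy to sketch for the simple case of 1:1 permutation randomization; the dissertation's setting (minimization with unbalanced allocation) is precisely where this naive null distribution is no longer centered at zero and re-weighting is needed. Data below are simulated.

```python
import numpy as np

rng = np.random.default_rng(4)
# Simulated trial: 50 treated (true effect 0.4) and 50 controls.
y = np.concatenate([rng.normal(0.4, 1, 50), rng.normal(0.0, 1, 50)])
a = np.concatenate([np.ones(50, int), np.zeros(50, int)])   # actual assignment
obs = y[a == 1].mean() - y[a == 0].mean()

# Re-run the randomization scheme many times to build the null distribution.
null = np.empty(5000)
for b in range(5000):
    perm = rng.permutation(a)
    null[b] = y[perm == 1].mean() - y[perm == 0].mean()
pval = (np.abs(null) >= abs(obs)).mean()
print(f"observed difference {obs:.3f}, re-randomization p-value {pval:.4f}")
```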
APA, Harvard, Vancouver, ISO, and other styles
30

Yu, Gary. "Identifying Patterns in Behavioral Public Health Data Using Mixture Modeling with an Informative Number of Repeated Measures." Thesis, 2014. https://doi.org/10.7916/D8F197VX.

Full text
Abstract:
Finite mixture modeling is a useful statistical technique for clustering individuals based on patterns of responses. The fundamental idea of the mixture modeling approach is to assume there are latent clusters of individuals in the population which each generate their own distinct distribution of observations (multivariate or univariate), which are then mixed up together in the full population. Hence, the name mixture comes from the fact that what we observe is a mixture of distributions. The goal of this model-based clustering technique is to identify what the mixture of distributions is so that, given a particular response pattern, individuals can be clustered accordingly. Commonly, finite mixture models, as well as the special case of latent class analysis, are used on data that inherently involve repeated measures. The purpose of this dissertation is to extend the finite mixture model to allow for the number of repeated measures to be incorporated and contribute to the clustering of individuals rather than measures. The dimension of the repeated measures, or simply the count of responses, is assumed to follow a truncated Poisson distribution, and this information can be incorporated into what we call a dimension informative finite mixture model (DIMM). The outline of this dissertation is as follows. Paper 1 is entitled, "Dimension Informative Mixture Modeling (DIMM) for questionnaire data with an informative number of repeated measures." This paper describes the type of data structures considered and introduces the dimension informative mixture model (DIMM). A simulation study is performed to examine how well the DIMM fits the known specified truth. In the first scenario, we specify a mixture of three univariate normal distributions with different means and similar variances with different and similar counts of repeated measurements. We found that the DIMM predicts the true underlying class membership better than the traditional finite mixture model using a predicted value metric score. In the second scenario, we specify a mixture of two univariate normal distributions with the same means and variances with different and similar counts of repeated measurements. We found that the count-informative finite mixture model predicts the truth much better than the non-informative finite mixture model. Paper 2 is entitled, "Patterns of Physical Activity in the Northern Manhattan Study (NOMAS) Using Multivariate Finite Mixture Modeling (MFMM)." This is a study that applies a multivariate finite mixture modeling approach to examining and elucidating underlying latent clusters of different physical activity profiles based on four dimensions: total frequency of activities, average duration per activity, total energy expenditure and the total count of the number of different activities conducted. We found a five-cluster solution to describe the complex patterns of physical activity levels, as measured by fifteen different physical activity items, among a US-based elderly cohort. Adding in a class of individuals who were not doing any physical activity, the labels of these six clusters are: no exercise, very inactive, somewhat inactive, slightly under guidelines, meet guidelines and above guidelines. This methodology improves upon previous work which utilized only the total metabolic equivalent (a proxy of energy expenditure) to classify individuals into inactive, active and highly active.
Paper 3 is entitled, "Complex Drug Use Patterns and Associated HIV Transmission Risk Behaviors in an Internet Sample of US Men Who Have Sex With Men." This is a study that incorporates the count information into a latent class analysis of nineteen binary drug items for drugs consumed within the past year before a sexual encounter. In addition to the individual drugs used, the mixture model incorporated a count of the total number of drugs used. We found a six-class solution, including: low drug use, some recreational drug use, nitrite inhalants (poppers) with prescription erectile dysfunction (ED) drug use, poppers with prescription/non-prescription ED drug use and high polydrug use. Compared to participants in the low drug use class, participants in the highest drug use class were 5.5 times more likely to report unprotected anal intercourse (UAI) in their last sexual encounter and approximately 4 times more likely to report a new sexually transmitted infection (STI) in the past year. Younger men were also less likely to report UAI than older men but more likely to report an STI.
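The core DIMM idea, letting the number of repeated measures itself carry class information through a truncated Poisson term, can be sketched in a few lines. The two illustrative classes below have identical response distributions and differ only in their count rates, mirroring Paper 1's second scenario; equal class priors are assumed and all parameter values are invented.

```python
import numpy as np
from scipy.stats import norm, poisson

def trunc_pois_logpmf(n, theta):
    # Zero-truncated Poisson: P(N=n) = Pois(n; theta) / (1 - exp(-theta)), n >= 1.
    return poisson.logpmf(n, theta) - np.log1p(-np.exp(-theta))

def class_loglik(y, mu, sigma, theta):
    # Class k generates responses y_1..y_n iid Normal(mu, sigma) and the
    # count n itself from a zero-truncated Poisson(theta).
    return norm.logpdf(y, mu, sigma).sum() + trunc_pois_logpmf(len(y), theta)

y_subject = np.array([1.2, 0.7, 1.9])           # three repeated responses
params = [(1.0, 1.0, 3.0), (1.0, 1.0, 8.0)]     # classes differ only in count rate
ll = np.array([class_loglik(y_subject, *p) for p in params])
post = np.exp(ll - ll.max())
post /= post.sum()
print("posterior class probabilities:", np.round(post, 3))
```

With identical response distributions, only the count term separates the classes, which is exactly the information a standard finite mixture model discards.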
APA, Harvard, Vancouver, ISO, and other styles
31

Santika, Truly. "The assessment of factors affecting species distribution model inference and prediction using simulated data." PhD thesis, 2010. http://hdl.handle.net/1885/150462.

Full text
Abstract:
In past decades, a variety of statistical techniques have been used and developed to predict species occurrences over broad geographical areas. These models generally employ correlations between point-location data on species occurrence and environmental predictors from GIS and other mapped data. These models have wide management applications in the context of conservation biology, biogeography and climate change studies. Despite substantial progress, there are a number of critical factors that have a significant impact on the performance of species distribution models, leading to uncertainties in species distribution modelling. These factors originate either from the nature of species and habitat data used to derive the distribution models, or from the modelling methodology. Numerous empirical studies using real species field data have been conducted by various authors to assess how such factors affect species distribution model performance. Comparing models with real data, however, can be problematic due to the lack of knowledge of the process controlling the true distributions of the species. Furthermore, empirical studies have yielded various, sometimes contradictory, recommendations regarding the model used and ways to minimize the impact of certain factors on model performance. From a wildlife management perspective, such conflicting recommendations are not helpful. In contrast, generating simulated data on species distributions has the advantage of providing perfect control over the causal factors of interest. Simulated data provide a way to assess the underlying response of model performance with respect to underlying assumptions, and can guide the inferences obtained from empirical studies in a systematic manner. This thesis constructs a systematic simulation data framework in order to provide an understanding of how various data and methodological factors can affect species distribution model prediction and inference. The data issues examined include the form of species occurrence and environmental dependence, prevalence (i.e. the proportion of observed sites where the species is present), and spatial autocorrelation in species occurrence data and in supporting environmental data. The methodological factors examined include the predictive performance measure, the method for setting the probability threshold used to define species occurrence in the fitted distribution model, and the success of the fitted distribution model in capturing the dominant environmental determinant for the species. The findings are used to explain relationships found by existing studies for real species distribution data. Beyond the key findings described above, the simulation approach presented in this thesis offers a promising tool for testing various aspects of species distribution modelling. Such aspects could include assessment of how constraints on species dispersal can affect model predictive performance, assessment of the sensitivity of model predictive performance to species rarity and sampling prevalence, and assessment of the effect of collinearity in predictive variables on model inference.
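The thesis's simulation logic can be sketched minimally: generate a virtual species whose occurrence follows a known logistic response to one environmental gradient, then check whether a fitted model recovers the true driver at the induced prevalence. The coefficients below are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 2000
env = rng.normal(size=(n, 2))                 # column 0 is the true driver
logit = -1.0 + 2.0 * env[:, 0]                # the intercept tunes prevalence
presence = rng.random(n) < 1 / (1 + np.exp(-logit))
print("simulated prevalence:", round(float(presence.mean()), 3))

fit = LogisticRegression().fit(env, presence)
print("estimated coefficients:", fit.coef_.round(2))   # should weight column 0
```

Because the generating process is fully known, any gap between the fitted and true response can be attributed to the data or methodological factor being manipulated, which is the control that real field data cannot provide.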
APA, Harvard, Vancouver, ISO, and other styles
32

Iddrisu, Abdul-Karim. "Bayesian hierarchical spatial and spatio-temporal modeling and mapping of tuberculosis in Kenya." Thesis, 2013. http://hdl.handle.net/10413/10279.

Full text
Abstract:
Global spread of infectious disease threatens the well-being of human, domestic, and wildlife health. A proper understanding of the global distribution of these diseases is an important part of disease management and policy making. However, data are subject to complexities caused by heterogeneity across host classes and space-time epidemic processes [Waller et al., 1997, Hosseini et al., 2006]. The use of frequentist methods in biostatistics and epidemiology is common, and such methods are therefore extensively utilized in answering varied research questions. In this thesis, we proposed the hierarchical Bayesian approach to study the spatial and the spatio-temporal pattern of tuberculosis in Kenya [Knorr-Held et al., 1998, Knorr-Held, 1999, López-Quílez and Muñoz, 2009, Waller et al., 1997, Besag et al., 1991]. Space-time interaction of risk (ψ_ij) is an important factor considered in this thesis. Markov chain Monte Carlo (MCMC) methods via the WinBUGS and R packages were used for simulations [Ntzoufras, 2011, Congdon, 2010, David et al., 1995, Gimenez et al., 2009, Brian, 2003], and the Deviance Information Criterion (DIC), proposed by [Spiegelhalter et al., 2002], was used for model comparison and selection. Variation in TB risk is observed among Kenyan counties, with clustering among counties with high TB relative risk (RR). HIV prevalence is identified as the dominant determinant of TB. We found clustering and heterogeneity of risk among high-rate counties, and the overall TB risk is slightly decreasing from 2002 to 2009. Interaction of TB relative risk in space and time is found to be increasing among rural counties that share boundaries with urban counties with high TB risk. This is a result of the ability of the models to borrow strength from neighbouring counties, such that nearby counties have similar risk. Although the approaches are less than ideal, we hope that our formulations provide a useful stepping stone in the development of spatial and spatio-temporal methodology for the statistical analysis of risk from TB in Kenya.
Thesis (M.Sc.)-University of KwaZulu-Natal, Pietermaritzburg, 2013.
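For reference, a typical Knorr-Held style decomposition of area- and time-specific relative risk, which models of this kind follow in spirit (the thesis's exact specification may differ), is:

```latex
% Observed counts O_ij and expected counts E_ij in county i, year j
O_{ij} \sim \mathrm{Poisson}(E_{ij}\,\rho_{ij})
% Log relative risk: overall level, structured/unstructured spatial effects,
% structured/unstructured temporal effects, and space-time interaction
\log \rho_{ij} = \alpha + u_i + v_i + \gamma_j + \phi_j + \psi_{ij}
```

Here u_i is a spatially structured (conditional autoregressive) effect, v_i captures unstructured heterogeneity, γ_j and φ_j are the temporal analogues, and ψ_ij is the space-time interaction term highlighted in the abstract.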
APA, Harvard, Vancouver, ISO, and other styles
33

Dail, David (David Andrew). "Conditioning of unobserved period-specific abundances to improve estimation of dynamic populations." Thesis, 2012. http://hdl.handle.net/1957/28224.

Full text
Abstract:
Obtaining accurate estimates of animal abundance is made difficult by the fact that most animal species are detected imperfectly. However, early attempts at building likelihood models that account for unknown detection probability impose a simplifying assumption that is unrealistic for many populations: no births, deaths, immigration or emigration can occur in the population throughout the study (i.e., population closure). In this dissertation, I develop likelihood models that account for unknown detection and do not require assuming population closure. In fact, the proposed models yield a statistical test for population closure. The basic idea utilizes a procedure in three steps: (1) condition the probability of the observed data on the (unobserved) period-specific abundances; (2) multiply this conditional probability by the (prior) likelihood for the period abundances; and (3) remove (via summation) the period-specific abundances from the joint likelihood, leaving the marginal likelihood of the observed data. The utility of this procedure is two-fold: step (1) allows detection probability to be more accurately estimated, and step (2) allows population dynamics such as the entering migration rate and survival probability to be modeled. The main difficulty of this procedure arises in the summation in step (3), although it is greatly simplified by assuming abundances in one period depend only on the immediately preceding period (i.e., abundances have the Markov property). I apply this procedure to form abundance and site occupancy rate estimators both for the setting where observed point counts are available and for the setting where only the presence or absence of an animal species is observed. Although the two settings yield very different likelihood models and estimators, the basic procedure forming these estimators is the same in both.
Graduation date: 2012
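Step (3), summing the unobserved abundances out of the joint likelihood, is easiest to see in the closed-population special case with a single site, sketched below; the open-population model replaces the single sum with Markov-linked sums over each period's abundance. The values of lambda, p, and the counts here are invented.

```python
import numpy as np
from scipy.stats import poisson, binom

def site_marginal_loglik(y, lam, p, n_max=200):
    # Repeated counts y_t ~ Binomial(N, p) given latent abundance N,
    # with prior N ~ Poisson(lam); sum N out over a finite support.
    N = np.arange(max(y), n_max + 1)
    log_prior = poisson.logpmf(N, lam)
    log_obs = sum(binom.logpmf(yt, N, p) for yt in y)
    return np.logaddexp.reduce(log_prior + log_obs)

y = [12, 9, 14]   # three repeated point counts at one site
print(f"log-likelihood at lam=20, p=0.6: {site_marginal_loglik(y, 20, 0.6):.3f}")
```

Maximizing this marginal likelihood over lam and p is what lets detection probability be estimated without observing N directly.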
APA, Harvard, Vancouver, ISO, and other styles
34

Sarwat, Samiha. "Penalized spline modeling of the ex-vivo assays dose-response curves and the HIV-infected patients' bodyweight change." 2015. http://hdl.handle.net/1805/8010.

Full text
Abstract:
Indiana University-Purdue University Indianapolis (IUPUI)
A semi-parametric approach incorporates parametric and nonparametric functions in the model and is very useful in situations when a fully parametric model is inadequate. The objective of this dissertation is to extend statistical methodology employing the semi-parametric modeling approach to analyze data in health science research areas. This dissertation has three parts. The first part discusses the modeling of the dose-response relationship with correlated data by introducing overall drug effects in addition to the deviation of each subject-specific curve from the population average. Here, a penalized spline regression method that allows modeling of the smooth dose-response relationship is applied to data in studies monitoring malaria drug resistance through the ex-vivo assays. The second part of the dissertation extends the SiZer map, which is an exploratory and a powerful visualization tool, to detect underlying significant features (increase, decrease, or no change) of the curve at various smoothing levels. Here, Penalized Spline Significant Zero Crossings of Derivatives (PS-SiZer), using a penalized spline regression, is introduced to investigate significant features in correlated data arising from longitudinal settings. The third part of the dissertation applies the proposed PS-SiZer methodology to analyze HIV data. The durability of significant weight change over a period is explored from the PS-SiZer visualization. PS-SiZer is a graphical tool for exploring structures in curves by mapping areas where the rate of change is significantly increasing, decreasing, or unchanged. PS-SiZer maps provide information about the significant rates of weight change that occur in two ART regimens at various levels of smoothing. A penalized spline regression model at an optimum smoothing level is applied to estimate the first time point where weight no longer increases for different treatment regimens.
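A minimal penalized spline fit of the kind underlying both parts can be sketched with a truncated-line basis and a ridge penalty on the knot coefficients; the basis, knot placement, and smoothing parameter below are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=200)

K = 20
knots = np.quantile(x, np.linspace(0.05, 0.95, K))
# Design: intercept, linear term, and K truncated-line basis functions.
B = np.column_stack([np.ones_like(x), x, np.maximum(x[:, None] - knots, 0)])
P = np.diag([0.0, 0.0] + [1.0] * K)     # penalize only the knot coefficients
lam = 1.0                               # smoothing parameter (illustrative)
coef = np.linalg.solve(B.T @ B + lam * P, B.T @ y)
fitted = B @ coef
print("residual SD:", round(float(np.std(y - fitted)), 3))
```

Varying lam traces out the family of smooths that a SiZer-style map inspects for significantly increasing or decreasing derivatives.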
APA, Harvard, Vancouver, ISO, and other styles
35

"Bootstrap distribution for testing a change in the cox proportional hazard model." 2000. http://library.cuhk.edu.hk/record=b5890302.

Full text
Abstract:
Lam Yuk Fai.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2000.
Includes bibliographical references (leaves 41-43).
Abstracts in English and Chinese.
Table of contents:
Chapter 1: Basic Concepts
1.1 Survival data
1.1.1 An example
1.2 Some important functions
1.2.1 Survival function
1.2.2 Hazard function
1.3 Cox Proportional Hazards Model
1.3.1 A special case
1.3.2 An example (continued)
1.4 Extension of the Cox Proportional Hazards Model
1.5 Bootstrap
Chapter 2: A New Method
2.1 Introduction
2.2 Definition of the test
2.2.1 Our test statistic
2.2.2 The alternative test statistic I
2.2.3 The alternative test statistic II
2.3 Variations of the test
2.3.1 Restricted test
2.3.2 Adjusting for other covariates
2.4 Apply with bootstrap
2.5 Examples
2.5.1 Male mice data
2.5.2 Stanford heart transplant data
2.5.3 CGD data
Chapter 3: Large Sample Properties and Discussions
3.1 Large sample properties and relationship to goodness of fit test
3.1.1 Large sample properties of A and Ap
3.1.2 Large sample properties of Ac and A
3.2 Discussions
APA, Harvard, Vancouver, ISO, and other styles
36

"Optimising aspects of a soybean breeding programme." Thesis, 2008. http://hdl.handle.net/10413/738.

Full text
APA, Harvard, Vancouver, ISO, and other styles
37

Yang, Lili. "Joint models for longitudinal and survival data." Thesis, 2014. http://hdl.handle.net/1805/4666.

Full text
Abstract:
Indiana University-Purdue University Indianapolis (IUPUI)
Epidemiologic and clinical studies routinely collect longitudinal measures of multiple outcomes. These longitudinal outcomes can be used to establish the temporal order of relevant biological processes and their association with the onset of clinical symptoms. In the first part of this thesis, we proposed to use bivariate change point models for two longitudinal outcomes with a focus on estimating the correlation between the two change points. We adopted a Bayesian approach for parameter estimation and inference. In the second part, we considered the situation when a time-to-event outcome is also collected along with multiple longitudinal biomarkers measured until the occurrence of the event or censoring. Joint models for longitudinal and time-to-event data can be used to estimate the association between the characteristics of the longitudinal measures over time and survival time. We developed a maximum-likelihood method to jointly model multiple longitudinal biomarkers and a time-to-event outcome. In addition, we focused on predicting conditional survival probabilities and evaluating the predictive accuracy of multiple longitudinal biomarkers in the joint modeling framework. We assessed the performance of the proposed methods in simulation studies and applied the new methods to data sets from two cohort studies.
National Institutes of Health (NIH) Grants R01 AG019181, R24 MH080827, P30 AG10133, R01 AG09956.
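For orientation, a standard shared-random-effects joint model for biomarker k of subject i, which a multi-marker maximum-likelihood approach of this kind generalizes (the thesis's exact specification may differ), links the two submodels through the smooth biomarker trajectory m_ik(t):

```latex
% Longitudinal submodel: observed biomarker = smooth trajectory + error
y_{ik}(t) = m_{ik}(t) + \varepsilon_{ik}(t)
% Survival submodel: hazard depends on baseline covariates w_i and the
% current true values of all biomarker trajectories
h_i(t) = h_0(t) \exp\Big\{ \gamma^{\top} w_i + \sum_{k} \alpha_k \, m_{ik}(t) \Big\}
```

The association parameters α_k quantify how each biomarker's current level shifts the hazard, which is what makes dynamic, conditional survival prediction possible.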
APA, Harvard, Vancouver, ISO, and other styles
38

Leoce, Nicole Marie. "Prognostic Modeling in the Presence of Competing Risks: an Application to Cardiovascular and Cancer Mortality in Breast Cancer Survivors." Thesis, 2016. https://doi.org/10.7916/D89S1R41.

Full text
Abstract:
Currently, there are an estimated 2.8 million breast cancer survivors in the United States. Due to modern screening practices and raised awareness, the majority of these cases will be diagnosed in the early stages of disease, where highly effective treatment options are available, leading a large proportion of these patients to fail from causes other than breast cancer. The primary cause of death in the United States today is cardiovascular disease, which can be delayed or prevented with interventions such as lifestyle modifications or medications. In order to identify individuals who may be at high risk for a cardiovascular event or cardiovascular mortality, a number of prognostic models have been developed. The majority of these models were developed on populations free of comorbid conditions, utilizing statistical methods that did not account for the competing risks of death from other causes; it is therefore unclear whether they will be generalizable to a cancer population remaining at an increased risk of death from cancer and other causes. Consequently, the purpose of this work is multi-fold. We will first summarize the major statistical methods available for analyzing competing risks data and include a simulation study comparing them. This will be used to inform the interpretation of the real data analysis, which will be conducted on a large, contemporary cohort of breast cancer survivors. For these women, we will categorize the major causes of death, hypothesizing that they will include cardiovascular failure. Next, we will evaluate the existing cardiovascular disease risk models in our population of cancer survivors, and then propose a new model to simultaneously predict a survivor's risk of death due to her breast cancer or due to cardiovascular disease, while accounting for additional competing causes of death. Lastly, model-predicted outcomes will be calculated for the cohort, and evaluation methods will be applied to determine the clinical utility of such a model.
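The consequence of ignoring competing risks that motivates this work can be demonstrated numerically: with constant cause-specific hazards, treating the competing cause as censoring (the naive 1 minus Kaplan-Meier approach) overstates the cause-specific risk relative to the cumulative incidence function. The hazards below are invented.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
t1 = rng.exponential(1 / 0.02, n)    # latent time to cancer death (hazard 0.02)
t2 = rng.exponential(1 / 0.05, n)    # latent time to CVD death (hazard 0.05)
t = np.minimum(t1, t2)
cause = np.where(t1 < t2, 1, 2)

horizon = 10.0
cif1 = np.mean((t <= horizon) & (cause == 1))   # true cumulative incidence, cause 1
naive = 1 - np.exp(-0.02 * horizon)             # "censor the competing cause"
print(f"10-year cause-1 risk: CIF {cif1:.3f} vs naive {naive:.3f}")
```

Here the naive calculation treats subjects removed by the competing cause as if they were still at risk, inflating the estimated 10-year risk by several percentage points.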
APA, Harvard, Vancouver, ISO, and other styles
39

Cho, Sylvia. "Data Quality Assessment for the Secondary Use of Person-Generated Wearable Device Data: Assessing Self-Tracking Data for Research Purposes." Thesis, 2021. https://doi.org/10.7916/d8-jcmb-gw93.

Full text
Abstract:
The Quantified Self movement has led to increased routine use of consumer wearables, generating large amounts of person-generated wearable device data. This has become an opportunity for researchers to conduct research with large-scale person-generated wearable device data without having to collect data in a costly and time-consuming way. However, there are known challenges of wearable device data, such as missing or inaccurate data, which raises the need to assess the quality of data before conducting research. Currently, there is a lack of in-depth understanding of the data quality challenges of using person-generated wearable device data for research purposes, and of how data quality assessment should be conducted. Data quality assessment could be especially a burden to those without domain knowledge of a specific data type, which might be the case for emerging biomedical data sources. The goal of this dissertation is to advance knowledge on the data quality challenges and assessment of person-generated wearable device data and to facilitate data quality assessment for those without domain knowledge of the emerging data type. The dissertation consists of two aims: (1) identifying data quality dimensions important for assessing the quality of person-generated wearable device data for research purposes, and (2) designing and evaluating an interactive data quality characterization tool that supports researchers in assessing the fitness-for-use of fitness tracker data. In the first aim, a multi-method approach was taken, involving a literature review, a survey, and focus group discussion sessions. We found that intrinsic data quality dimensions applicable to electronic health record data, such as conformance, completeness, and plausibility, are applicable to person-generated wearable device data. In addition, contextual/fitness-for-use dimensions such as breadth and density completeness, and temporal data granularity were identified, given that our focus was on assessing data quality for research purposes. In the second aim, we followed an iterative design process from understanding informational needs to designing a prototype and evaluating the usability of the final version of the tool. The tool allows users to customize the definition of data completeness (fitness-for-use measures) and provides data summarization for the cohort that meets that definition. We found that an interactive tool that incorporates fitness-for-use measures and allows customization of data completeness can support fitness-for-use assessment more accurately and in less time than a tool that only presents information on intrinsic data quality measures.
APA, Harvard, Vancouver, ISO, and other styles
40

Лось, Сергій Леонідович. "Алгоритмізація процесу обробки результатів емпіричного дослідження біологічних процесів методами математичної статистики" [Algorithmization of the processing of empirical research results on biological processes using methods of mathematical statistics]. Master's thesis, 2020. https://dspace.znu.edu.ua/jspui/handle/12345/5109.

Full text
Abstract:
Los S. L. Algorithmization of the processing of empirical research results on biological processes using methods of mathematical statistics: master's qualification thesis, specialty 113 "Applied Mathematics" / supervisor V. V. Leontieva. Zaporizhzhia: ZNU, 2020. 58 p.
The work is presented on 58 pages of printed text and contains 13 figures, 9 tables, and 26 references. Object of study: biological processes. Subject of study: mathematical modeling and optimization of biological processes by methods of mathematical statistics. Research method: analytical. Aim of the work: to study living organisms within biological processes in their interaction with a complex of environmental factors; to consider the question of conformity to the normal probability distribution; to check whether the obtained data follow the normal distribution law; to compute the main statistical characteristics from temperature data; and to calculate probabilistic characteristics of the weather. The work covers the basic concepts and characteristics of mathematical statistics, medical and biological research and its statistical processing, data collection and calculation of the main statistical characteristics, processing of the obtained results, and forecasting of the subsequent meteorological situation.
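The workflow the thesis describes (summary statistics of temperature observations, then a check against the normal distribution law) can be sketched in a few lines of Python with scipy; the temperature values below are made up for illustration.

```python
# A minimal sketch of the thesis workflow: basic statistical
# characteristics of temperature data, then a normality test.
# The observations below are invented for illustration.
import numpy as np
from scipy import stats

temps = np.array([12.1, 13.4, 11.8, 14.2, 12.9, 13.7, 12.5, 13.1, 14.0, 12.3])

# Basic statistical characteristics: mean, variance, skewness, kurtosis.
print(stats.describe(temps))

# Shapiro-Wilk test for consistency with the normal distribution law;
# a large p-value gives no evidence against normality.
stat, p = stats.shapiro(temps)
print(f"W = {stat:.3f}, p = {p:.3f}")
```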
APA, Harvard, Vancouver, ISO, and other styles
41

Li, Zhuokai. "Multivariate semiparametric regression models for longitudinal data." Thesis, 2014. http://hdl.handle.net/1805/6462.

Full text
Abstract:
Multiple-outcome longitudinal data are abundant in clinical investigations. For example, infections with different pathogenic organisms are often tested concurrently, and assessments are usually taken repeatedly over time. It is therefore natural to consider a multivariate modeling approach to accommodate the underlying interrelationship among the multiple longitudinally measured outcomes. This dissertation proposes a multivariate semiparametric modeling framework for such data. Relevant estimation and inference procedures as well as model selection tools are discussed within this modeling framework. The first part of this research focuses on the analytical issues concerning binary data. The second part extends the binary model to the more general setting of data from the exponential family of distributions. The proposed model accounts for the correlations across the outcomes as well as the temporal dependency among the repeated measures of each outcome within an individual. An important feature of the proposed model is the addition of a bivariate smooth function for the depiction of concurrent nonlinear and possibly interacting influences of two independent variables on each outcome. For model implementation, a general approach for parameter estimation is developed using the maximum penalized likelihood method. For statistical inference, a likelihood-based resampling procedure is proposed to compare the bivariate nonlinear effect surfaces across the outcomes. The final part of the dissertation presents a variable selection tool to facilitate model development in practical data analysis. Using the adaptive least absolute shrinkage and selection operator (LASSO) penalty, the variable selection tool simultaneously identifies important fixed effects and random effects, determines the correlation structure of the outcomes, and selects the interaction effects in the bivariate smooth functions. Model selection and estimation are performed through a two-stage procedure based on an expectation-maximization (EM) algorithm. Simulation studies are conducted to evaluate the performance of the proposed methods. The utility of the methods is demonstrated through several clinical applications.
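For readers unfamiliar with the adaptive LASSO penalty the abstract invokes, here is a hedged sketch of the idea in its simplest linear-model form: rescale each column by an initial coefficient estimate so the L1 penalty shrinks weak effects harder. This is a generic illustration, not the dissertation's mixed-model selection procedure.

```python
# Adaptive LASSO via reparametrization: solving the ordinary LASSO on the
# reweighted design X * w, with w = |initial coefficients|, is equivalent
# to penalizing sum(|beta_j| / w_j). Data and tuning value are invented.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=200)  # only two true effects

# Step 1: initial (unpenalized) estimates provide the adaptive weights.
beta_init = LinearRegression().fit(X, y).coef_
w = np.abs(beta_init)

# Step 2: ordinary LASSO on the reweighted design, then rescale back.
lasso = Lasso(alpha=0.1).fit(X * w, y)
beta_adaptive = lasso.coef_ * w
print(np.round(beta_adaptive, 2))  # near-zero entries are deselected
```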
APA, Harvard, Vancouver, ISO, and other styles
42

Reger, Michael Kent. "Dietary intake and urinary excretion of phytoestrogens in relation to cancer and cardiovascular disease." Thesis, 2014. http://hdl.handle.net/1805/6053.

Full text
Abstract:
Phytoestrogens, which abound in soy products, legumes, and chickpeas, can induce biologic responses in animals and humans due to their structural similarity to 17β-estradiol. Although experimental studies suggest that phytoestrogen intake may alter the risk of cancer and cardiovascular disease, few epidemiologic studies have investigated this research question. This dissertation investigated the associations of intake of total and individual phytoestrogens and their urinary biomarkers with these chronic conditions using data previously collected from two US national cohort studies (NHANES and PLCO). Utilizing NHANES data with urinary phytoestrogen concentrations and follow-up mortality, Cox proportional hazards regressions were performed to estimate hazard ratios (HR) and 95% confidence intervals (CI) for the associations of urinary phytoestrogens with total cancer, cardiovascular disease, and all-cause mortality. After adjustment for confounders, it was found that higher concentrations of lignans were associated with a reduced risk of death from cardiovascular disease (0.48; 0.24-0.97), whereas higher concentrations of isoflavones (2.14; 1.03-4.47) and daidzein (2.05; 1.02-4.11) were associated with an increased risk. A reduction in all-cause mortality was observed for elevated concentrations of lignans (0.65; 0.43-0.96) and enterolactone (0.65; 0.44-0.97). Utilizing PLCO data on dietary phytoestrogens, Cox proportional hazards regression examined the associations between dietary phytoestrogens and the risk of prostate cancer incidence. After adjustment for confounders, a positive association was found between dietary intake of isoflavones (1.58; 1.11-2.24), genistein (1.42; 1.02-1.98), daidzein (1.62; 1.13-2.32), and glycitein (1.53; 1.09-2.15) and the risk of advanced prostate cancer. Conversely, an inverse association existed between dietary intake of genistein and the risk of non-advanced prostate cancer (0.88; 0.78-0.99) and total prostate cancer (0.90; 0.81-1.00). C-reactive protein (CRP) concentrations rise in response to inflammation, and higher levels have been reported in epidemiologic studies as a risk factor for some cancers and cardiovascular disease. Logistic regression performed on NHANES data evaluated the association between CRP and urinary phytoestrogen concentrations; higher concentrations of total and individual phytoestrogens were associated with lower concentrations of CRP. In summary, dietary intake of some phytoestrogens significantly modulates prostate cancer risk and cardiovascular disease mortality. It is possible that these associations may be mediated in part through the influence of phytoestrogen intake on circulating levels of C-reactive protein.
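The Cox proportional hazards analysis reported above, with hazard ratios and 95% confidence intervals, can be sketched with the Python lifelines library; the dataframe, exposure, and covariate names below are illustrative assumptions, not the study's actual NHANES variables.

```python
# A minimal sketch of a Cox proportional hazards model reporting HR and
# 95% CI, as in the abstract. The cohort and variables are hypothetical.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "followup_years": [4.2, 8.1, 6.5, 3.9, 7.7, 5.3, 9.0, 2.8],
    "died_cvd":       [0,   1,   0,   1,   0,   1,   0,   1],
    "log_lignans":    [1.2, 0.3, 1.8, 0.2, 1.5, 0.6, 2.0, 0.1],
    "age":            [52,  67,  58,  71,  49,  63,  55,  69],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="followup_years", event_col="died_cvd")

# Hazard ratios with 95% confidence intervals, in the abstract's format.
print(cph.summary[["exp(coef)", "exp(coef) lower 95%", "exp(coef) upper 95%"]])
```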
APA, Harvard, Vancouver, ISO, and other styles