Dissertations / Theses: 'Computational inference method'

1

Bergmair, Richard. "Monte Carlo semantics : robust inference and logical pattern processing with natural language text." Thesis, University of Cambridge, 2011. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.609713.

2

Guo, Wenbin. "Computational analysis and method development for high throughput transcriptomics and transcriptional regulatory inference in plants." Thesis, University of Dundee, 2018. https://discovery.dundee.ac.uk/en/studentTheses/3f14dd8e-0c6c-4b46-adb0-bbb10b0cbe19.

Abstract:

RNA sequencing (RNA-seq) technologies facilitate the characterisation of genes and transcripts in different cell types as well as their expression analysis across various conditions. Due to its ability to provide in-depth insights into transcription and post-transcription mechanisms, RNA-seq has been extensively used in functional genetics and transcriptomics, systems biology and developmental biology in animals, plants, diseases, etc. The aim of this project is to use mathematical and computational models to integrate big genomic and transcriptomic data from high-throughput technologies in plant biology and develop new methods to identify which genes or transcripts have significant expression variation across experimental conditions of interest, then to interpret the regulatory causalities of these expression changes by distinguishing the effects from the transcription and alternative splicing. We performed a high resolution ultra-deep RNA-seq time-course experiment to study Arabidopsis in response to cold treatment where plants were grown at 20 °C and then the temperature was reduced to 4 °C. We have developed a high quality Arabidopsis thaliana Reference Transcript Dataset (AtRTD2) transcriptome for accurate transcript and gene quantification. This high quality time-series dataset was used as the benchmark for novel method development and downstream expression analysis. The main outcomes of this project include three parts. i) A pipeline for differential expression (DE) and differential alternative splicing (DAS) analysis at both gene and transcript levels. Firstly, we implemented data pre-processing to reduce the noise/low expression, batch effects and technical biases of read counts. Then we used the limma-voom pipeline to compare the expression at corresponding time-points of 4 °C to the time-points of 20 °C. We identified 8,949 genes with altered expression of which 2,442 showed significant DAS and 1,647 were only regulated by AS. 
Compared with current publications, 3,039 of these genes were novel cold-responsive genes. In addition, we identified 4,008 differential transcript usage (DTU) transcripts of which the expression changes were significantly different to their cognate DAS genes. ii) A TSIS R package for time-series transcript isoform switch (IS) analysis was developed. IS refers to the time-points when a pair of transcript isoforms from the same gene reverse their relative expression abundances. By using a five metric scheme to evaluate robustly the qualities of each switch point, we identified 892 significant ISs between the high abundance transcripts in the DAS genes and about 57% of these switches occurred very rapidly between 0-6 h following transfer to 4 °C. iii) An RLowPC R package for co-expression network construction was generated. The RLowPC method uses a two-step approach to select the high-confidence edges first by reducing the search space by only picking the top ranked genes from an initial partial correlation analysis, and then computes the partial correlations in the confined search space by only removing the linear dependencies from the shared neighbours, largely ignoring the genes showing lower association. In future work, we will construct dynamic transcriptional and AS regulatory networks to interpret the causalities of DE and DAS. We will study the coupling and de-coupling of expression rhythmicity to the Arabidopsis circadian clock in response to cold. We will develop new methods to improve the statistical power of expression comparative analysis, such as by taking into account the missing values of expression and by distinguishing the technical and biological variabilities.
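
The abstract above defines an isoform switch as a time-point at which two isoforms of the same gene reverse their relative abundances. A minimal Python sketch of that idea follows. It is not the TSIS package and does not implement its five-metric scoring; the function name `switch_points`, the linear interpolation between sampled time-points, and the toy abundance values are all assumptions made for illustration.

```python
def switch_points(times, iso_a, iso_b):
    """Return interpolated times at which two isoform abundance curves cross.

    Hypothetical helper: a switch is reported wherever the sign of
    (iso_a - iso_b) changes between two consecutive time-points.
    """
    crossings = []
    for k in range(len(times) - 1):
        d0 = iso_a[k] - iso_b[k]
        d1 = iso_a[k + 1] - iso_b[k + 1]
        if d0 * d1 < 0:  # strict sign change: relative abundance reversed
            frac = d0 / (d0 - d1)  # linear interpolation of the crossing
            crossings.append(times[k] + frac * (times[k + 1] - times[k]))
    return crossings


# Toy data (invented): isoform A falls while isoform B rises.
times = [0, 3, 6, 9]
iso_a = [10.0, 8.0, 4.0, 2.0]
iso_b = [2.0, 4.0, 8.0, 10.0]
print(switch_points(times, iso_a, iso_b))  # [4.5]
```

A real analysis would also need to judge whether each crossing is significant, which is the role the abstract assigns to the five metrics.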

3

Strid, Ingvar. "Computational methods for Bayesian inference in macroeconomic models." Doctoral thesis, Handelshögskolan i Stockholm, Ekonomisk Statistik (ES), 2010. http://urn.kb.se/resolve?urn=urn:nbn:se:hhs:diva-1118.

Abstract:

The New Macroeconometrics may succinctly be described as the application of Bayesian analysis to the class of macroeconomic models called Dynamic Stochastic General Equilibrium (DSGE) models. A prominent local example from this research area is the development and estimation of the RAMSES model, the main macroeconomic model in use at Sveriges Riksbank. Bayesian estimation of DSGE models is often computationally demanding. In this thesis fast algorithms for Bayesian inference are developed and tested in the context of the state space model framework implied by DSGE models. The algorithms discussed in the thesis deal with evaluation of the DSGE model likelihood function and sampling from the posterior distribution. Block Kalman filter algorithms are suggested for likelihood evaluation in large linearised DSGE models. Parallel particle filter algorithms are presented for likelihood evaluation in nonlinearly approximated DSGE models. Prefetching random walk Metropolis algorithms and adaptive hybrid sampling algorithms are suggested for posterior sampling. The generality of the algorithms, however, suggests that they should be of interest also outside the realm of macroeconometrics.

4

Warne, David James. "Computational inference in mathematical biology: Methodological developments and applications." Thesis, Queensland University of Technology, 2020. https://eprints.qut.edu.au/202835/1/David_Warne_Thesis.pdf.

Abstract:

Complexity in living organisms occurs on multiple spatial and temporal scales. The function of tissues depends on interactions of cells, and in turn, cell dynamics depends on intercellular and intracellular biochemical networks. A diverse range of mathematical modelling frameworks are applied in quantitative biology. Effective application of models in practice depends upon reliable statistical inference methods for experimental design, model calibration and model selection. In this thesis, new results are obtained for quantification of contact inhibition and cell motility mechanisms in prostate cancer cells, and novel computationally efficient inference algorithms suited for the study of biochemical systems are developed.

5

Dahlin, Johan. "Accelerating Monte Carlo methods for Bayesian inference in dynamical models." Doctoral thesis, Linköpings universitet, Reglerteknik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-125992.

Abstract:

Making decisions and predictions from noisy observations are two important and challenging problems in many areas of society. Some examples of applications are recommendation systems for online shopping and streaming services, connecting genes with certain diseases and modelling climate change. In this thesis, we make use of Bayesian statistics to construct probabilistic models given prior information and historical data, which can be used for decision support and predictions. The main obstacle with this approach is that it often results in mathematical problems lacking analytical solutions. To cope with this, we make use of statistical simulation algorithms known as Monte Carlo methods to approximate the intractable solution. These methods enjoy well-understood statistical properties but are often computationally prohibitive to employ. The main contribution of this thesis is the exploration of different strategies for accelerating inference methods based on sequential Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC). That is, strategies for reducing the computational effort while keeping or improving the accuracy. A major part of the thesis is devoted to proposing such strategies for the MCMC method known as the particle Metropolis-Hastings (PMH) algorithm. We investigate two strategies: (i) introducing estimates of the gradient and Hessian of the target to better tailor the algorithm to the problem and (ii) introducing a positive correlation between the point-wise estimates of the target. Furthermore, we propose an algorithm based on the combination of SMC and Gaussian process optimisation, which can provide reasonable estimates of the posterior but with a significant decrease in computational effort compared with PMH. Moreover, we explore the use of sparseness priors for approximate inference in over-parametrised mixed effects models and autoregressive processes. This can potentially be a practical strategy for inference in the big data era. 
Finally, we propose a general method for increasing the accuracy of the parameter estimates in non-linear state space models by applying a designed input signal.
Should the Riksbank raise or lower the repo rate at its next meeting in order to reach the inflation target? Which genes are associated with a particular disease? How can Netflix and Spotify know which films and which music I will want next? These three problems are examples of questions where statistical models can be useful in providing support and a basis for decisions. Statistical models combine theoretical knowledge of, for example, the Swedish economic system with historical data to produce forecasts of future events. These forecasts can then be used to evaluate, for example, what would happen to inflation in Sweden if unemployment falls, or how the value of my pension savings changes when the Stockholm stock exchange crashes. Applications such as these, and many others, make statistical models important for many parts of society. One way of developing statistical models is based on continuously updating a model as more information is collected. This approach is called Bayesian statistics and is particularly useful when one already has good insight into the model or has access to only a small amount of historical data for building the model. A drawback of Bayesian statistics is that the computations required to update the model with the new information are often very complicated. In such situations, one can instead simulate the outcome from millions of variants of the model and then compare these against the historical observations at hand. One can then average over the variants that gave the best results in order to arrive at a final model. It can therefore sometimes take days or weeks to produce a model. The problem becomes particularly severe when using more advanced models that could give better forecasts but take too long to build. In this thesis, we use a number of different strategies to facilitate or improve these simulations. 
We propose, for example, taking more insights about the system into account and thereby reducing the number of model variants that need to be examined. We can thus rule out certain models from the start, since we have a good idea of roughly what a good model should look like. We can also change the simulation so that it moves more easily between different types of models. In this way, the space of all possible models is explored more efficiently. We propose a number of combinations and modifications of existing methods to speed up the fitting of the model to the observations. We show that the computation time can in some cases be reduced from a few days to about an hour. Hopefully, this will in the future make it possible to use more advanced models in practice, which in turn will result in better forecasts and decisions.

6

Lienart, Thibaut. "Inference on Markov random fields : methods and applications." Thesis, University of Oxford, 2017. http://ora.ox.ac.uk/objects/uuid:3095b14c-98fb-4bda-affc-a1fa1708f628.

Abstract:

This thesis considers the problem of performing inference on undirected graphical models with continuous state spaces. These models represent conditional independence structures that can appear in the context of Bayesian Machine Learning. In the thesis, we focus on computational methods and applications. The aim of the thesis is to demonstrate that the factorisation structure corresponding to the conditional independence structure present in high-dimensional models can be exploited to decrease the computational complexity of inference algorithms. First, we consider the smoothing problem on Hidden Markov Models (HMMs) and discuss novel algorithms that have sub-quadratic computational complexity in the number of particles used. We show they perform on par with existing state-of-the-art algorithms with a quadratic complexity. Further, a novel class of rejection-free samplers for graphical models known as the Local Bouncy Particle Sampler (LBPS) is explored and applied on a very large instance of the Probabilistic Matrix Factorisation (PMF) problem. We show the method performs slightly better than Hamiltonian Monte Carlo (HMC) methods. It is also the first such practical application of the method to a statistical model with hundreds of thousands of dimensions. In the second part of the thesis, we consider approximate Bayesian inference methods and in particular the Expectation Propagation (EP) algorithm. We show it can be applied as the backbone of a novel distributed Bayesian inference mechanism. Further, we discuss novel variants of the EP algorithm and show that a specific type of update mechanism, analogous to the mirror descent algorithm, outperforms all existing variants and is robust to Monte Carlo noise. Lastly, we show that EP can be used to help the Particle Belief Propagation (PBP) algorithm form cheap and adaptive proposals and significantly outperform classical PBP.

7

Wang, Tengyao. "Spectral methods and computational trade-offs in high-dimensional statistical inference." Thesis, University of Cambridge, 2016. https://www.repository.cam.ac.uk/handle/1810/260825.

Abstract:

Spectral methods have become increasingly popular in designing fast algorithms for modern high-dimensional datasets. This thesis looks at several problems in which spectral methods play a central role. In some cases, we also show that such procedures have essentially the best performance among all randomised polynomial time algorithms by exhibiting statistical and computational trade-offs in those problems. In the first chapter, we prove a useful variant of the well-known Davis-Kahan theorem, which is a spectral perturbation result that allows us to bound the distance between population eigenspaces and their sample versions. We then propose a semi-definite programming algorithm for the sparse principal component analysis (PCA) problem, and analyse its theoretical performance using the perturbation bounds we derived earlier. It turns out that the parameter regime in which our estimator is consistent is strictly smaller than the consistency regime of a minimax optimal (yet computationally intractable) estimator. We show through reduction from a well-known hard problem in computational complexity theory that the difference in consistency regimes is unavoidable for any randomised polynomial time estimator, hence revealing subtle statistical and computational trade-offs in this problem. Such computational trade-offs also exist in the problem of restricted isometry certification. Certifiers for restricted isometry properties can be used to construct design matrices for sparse linear regression problems. Similar to the sparse PCA problem, we show that there is also an intrinsic gap between the class of matrices certifiable using unrestricted algorithms and using polynomial time algorithms. Finally, we consider the problem of high-dimensional changepoint estimation, where we estimate the time of change in the mean of a high-dimensional time series with piecewise constant mean structure. 
Motivated by real world applications, we assume that changes only occur in a sparse subset of all coordinates. We apply a variant of the semi-definite programming algorithm in sparse PCA to aggregate the signals across different coordinates in a near optimal way so as to estimate the changepoint location as accurately as possible. Our statistical procedure shows superior performance compared to existing methods in this problem.

8

Pardo, Jérémie. "Méthodes d'inférence de cibles thérapeutiques et de séquences de traitement." Electronic Thesis or Diss., université Paris-Saclay, 2022. http://www.theses.fr/2022UPASG011.

Abstract:

A major challenge in network medicine is identifying the molecular perturbations induced by complex diseases and therapies in order to achieve cellular reprogramming. The reprogramming action is the result of applying a control. In this thesis, we extend the single control of biological networks by studying the sequential control of Boolean networks. We present a new theoretical framework for the formal study of control sequences. We consider control by freezing nodes: a variable of the Boolean network can be fixed to the value 0, fixed to 1, or released from control. We define a model of controlled dynamics for the synchronous update mode in which the control is modified only at a stable state. We call CoFaSe the inference problem of finding a control sequence that modifies the dynamics so that it evolves towards a desired property or state. The networks to which CoFaSe is applied always have a set of uncontrollable variables. We show that this problem is PSPACE-hard. Studying the dynamical characteristics of the CoFaSe problem, we observed that the dynamical properties that make a control sequence necessary emerge from the update functions of the uncontrollable variables. We find that the length of a minimal control sequence cannot exceed twice the number of profiles of the uncontrollable variables. From this result, we built two algorithms that infer minimal control sequences under synchronous dynamics. Finally, studying the interdependencies between sequential control and the topology of the interaction graph of the Boolean network allowed us to uncover relationships between structure and control. These reveal a tighter upper bound for certain topologies than the bounds obtained from the study of the dynamics. 
The study of the topology highlights the importance of non-negative cycles in the interaction graph for the emergence of minimal control sequences of size greater than or equal to two.
Network controllability is a major challenge in network medicine. It consists in finding a way to rewire molecular networks to reprogram the cell fate. The reprogramming action is typically represented as the action of a control. In this thesis, we extended the single control action method by investigating the sequential control of Boolean networks. We present a theoretical framework for the formal study of control sequences. We consider freeze controls, under which the variables can only be frozen to 0, 1 or unfrozen. We define a model of controlled dynamics where the modification of the control only occurs at a stable state in the synchronous update mode. We refer to the inference problem of finding a control sequence modifying the dynamics to evolve towards a desired state or property as CoFaSe. Under this problem, a set of variables are uncontrollable. We prove that this problem is PSPACE-hard. We know from the complexity of CoFaSe that finding a minimal sequence of control by exhaustively exploring all possible control sequences is not practically tractable. By studying the dynamical properties of the CoFaSe problem, we found that the dynamical properties that imply the necessity of a sequence of control emerge from the update functions of uncontrollable variables. We found that the length of a minimal control sequence cannot be larger than twice the number of profiles of uncontrollable variables. From this result, we built two algorithms inferring minimal control sequences under synchronous dynamics. Finally, the study of the interdependencies between sequential control and the topology of the interaction graph of the Boolean network allowed us to investigate the causal relationships that exist between structure and control. Furthermore, accounting for the topological properties of the network gives additional tools for tightening the upper bounds on sequence length. 
This work sheds light on the key importance of non-negative cycles in the interaction graph for the emergence of minimal sequences of control of size greater than or equal to two.
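
The controlled dynamics described in the abstract above (freeze controls, synchronous updates, control changes only at a stable state) can be illustrated with a small Python simulation. This is a sketch under stated assumptions, not the thesis's algorithms: it only simulates a given control sequence and does not search for a minimal one or solve CoFaSe. The function names, the representation of a control as a dict from variable index to frozen value, and the three-variable toy network are all hypothetical.

```python
def step(state, update_fns, control):
    """One synchronous update; frozen variables keep their controlled value."""
    new = tuple(f(state) for f in update_fns)
    return tuple(control.get(i, v) for i, v in enumerate(new))


def run_to_stable(state, update_fns, control, max_steps=100):
    """Impose the control, then iterate to a fixed point (None if none is found)."""
    state = tuple(control.get(i, v) for i, v in enumerate(state))
    for _ in range(max_steps):
        nxt = step(state, update_fns, control)
        if nxt == state:
            return state
        state = nxt
    return None


def apply_sequence(state, update_fns, controls):
    """Apply controls in order, switching only once a stable state is reached."""
    for control in controls:
        state = run_to_stable(state, update_fns, control)
        if state is None:
            return None
    return state


# Toy network (invented): x0' = x0, x1' = x0 OR x1, x2' = x1.
fns = [lambda s: s[0], lambda s: s[0] | s[1], lambda s: s[1]]
# Freeze x0 to 1 until stable, then release every control.
print(apply_sequence((0, 0, 0), fns, [{0: 1}, {}]))  # (1, 1, 1)
```

Without any control the toy network stays at (0, 0, 0), so the freeze is what moves it to the other stable state.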

9

Angulo, Rafael Villa. "Computational methods for haplotype inference with application to haplotype block characterization in cattle." Fairfax, VA : George Mason University, 2009. http://hdl.handle.net/1920/4558.

Abstract:

Thesis (Ph.D.)--George Mason University, 2009.
Vita: p. 123. Thesis director: John J. Grefenstette. Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Bioinformatics and Computational Biology. Title from PDF t.p. (viewed Sept. 8, 2009). Includes bibliographical references (p. 114-122). Also issued in print.

10

Ruli, Erlis. "Recent Advances in Approximate Bayesian Computation Methods." Doctoral thesis, Università degli studi di Padova, 2014. http://hdl.handle.net/11577/3423529.

Abstract:

The Bayesian approach to statistical inference is fundamentally probabilistic. Exploiting the internal consistency of the probability framework, the posterior distribution extracts the relevant information in the data, and provides a complete and coherent summary of post-data uncertainty. However, summarising the posterior distribution often requires the calculation of awkward multidimensional integrals. A further complication with the Bayesian approach arises when the likelihood function is unavailable. In this respect, promising advances have been made by the theory of Approximate Bayesian Computations (ABC). This thesis focuses on computational methods for the approximation of posterior distributions, and it discusses six original contributions. The first contribution concerns the approximation of marginal posterior distributions for scalar parameters. By combining higher-order tail area approximation with inverse transform sampling, we define the HOTA algorithm, which draws independent random samples from the approximate marginal posterior. The second discusses the HOTA algorithm with pseudo-posterior distributions, e.g., posterior distributions obtained by the combination of a pseudo-likelihood with a prior within Bayes' rule. The third contribution extends the use of tail-area approximations to contexts with multidimensional parameters, and proposes a method which gives approximate Bayesian credible regions with good sampling coverage properties. The fourth presents an improved Laplace approximation which can be used for computing marginal likelihoods. The fifth contribution discusses a model-based procedure for choosing good summary statistics for ABC, by using composite score functions. Lastly, the sixth contribution discusses the choice of a default proposal distribution for ABC that is based on the notion of quasi-likelihood.
The Bayesian approach to statistical inference is fundamentally probabilistic. Through the calculus of probability, the posterior distribution extracts the relevant information provided by the data and produces a complete and coherent description of uncertainty conditional on the observed data. However, describing the posterior distribution often requires computing complicated multivariate integrals. A further difficulty of the Bayesian approach concerns the likelihood function and arises when the latter is mathematically or computationally intractable. In this direction, notable progress has been made by the so-called theory of Approximate Bayesian Computations (ABC). This thesis focuses on computational methods for approximating the posterior distribution and proposes six original contributions. The first contribution concerns the approximation of the marginal posterior distribution for a scalar parameter. Combining the higher-order tail-area approximation with the inverse transform simulation method yields the algorithm called HOTA, which can be used to simulate independently from an approximation of the posterior distribution. The second contribution aims to extend the use of the HOTA algorithm to pseudo-posterior distributions, that is, posterior distributions obtained by combining a pseudo-likelihood with a prior through Bayes' theorem. The third contribution extends the use of the tail-area approximation to settings with multidimensional parameters and proposes a method for computing credible regions that have good frequentist coverage properties. The fourth contribution presents a third-order Laplace approximation for computing the marginal likelihood. 
The fifth contribution focuses on the choice of summary statistics for ABC and proposes a parametric method, based on the composite score function, for choosing such statistics. Finally, the last contribution focuses on the choice of a default proposal distribution for ABC algorithms, where the procedure for deriving this distribution is based on the notion of quasi-likelihood.
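
The abstract above builds HOTA on inverse transform sampling: draw a uniform number u and solve F(x) = u for x, where F is a cumulative distribution function. The Python sketch below shows only that generic step. It is not the HOTA algorithm: the higher-order tail area approximation is replaced by the exact standard normal CDF as a stand-in, and the function names, the bisection solver, and the bracketing interval are assumptions made for illustration.

```python
import math
import random


def inverse_transform_sample(cdf, lo, hi, n, rng, tol=1e-8):
    """Draw n independent samples by solving cdf(x) = u with bisection on [lo, hi]."""
    samples = []
    for _ in range(n):
        u = rng.random()
        a, b = lo, hi
        while b - a > tol:
            mid = 0.5 * (a + b)
            if cdf(mid) < u:
                a = mid  # the solution lies to the right of mid
            else:
                b = mid  # the solution lies at or to the left of mid
        samples.append(0.5 * (a + b))
    return samples


def normal_cdf(x):
    """Stand-in CDF (assumption): exact standard normal, not a tail area approximation."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


rng = random.Random(1)
draws = inverse_transform_sample(normal_cdf, -10.0, 10.0, 2000, rng)
print(sum(draws) / len(draws))  # sample mean, expected to be close to 0
```

Because each draw uses its own uniform number, the samples are independent, which is the property the abstract highlights.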

11

Raynal, Louis. "Bayesian statistical inference for intractable likelihood models." Thesis, Montpellier, 2019. http://www.theses.fr/2019MONTS035/document.

Abstract:

In a statistical inference process, when the likelihood function associated with the observed data cannot be computed, approximations are required. This situation arises very frequently in certain application fields, notably for population genetics models. To address this difficulty, we focus on approximate Bayesian computation (ABC) methods, which rely solely on simulating data that are then summarised and compared with the observed data. These comparisons require a judicious choice of a distance, a similarity threshold and a set of relevant, low-dimensional summary statistics. In a parameter inference context, we propose an approach that combines ABC simulations with the machine learning method of random forests. We use various strategies to approximate posterior quantities of interest on the parameters. Our proposal avoids the tuning problems associated with ABC while providing good results as well as interpretation tools for practitioners. We further introduce posterior measures of prediction error (that is, conditional on the observed data of interest) computed with forests. For model choice problems, we present a strategy based on groupings of models that makes it possible, in population genetics, to determine which of the events making up an evolutionary scenario are more or less well identified. All these approaches are implemented in the R library abcrf. In addition, we explore ways of building so-called local random forests, which take the observation to be predicted into account during their training phase in order to provide a better prediction. 
Finally, we present two case studies that benefited from our developments, concerning the reconstruction of the evolutionary history of Pygmy populations and of two subspecies of the desert locust Schistocerca gregaria.
In a statistical inferential process, when the calculation of the likelihood function is not possible, approximations need to be used. This is a fairly common case in some application fields, especially for population genetics models. To address this issue, we are interested in approximate Bayesian computation (ABC) methods. These are solely based on simulated data, which are then summarised and compared to the observed ones. The comparisons are performed depending on a distance, a similarity threshold and a set of low dimensional summary statistics, which must be carefully chosen. In a parameter inference framework, we propose an approach combining ABC simulations and the random forest machine learning algorithm. We use different strategies depending on the parameter posterior quantity we would like to approximate. Our proposal avoids the usual ABC difficulties in terms of tuning, while providing good results and interpretation tools for practitioners. In addition, we introduce posterior measures of error (i.e., conditionally on the observed data of interest) computed by means of forests. In a model choice setting, we present a strategy based on groups of models to determine, in population genetics, which events of an evolutionary scenario are more or less well identified. All these approaches are implemented in the R package abcrf. In addition, we investigate how to build local random forests, taking into account the observation to predict during their learning phase to improve the prediction accuracy. Finally, using our previous developments, we present two case studies dealing with the reconstruction of the evolutionary history of Pygmy populations, as well as of two subspecies of the desert locust Schistocerca gregaria.
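
The basic ABC recipe summarised in the abstract above (simulate data, summarise it, compare it with the observed data using a distance and a threshold) can be sketched as a plain rejection sampler in Python. This is the classical rejection scheme, not the random forest approach of the thesis and not the abcrf package. The function `abc_rejection`, the Gaussian toy model, the uniform prior, the sample-mean summary, and the threshold of 0.2 are all assumptions chosen for illustration.

```python
import random
import statistics


def abc_rejection(observed, prior_sample, simulate, summary, distance,
                  threshold, n_draws, rng):
    """Keep prior draws whose simulated summary lies within threshold of the observed one."""
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)               # draw a parameter from the prior
        s_sim = summary(simulate(theta, rng))   # simulate data and summarise it
        if distance(s_sim, s_obs) <= threshold:
            accepted.append(theta)
    return accepted


# Toy problem (invented): infer the mean of a unit-variance Gaussian.
rng = random.Random(0)
observed = [rng.gauss(2.0, 1.0) for _ in range(50)]
accepted = abc_rejection(
    observed,
    prior_sample=lambda r: r.uniform(-5.0, 5.0),
    simulate=lambda theta, r: [r.gauss(theta, 1.0) for _ in range(50)],
    summary=statistics.fmean,
    distance=lambda a, b: abs(a - b),
    threshold=0.2,
    n_draws=5000,
    rng=rng,
)
print(len(accepted), statistics.fmean(accepted))
```

The accepted values approximate the posterior of the mean. The choice of distance, threshold and summary statistic drives the quality of that approximation, which is the tuning burden the abstract says the random forest approach is meant to avoid.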

12

Groves, Adrian R. "Bayesian learning methods for modelling functional MRI." Thesis, University of Oxford, 2009. http://ora.ox.ac.uk/objects/uuid:fe46e696-a1a6-4a9d-9dfe-861b05b1ed33.

Abstract:

Bayesian learning methods are the basis of many powerful analysis techniques in neuroimaging, permitting probabilistic inference on hierarchical, generative models of data. This thesis primarily develops Bayesian analysis techniques for magnetic resonance imaging (MRI), which is a noninvasive neuroimaging tool for probing function, perfusion, and structure in the human brain. The first part of this work fits nonlinear biophysical models to multimodal functional MRI data within a variational Bayes framework. Simultaneously-acquired multimodal data contains mixtures of different signals and therefore may have common noise sources, and a method for automatically modelling this correlation is developed. A Gaussian process prior is also used to allow spatial regularization while simultaneously applying informative priors on model parameters, restricting biophysically-interpretable parameters to reasonable values. The second part introduces a novel data fusion framework for multivariate data analysis which finds a joint decomposition of data across several modalities using a shared loading matrix. Each modality has its own generative model, including separate spatial maps, noise models and sparsity priors. This flexible approach can perform supervised learning by using target variables as a modality. By inferring the data decomposition and multivariate decoding simultaneously, the decoding targets indirectly influence the component shapes and help to preserve useful components. The same framework is used for unsupervised learning by placing independent component analysis (ICA) priors on the spatial maps. Linked ICA is a novel approach developed to jointly decompose multimodal data, and is applied to combined structural and diffusion images across groups of subjects. This allows some of the benefits of tensor ICA and spatially-concatenated ICA to be combined, and allows model comparison between different configurations. 
This joint decomposition framework is particularly flexible because of its separate generative models for each modality and could potentially improve modelling of functional MRI, magnetoencephalography, and other functional neuroimaging modalities.

13

Bon, Joshua J. "Advances in sequential Monte Carlo methods." Thesis, Queensland University of Technology, 2022. https://eprints.qut.edu.au/235897/1/Joshua%2BBon%2BThesis%284%29.pdf.

Abstract:

Estimating parameters of complex statistical models and their uncertainty from data is a challenging task in statistics and data science. This thesis developed novel statistical algorithms for efficiently performing statistical estimation, established the validity of these algorithms, and explored their properties with mathematical analysis. The new algorithms and their associated analysis are significant since they permit principled and robust fitting of statistical models that were previously intractable and will thus facilitate new scientific discoveries.

14

Wallman, Kaj Mikael Joakim. "Computational methods for the estimation of cardiac electrophysiological conduction parameters in a patient specific setting." Thesis, University of Oxford, 2013. http://ora.ox.ac.uk/objects/uuid:2d5573b9-5115-4434-b9c8-60f8d0531f86.

Abstract:

Cardiovascular disease is the primary cause of death globally. Although this group encompasses a heterogeneous range of conditions, many of these diseases are associated with abnormalities in the cardiac electrical propagation. In these conditions, structural abnormalities in the form of scars and fibrotic tissue are known to play an important role, leading to a high individual variability in the exact disease mechanisms. Because of this, clinical interventions such as ablation therapy and CRT that work by modifying the electrical propagation should ideally be optimized on a patient specific basis. As a tool for optimizing these interventions, computational modelling and simulation of the heart have become increasingly important. However, in order to construct these models, a crucial step is the estimation of tissue conduction properties, which have a profound impact on the cardiac activation sequence predicted by simulations. Information about the conduction properties of the cardiac tissue can be gained from electrophysiological data, obtained using electroanatomical mapping systems. However, as in other clinical modalities, electrophysiological data are often sparse and noisy, and this results in high levels of uncertainty in the estimated quantities. In this dissertation, we develop a methodology based on Bayesian inference, together with a computationally efficient model of electrical propagation to achieve two main aims: 1) to quantify values and associated uncertainty for different tissue conduction properties inferred from electroanatomical data, and 2) to design strategies to optimise the location and number of measurements required to maximise information and reduce uncertainty. The methodology is validated in several studies performed using simulated data obtained from image-based ventricular models, including realistic fibre orientation and conduction heterogeneities. 
Subsequently, by using the developed methodology to investigate how the uncertainty decreases in response to added measurements, we derive an a priori index for placing electrophysiological measurements in order to optimise the information content of the collected data. Results show that the derived index has a clear benefit in minimising the uncertainty of inferred conduction properties compared to a random distribution of measurements, suggesting that the methodology presented in this dissertation provides an important step towards improving the quality of the spatiotemporal information obtained using electroanatomical mapping.

15

Parat, Florence. "Inference of the Demographic History of Domesticated Species Using Approximate Bayesian Computation and Likelihood-based Methods." Supervisor: Aurélien Tellier. Reviewers: Chris-Carolin Schön and Aurélien Tellier. München: Universitätsbibliothek der TU München, 2020. http://d-nb.info/1213026083/34.

16

Higson, Edward John. "Bayesian methods and machine learning in astrophysics." Thesis, University of Cambridge, 2019. https://www.repository.cam.ac.uk/handle/1810/289728.

Abstract:

This thesis is concerned with methods for Bayesian inference and their applications in astrophysics. We principally discuss two related themes: advances in nested sampling (Chapters 3 to 5), and Bayesian sparse reconstruction of signals from noisy data (Chapters 6 and 7). Nested sampling is a popular method for Bayesian computation which is widely used in astrophysics. Following the introduction and background material in Chapters 1 and 2, Chapter 3 analyses the sampling errors in nested sampling parameter estimation and presents a method for estimating them numerically for a single nested sampling calculation. Chapter 4 introduces diagnostic tests for detecting when software has not performed the nested sampling algorithm accurately, for example due to missing a mode in a multimodal posterior. The uncertainty estimates and diagnostics in Chapters 3 and 4 are implemented in the $\texttt{nestcheck}$ software package, and both chapters describe an astronomical application of the techniques introduced. Chapter 5 describes dynamic nested sampling: a generalisation of the nested sampling algorithm which can produce large improvements in computational efficiency compared to standard nested sampling. We have implemented dynamic nested sampling in the $\texttt{dyPolyChord}$ and $\texttt{perfectns}$ software packages. Chapter 6 presents a principled Bayesian framework for signal reconstruction, in which the signal is modelled by basis functions whose number (and form, if required) is determined by the data themselves. This approach is based on a Bayesian interpretation of conventional sparse reconstruction and regularisation techniques, in which sparsity is imposed through priors via Bayesian model selection. We demonstrate our method for noisy 1- and 2-dimensional signals, including examples of processing astronomical images. The numerical implementation uses dynamic nested sampling, and uncertainties are calculated using the methods introduced in Chapters 3 and 4. 
Chapter 7 applies our Bayesian sparse reconstruction framework to artificial neural networks, where it allows the optimum network architecture to be determined by treating the number of nodes and hidden layers as parameters. We conclude by suggesting possible areas of future research in Chapter 8.

17

Simpson, Edwin Daniel. "Combined decision making with multiple agents." Thesis, University of Oxford, 2014. http://ora.ox.ac.uk/objects/uuid:f5c9770b-a1c9-4872-b0dc-1bfa28c11a7f.

Abstract:

In a wide range of applications, decisions must be made by combining information from multiple agents with varying levels of trust and expertise. For example, citizen science involves large numbers of human volunteers with differing skills, while disaster management requires aggregating information from multiple people and devices to make timely decisions. This thesis introduces efficient and scalable Bayesian inference for decision combination, allowing us to fuse the responses of multiple agents in large, real-world problems and account for the agents’ unreliability in a principled manner. As the behaviour of individual agents can change significantly, for example if agents move in a physical space or learn to perform an analysis task, this work proposes a novel combination method that accounts for these time variations in a fully Bayesian manner using a dynamic generalised linear model. This approach can also be used to augment agents’ responses with continuous feature data, thus permitting decision-making when agents’ responses are in limited supply. Working with information inferred using the proposed Bayesian techniques, an information-theoretic approach is developed for choosing optimal pairs of tasks and agents. This approach is demonstrated by an algorithm that maintains a trustworthy pool of workers and enables efficient learning by selecting informative tasks. The novel methods developed here are compared theoretically and empirically to a range of existing decision combination methods, using both simulated and real data. The results show that the methodology proposed in this thesis improves accuracy and computational efficiency over alternative approaches, and allows for insights to be determined into the behavioural groupings of agents.

18

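OSAKI, Miho, Takeshi FURUHASHI, Tomohiro YOSHIKAWA, and Yoshinobu WATANABE. "Application of an Evaluation-Value Inference Method with a Variable Number of Real Evaluations in Interactive Evolutionary Computation" [対話型進化計算における実評価数可変型評価値推論法の適用]. Japan Society for Fuzzy Theory and Intelligent Informatics, 2008. http://hdl.handle.net/2237/20681.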

19

Catanach, Thomas Anthony. "Computational Methods for Bayesian Inference in Complex Systems." Thesis, 2017. https://thesis.library.caltech.edu/10263/1/catanach_thesis_deposit.pdf.

Abstract:

Bayesian methods are critical for the complete understanding of complex systems. In this approach, we capture all of our uncertainty about a system’s properties using a probability distribution and update this understanding as new information becomes available. By taking the Bayesian perspective, we are able to effectively incorporate our prior knowledge about a model and to rigorously assess the plausibility of candidate models based upon observed data from the system. We can then make probabilistic predictions that incorporate uncertainties, which allows for better decision making and design. However, while these Bayesian methods are critical, they are often computationally intensive, thus necessitating the development of new approaches and algorithms.

In this work, we discuss two approaches to Markov Chain Monte Carlo (MCMC). For many statistical inference and system identification problems, the development of MCMC made the Bayesian approach possible. However, as the size and complexity of inference problems has dramatically increased, improved MCMC methods are required. First, we present Second-Order Langevin MCMC (SOL-MC), a stochastic dynamical system-based MCMC algorithm that uses the damped second-order Langevin stochastic differential equation (SDE) to sample a desired posterior distribution. Since this method is based on an underlying dynamical system, we can utilize existing work in the theory for dynamical systems to develop, implement, and optimize the sampler's performance. Second, we present advances and theoretical results for Sequential Tempered MCMC (ST-MCMC) algorithms. Sequential Tempered MCMC is a family of parallelizable algorithms, based upon Transitional MCMC and Sequential Monte Carlo, that gradually transform a population of samples from the prior to the posterior through a series of intermediate distributions. Since the method is population-based, it can easily be parallelized. In this work, we derive theoretical results to help tune parameters within the algorithm. We also introduce a new sampling algorithm for ST-MCMC called the Rank-One Modified Metropolis Algorithm (ROMMA). This algorithm improves sampling efficiency for inference problems where the prior distribution constrains the posterior. In particular, this is shown to be relevant for problems in geophysics.
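
The reweighting step behind a tempered sequence of intermediate distributions can be sketched as follows. This is a minimal illustration, not the algorithm from the thesis: it assumes the common construction in which the intermediate distributions are proportional to prior times likelihood raised to a power beta that grows from 0 to 1, and the function names `tempering_weights` and `effective_sample_size` are hypothetical.

```python
import math

def tempering_weights(log_likelihoods, delta_beta):
    """Normalized importance weights for raising the tempering exponent by delta_beta.

    Assumption: intermediate distributions are prior * likelihood**beta, so each
    sample is reweighted by likelihood**delta_beta.
    """
    scaled = [delta_beta * ll for ll in log_likelihoods]
    shift = max(scaled)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(s - shift) for s in scaled]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def effective_sample_size(weights):
    """Effective sample size of normalized weights; equals N for uniform weights."""
    return 1.0 / sum(w * w for w in weights)

# Hypothetical log-likelihood values for a population of four samples.
log_liks = [-1.0, -2.0, -5.0, -0.5]
for step in (0.0, 0.5, 2.0):
    ess = effective_sample_size(tempering_weights(log_liks, step))
    print(f"delta_beta={step}: ESS={ess:.2f}")
```

A larger step in beta concentrates the weight on fewer samples, so the effective sample size is one quantity that can guide how quickly the sequence moves from the prior to the posterior.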

We also discuss the application of Bayesian methods to state estimation, disturbance detection, and system identification problems in complex systems. We introduce a Bayesian perspective on learning models and properties of physical systems based upon a layered architecture that can learn quickly and flexibly. We then apply this architecture to detecting and characterizing changes in physical systems with applications to power systems and biology. In power systems, we develop a new formulation of the Extended Kalman Filter for estimating dynamic states described by differential algebraic equations. This filter is then used as the basis for sub-second fault detection and classification. In synthetic biology, we use a Bayesian approach to detect and identify unknown chemical inputs in a biosensor system implemented in a cell population. This approach uses the tools of Bayesian model selection.

20

Hossain, Md Rahat. "A novel hybrid method for solar power prediction." Thesis, 2013. https://figshare.com/articles/thesis/A_novel_hybrid_method_for_solar_power_prediction/13432601.

Abstract:

Renewable energy sources, particularly solar energy, play a vital role in generating environment-friendly electricity. The foremost advantages of solar energy are that it is nonpolluting, freely available and renewable. Renewable green-energy sources are becoming more cost-effective and sustainable substitutes for conventional fossil fuels. Nonetheless, power generation from photovoltaic (PV) systems is unpredictable because it depends on meteorological conditions. Effective use of this fluctuating energy source requires reliable and robust forecast information for the management and operation of a contemporary power grid. Given the rapid growth of solar power generation, the prediction of solar power yields is becoming increasingly important. Large-scale penetration of solar power in the electricity grid poses numerous challenges to the grid operator, mainly due to the intermittency of sunlight. Since the power produced by a PV system depends decisively on the unpredictable availability of sunlight, unexpected variations in PV output may increase operating costs for the electricity system and threaten the reliability of electricity supply. However, the accuracy of existing solar power prediction methods falls short of what is required by forthcoming advanced power grids such as the Smart Grid. Accurate solar power prediction methods are therefore essential. The main goal of this thesis is to produce a novel hybrid prediction method for more accurate, reliable and robust solar power prediction using modern Computational Intelligence (CI). The hybrid prediction method, which is mainly composed of multiple regressive machine learning techniques, is intended to be as accurate and reliable as possible, to accommodate the needs of any future systems that depend upon it for generator or load scheduling, or grid stability control applications. 
In this thesis, research on the experimental analysis and development of hybrid machine learning for solar power prediction is presented. The thesis makes the following major contributions: 1) it investigates heterogeneous machine learning techniques for hybrid prediction methods for solar power; 2) it applies feature selection methods to individually improve the prediction accuracy of these machine learning techniques; 3) it investigates possible parameter optimisation of computational intelligence techniques to make sure that individual predictions are as accurate as possible; 4) it proposes hybrid prediction by non-linearly integrating the discrete prediction results from various machine learning techniques. The performance of the hybrid method relative to the individual techniques was evaluated through experimental analysis, and the results are supported by various statistical tests and error validation metrics, which confirmed the maximum achievable accuracy of the developed hybrid method for solar power prediction. It is expected that the outcome of the research will make a noteworthy contribution to the relevant research field as well as to Australian power industries in the near future.

21

Huang, Chengbang. "Multiscale computational methods for morphogenesis and algorithms for protein-protein interaction inference." 2005. http://etd.nd.edu/ETD-db/theses/available/etd-07212005-085435/.

Abstract:

Thesis (Ph. D.)--University of Notre Dame, 2005.
Thesis directed by Jesús A. Izaguirre for the Department of Computer Science and Engineering. "November 2005." Includes bibliographical references (leaves 133-139).

22

Lunagomez, Simon. "A Geometric Approach for Inference on Graphical Models." Diss., 2009. http://hdl.handle.net/10161/1354.

Abstract:

We formulate a novel approach to infer conditional independence models or Markov structure of a multivariate distribution. Specifically, our objective is to place informative prior distributions over graphs (decomposable and unrestricted) and sample efficiently from the induced posterior distribution. We also explore the idea of factorizing according to complete sets of a graph, which implies working with a hypergraph that cannot be retrieved from the graph alone. The key idea we develop in this paper is a parametrization of hypergraphs using the geometry of points in $R^m$. This induces informative priors on graphs from specified priors on finite sets of points. Constructing hypergraphs from finite point sets has been well studied in the fields of computational topology and random geometric graphs. We develop the framework underlying this idea and illustrate its efficacy using simulations.
Dissertation

23

Hong, Eun-Jong, and Tomás Lozano-Pérez. "Protein side-chain placement: probabilistic inference and integer programming methods." 2003. http://hdl.handle.net/1721.1/3869.

Abstract:

The prediction of energetically favorable side-chain conformations is a fundamental element in homology modeling of proteins and the design of novel protein sequences. The space of side-chain conformations can be approximated by a discrete space of probabilistically representative side-chain conformations (called rotamers). The problem is, then, to find a rotamer selection for each amino acid that minimizes a potential energy function. This is called the Global Minimum Energy Conformation (GMEC) problem. This problem is an NP-hard optimization problem. The Dead-End Elimination theorem together with the A* algorithm (DEE/A*) has been successfully applied to this problem. However, DEE fails to converge for some complex instances. In this paper, we explore two alternatives to DEE/A* in solving the GMEC problem. We use a probabilistic inference method, the max-product (MP) belief-propagation algorithm, to estimate (often exactly) the GMEC. We also investigate integer programming formulations to obtain the exact solution. There are known ILP formulations that can be directly applied to the GMEC problem. We review these formulations and compare their effectiveness using CPLEX optimizers. We also present preliminary work towards applying the branch-and-price approach to the GMEC problem. The preliminary results suggest that the max-product algorithm is very effective for the GMEC problem. Though the max-product algorithm is an approximate method, its speed and accuracy are comparable to those of DEE/A* in large side-chain placement problems and may be superior in sequence design.
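
The GMEC objective described in the abstract above can be made concrete with a small exhaustive search. This is only a sketch for toy instances, not the DEE/A*, max-product, or integer programming methods of the paper: the function name `gmec_brute_force` and the energy tables are hypothetical, and the total energy is assumed to be a sum of per-rotamer self energies and pairwise interaction energies.

```python
from itertools import product

def gmec_brute_force(self_energy, pair_energy):
    """Exhaustively find the lowest-energy rotamer assignment.

    self_energy[i][r]: energy of rotamer r at residue i.
    pair_energy[(i, j)][r][s]: interaction energy of rotamer r at residue i
    with rotamer s at residue j (hypothetical table layout).
    """
    n = len(self_energy)
    best_assignment, best_energy = None, float("inf")
    # Enumerate every combination of one rotamer per residue.
    for assignment in product(*(range(len(e)) for e in self_energy)):
        energy = sum(self_energy[i][assignment[i]] for i in range(n))
        energy += sum(table[assignment[i]][assignment[j]]
                      for (i, j), table in pair_energy.items())
        if energy < best_energy:
            best_assignment, best_energy = assignment, energy
    return best_assignment, best_energy

# Toy instance: three residues with two rotamers each (made-up energies).
self_e = [[0.0, 1.0], [0.5, 0.0], [0.2, 0.3]]
pair_e = {(0, 1): [[2.0, 0.0], [0.0, 1.0]],
          (1, 2): [[0.0, 0.5], [1.5, 0.0]]}
print(gmec_brute_force(self_e, pair_e))
```

The number of assignments grows exponentially with the number of residues, which is why the problem calls for pruning, message passing, or integer programming rather than enumeration.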
Singapore-MIT Alliance (SMA)

24

Viscardi, Cecilia. "Approximate Bayesian Computation and Statistical Applications to Anonymized Data: an Information Theoretic Perspective." Doctoral thesis, 2021. http://hdl.handle.net/2158/1236316.

Abstract:

Realistic statistical modelling of complex phenomena often leads to considering several latent variables and nuisance parameters. In such cases, the Bayesian approach to inference requires the computation of challenging integrals or summations over high dimensional spaces. Monte Carlo methods are a class of widely used algorithms for performing simulated inference. In this thesis, we consider the problem of sample degeneracy in Monte Carlo methods focusing on Approximate Bayesian Computation (ABC), a class of likelihood-free algorithms allowing inference when the likelihood function is analytically intractable or computationally demanding to evaluate. In the ABC framework sample degeneracy arises when proposed values of the parameters, once given as input to the generative model, rarely lead to simulations resembling the observed data and are hence discarded. Such "poor" parameter proposals, i.e., parameter values having an (exponentially) small probability of producing simulation outcomes close to the observed data, do not contribute at all to the representation of the parameter's posterior distribution. This leads to a very large number of required simulations and/or a waste of computational resources, as well as to distortions in the computed posterior distribution. To mitigate this problem, we propose two algorithms, referred to as the Large Deviations Approximate Bayesian Computation algorithms (LD-ABC), where the ABC typical rejection step is avoided altogether. We adopt an information theoretic perspective resorting to the Method of Types formulation of Large Deviations, thus first restricting our attention to models for i.i.d. discrete random variables and then extending the method to parametric finite state Markov chains. We experimentally evaluate our method through proof-of-concept implementations. Furthermore, we consider statistical applications to anonymized data. 
We adopt the point of view of an evaluator interested in publishing data about individuals in an anonymized form that allows balancing the learner’s utility against the risk posed by an attacker, potentially targeting individuals in the dataset. Accordingly, we present a unified Bayesian model applying to data anonymized with group-based schemes and a related MCMC method to learn the population parameters. This allows relative threat analysis, i.e., an analysis of the risk for any individual in the dataset to be linked to a specific sensitive value beyond what is implied for the general population. Finally, we show the performance of the ABC methods in this setting and test LD-ABC on a real-world obfuscated dataset.
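
The sample degeneracy described in this abstract can be seen in a minimal rejection ABC sketch. It illustrates only the standard rejection step that LD-ABC is designed to avoid, not the LD-ABC algorithms themselves; the Bernoulli model, the uniform prior, and the function name `abc_rejection` are assumptions chosen for illustration.

```python
import random

def abc_rejection(observed_successes, n_trials, n_proposals, eps, seed=0):
    """Rejection ABC for a Bernoulli success probability with a Uniform(0, 1) prior.

    Returns the accepted parameter values and the acceptance rate.
    """
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_proposals):
        theta = rng.random()  # propose a parameter value from the prior
        # Simulate n_trials Bernoulli(theta) outcomes and count the successes.
        simulated = sum(rng.random() < theta for _ in range(n_trials))
        # Keep the proposal only if the simulated summary is close to the observed one.
        if abs(simulated - observed_successes) <= eps:
            accepted.append(theta)
    return accepted, len(accepted) / n_proposals

accepted, rate = abc_rejection(observed_successes=18, n_trials=20,
                               n_proposals=5000, eps=0)
print(f"accepted {len(accepted)} of 5000 proposals (rate {rate:.3f})")
```

With a tight tolerance most proposals are discarded, so the simulation cost is largely spent on parameter values that never contribute to the posterior approximation.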

25

Mudgal, Richa. "Inferences on Structure and Function of Proteins from Sequence Data : Development of Methods and Applications." Thesis, 2015. http://etd.iisc.ac.in/handle/2005/3877.

Abstract:

Structural and functional annotation of sequences of putative proteins encoded in the newly sequenced genomes poses an important challenge. While much progress has been made towards high throughput experimental techniques for structure determination and functional assignment to proteins, most of the current genome-wide annotation systems rely on computational methods to derive cues on structure and function based on relationships with related proteins of known structure and/or function. Evolutionary pressure on proteins forces the retention of sequence features that are important for structure and function. Thus, if it can be established that two proteins have descended from a common ancestor, then it can be inferred that the structural fold and biological function of the two proteins would be similar. Homology-based information transfer from one protein to another has played a central role in the understanding of the evolution of protein structures, functions and interactions. Many algorithmic improvements have been developed over the past two decades to recognize homologues of a protein from sequence-based searches alone, but there are still a large number of proteins without any functional annotation. The sensitivity of the available methods can be further enhanced by indirect comparisons with the help of intermediately-related sequences which link related families. However, sequence-based homology searches in the current protein sequence space are often restricted to the family members, due to the paucity of natural intermediate sequences that can act as linkers in detecting remote homologues. Thus, a major goal of this thesis is to develop computational methods to fill up the sparse regions in the protein sequence space with computationally designed protein-like sequences and thereby create a continuum of protein sequences, which could aid in detecting remote homologues. 
Such designed sequences are further assessed for their effectiveness in detection of distant evolutionary relationships and functional annotation of proteins with unknown structure and function. Another important aspect in structural bioinformatics is to gain a good understanding of protein sequence - structure - function paradigm. Functional annotations by comparisons of protein sequences can be further strengthened with the addition of structural information; however, instances of functional divergence and convergence may lead to functional mis-annotations. Therefore, a systematic analysis is performed on the fold–function associations using binding site information and their inter-relationships using binding site similarity networks. Chapter 1 provides a background on proteins, their evolution, classification and structural and functional features. This chapter also describes various methods for detection of remote similarities and the role of protein sequence design methods in detection of distant relatives for protein annotation. Pitfalls in prediction of protein function from sequence and structure are also discussed followed by an outline of the thesis. Chapter 2 addresses the problem of paucity of available protein sequences that can act as linkers between distantly related proteins/families and help in detection of distant evolutionary relationships. Previous efforts in protein sequence design for remote homology detection and design of sequences corresponding to specific protein families are discussed. This chapter describes a novel methodology to computationally design intermediately-related protein sequences between two related families and thus fill-in the gaps in the sequence space between the related families. Protein families as defined in SCOP database are represented as position specific scoring matrices (PSSMs) and these profiles of related protein families within a fold are aligned using AlignHUSH -a profile-profile alignment method. 
Guided by this alignment, the frequency distribution of the amino acids in the two families are combined and for each aligned position a residue is selected based on the combined probability to occur in the alignment positions of two families. Each computationally designed sequence is then subjected to RPS-BLAST searches against an all profile pool representing all protein families. Artificial sequences that detect both the parent profiles with no hits corresponding to other folds qualify as ‘designed intermediate sequences’. Various scoring schemes and divergence levels for the design of protein-like sequences are investigated such that these designed sequences intersperse between two related families, thereby creating a continuum in sequence space. The method is then applied on a large scale for all folds with two or more families and resulted in the design of 3,611,010 intermediately-related sequences for 27,882 profile-profile alignments corresponding to 374 folds. Such designed sequences are generic in nature and can be augmented in any sequence database of natural protein sequences. Such enriched databases can then be queried using any sequence-based remote homology detection method to detect distant relatives. The next chapter (Chapter 3) explores the ability of these designed intermediate sequences to act as linkers of two related families and aid in detection of remote homologues. To assess the applicability of these designed sequences two types of databases have been generated, namely a CONTROL database containing protein sequences from natural sequence databases and an AUGMENTED database in which designed sequences are included in the database of natural sequences. Detailed assessments of the utility of such designed sequences using traditional sequence-based searches in the AUGMENTED database showed an enhanced detection of remote homologues for almost 74% of the folds. 
For over 3,000 queries, it is demonstrated that designed sequences are positioned as suitable linkers, which mediate connections between distantly related proteins. Using examples from known distant evolutionary relationships, we demonstrate that homology searches in augmented databases show an increase of up to 22% in the number of correct evolutionary relationships "discovered". Such connections are reported with high sensitivities and very low false positive rates. Interestingly, they fill in void and sparse regions in sequence space and relate distant proteins not only through multiple routes but also through SCOP-NrichD database, SUPFAM+ database, SUPERFAMILY database, protein domain library queried by pDomTHREADER and HHsearch against HMM library of SCOP families. This approach detected evolutionary relationships for almost 20% of all the families with no known structure or function. A detailed report of predictions for 614 DUFs, with their fold and species distribution, is provided in this chapter. These predictions are then enriched with GO terms and enzyme information wherever available. A detailed discussion is provided for a few of the interesting assignments: DUF1636, DUF1572 and DUF2092, which are functionally annotated as thioredoxin-like 2Fe-2S ferredoxin, putative metalloenzyme and lipoprotein localization factors respectively. These 614 novel structure-function relationships, of which 193 are supported by consensus between at least two of the five methods, can be accessed from http://proline.biochem.iisc.ernet.in/RHD_DUFS/. Protein functions can be appreciated better in the light of evolutionary information from their structures. Chapter 6 describes a database of evolutionary relationships identified between Pfam families. The grouping of Pfam families is important to obtain a better understanding of evolutionary relationships and in obtaining clues to functions of proteins in families of yet unknown function. 
Many structural genomics initiative projects have made considerable efforts in solving structures and bridging the growing gap between protein sequences and their structures. The results of such experiments suggest that often the newly solved structure using X-ray crystallography or NMR methods has structural similarity to a protein with already known structure. These relationships often remain undetected due to unavailability of structural information. Therefore, SUPFAM+ database aims to detect such distant relationships between Pfam families by mapping the Pfam families and SCOP domain families. The work presented in this chapter describes the generation of SUPFAM+ database using a sensitive AlignHUSH method to uncover hidden relationships. Firstly, Pfam families are queried against a profile database of SCOP families to derive Pfam-SCOP associations, and then Pfam families are queried against Pfam database to derive Pfam-Pfam relationships. Pfam families that remain without a mapping to a SCOP family are mapped indirectly to a SCOP family by identifying relationships between such Pfam families and other Pfam families that are already mapped to a SCOP family. The criteria are kept stringent for these mappings to minimize the rate of false positives. When a Pfam family maps to two or more SCOP superfamilies, a decision tree is implemented to assign the Pfam family to a single SCOP superfamily. Using these direct and indirect evolutionary relationships present in the SCOP database, associations between Pfam families are derived. Therefore, a relationship between two Pfam families that do not have significant sequence similarity can be identified if both are related to the same SCOP superfamily. Almost 36% of the Pfam families could be mapped to SCOP families through direct or indirect association. 
These Pfam-SCOP associations are grouped into 1,646 different superfamilies and cataloguing changes that occur in the binding sites between two functions, which are analysed in this study to trace possible routes between different functions in evolutionarily related enzymes. The main conclusions of the entire thesis are summarized in Chapter 8; they contribute to remote homology detection from sequence information alone and to understanding the ‘sequence-structure-function’ paradigm from a binding site perspective. The chapter illustrates the importance of the work presented here in the post-genomic era. It highlights the development of the algorithm for the design of ‘intermediately-related sequences’ that could serve as effective linkers in remote homology detection, its subsequent large-scale assessment, and the fact that such sequences can be added to any protein sequence database and explored by any sequence-based search method. Databases in the NrichD resource are made available in the public domain along with a portal to design artificial sequences for or between protein families. This thesis also provides useful and meaningful predictions for protein families of as yet unknown structure and function using the NrichD database as well as four other state-of-the-art sequence-based remote homology detection methods. A different aspect addressed in this thesis provides a fundamental understanding of the relationships between protein structure and function. Evolutionary relationships between functional families are identified using the inherent structural information for these families, and fold-function relationships are studied from the perspective of similarities in their binding sites. Such studies help in the areas of functional annotation, polypharmacology and protein engineering.
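The direct and indirect Pfam-to-SCOP mapping described in the abstract can be illustrated with a minimal Python sketch. It is not the SUPFAM+ implementation: the function names, the family identifiers and the input format are hypothetical, only one step of indirection is followed, and a family that could inherit more than one superfamily simply keeps the first one found, whereas the thesis resolves such cases with a decision tree.

```python
def map_pfam_to_scop(direct, pfam_links):
    """Extend direct Pfam -> SCOP superfamily hits with one-step indirect ones.

    direct:     dict of Pfam family -> SCOP superfamily (direct profile hits)
    pfam_links: iterable of (pfam, pfam) pairs judged to be related
    Assumption: first assignment wins; no decision tree is applied.
    """
    mapping = dict(direct)
    for a, b in pfam_links:
        for src, dst in ((a, b), (b, a)):
            # Only a directly mapped family may pass on its superfamily.
            if src in direct and dst not in mapping:
                mapping[dst] = direct[src]
    return mapping


def related_pairs(mapping):
    """Return Pfam pairs that share a SCOP superfamily in the mapping."""
    groups = {}
    for pfam, scop in mapping.items():
        groups.setdefault(scop, []).append(pfam)
    pairs = set()
    for members in groups.values():
        members = sorted(members)
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs


# Hypothetical identifiers, for illustration only.
direct_hits = {"PF_A": "sf_1", "PF_C": "sf_2"}
links = [("PF_A", "PF_B")]
full_mapping = map_pfam_to_scop(direct_hits, links)
```

Here `PF_B` has no direct hit but inherits `sf_1` through its link to `PF_A`, so `related_pairs(full_mapping)` reports the two families as related even without significant sequence similarity between them.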

26

Mudgal, Richa. "Inferences on Structure and Function of Proteins from Sequence Data : Development of Methods and Applications." Thesis, 2015. http://etd.iisc.ernet.in/2005/3877.

Abstract:

Structural and functional annotation of sequences of putative proteins encoded in newly sequenced genomes poses an important challenge. While much progress has been made towards high-throughput experimental techniques for structure determination and functional assignment of proteins, most of the current genome-wide annotation systems rely on computational methods to derive cues on structure and function based on relationships with related proteins of known structure and/or function. Evolutionary pressure on proteins forces the retention of sequence features that are important for structure and function. Thus, if it can be established that two proteins have descended from a common ancestor, then it can be inferred that the structural fold and biological function of the two proteins would be similar. Homology-based information transfer from one protein to another has played a central role in the understanding of the evolution of protein structures, functions and interactions. Many algorithmic improvements have been developed over the past two decades to recognize homologues of a protein from sequence-based searches alone, but there are still a large number of proteins without any functional annotation. The sensitivity of the available methods can be further enhanced by indirect comparisons with the help of intermediately-related sequences which link related families. However, sequence-based homology searches in the current protein sequence space are often restricted to the family members, due to the paucity of natural intermediate sequences that can act as linkers in detecting remote homologues. Thus, a major goal of this thesis is to develop computational methods to fill the sparse regions in the protein sequence space with computationally designed protein-like sequences and thereby create a continuum of protein sequences, which could aid in detecting remote homologues.
Such designed sequences are further assessed for their effectiveness in the detection of distant evolutionary relationships and the functional annotation of proteins with unknown structure and function. Another important aspect of structural bioinformatics is to gain a good understanding of the protein sequence-structure-function paradigm. Functional annotations by comparisons of protein sequences can be further strengthened with the addition of structural information; however, instances of functional divergence and convergence may lead to functional mis-annotations. Therefore, a systematic analysis is performed on the fold–function associations using binding site information and their inter-relationships using binding site similarity networks. Chapter 1 provides a background on proteins, their evolution, classification and structural and functional features. This chapter also describes various methods for the detection of remote similarities and the role of protein sequence design methods in the detection of distant relatives for protein annotation. Pitfalls in the prediction of protein function from sequence and structure are also discussed, followed by an outline of the thesis. Chapter 2 addresses the paucity of available protein sequences that can act as linkers between distantly related proteins/families and help in the detection of distant evolutionary relationships. Previous efforts in protein sequence design for remote homology detection and in the design of sequences corresponding to specific protein families are discussed. This chapter describes a novel methodology to computationally design intermediately-related protein sequences between two related families and thus fill in the gaps in the sequence space between the related families. Protein families as defined in the SCOP database are represented as position specific scoring matrices (PSSMs), and these profiles of related protein families within a fold are aligned using AlignHUSH, a profile-profile alignment method.
Guided by this alignment, the frequency distributions of the amino acids in the two families are combined, and for each aligned position a residue is selected based on its combined probability of occurring at the aligned positions of the two families. Each computationally designed sequence is then subjected to RPS-BLAST searches against an all-profile pool representing all protein families. Artificial sequences that detect both the parent profiles with no hits corresponding to other folds qualify as ‘designed intermediate sequences’. Various scoring schemes and divergence levels for the design of protein-like sequences are investigated such that these designed sequences intersperse between two related families, thereby creating a continuum in sequence space. The method is then applied on a large scale to all folds with two or more families, resulting in the design of 3,611,010 intermediately-related sequences for 27,882 profile-profile alignments corresponding to 374 folds. Such designed sequences are generic in nature and can be used to augment any sequence database of natural protein sequences. Such enriched databases can then be queried using any sequence-based remote homology detection method to detect distant relatives. The next chapter (Chapter 3) explores the ability of these designed intermediate sequences to act as linkers of two related families and aid in the detection of remote homologues. To assess the applicability of these designed sequences, two types of databases have been generated, namely a CONTROL database containing protein sequences from natural sequence databases and an AUGMENTED database in which designed sequences are included in the database of natural sequences. Detailed assessments of the utility of such designed sequences using traditional sequence-based searches in the AUGMENTED database showed an enhanced detection of remote homologues for almost 74% of the folds.
For over 3,000 queries, it is demonstrated that designed sequences are positioned as suitable linkers, which mediate connections between distantly related proteins. Using examples from known distant evolutionary relationships, we demonstrate that homology searches in augmented databases show an increase of up to 22% in the number of correct evolutionary relationships discovered. Such connections are reported with high sensitivities and very low false positive rates. Interestingly, they fill in void and sparse regions in sequence space and relate distant proteins not only through multiple routes but also through SCOP-NrichD database, SUPFAM+ database, SUPERFAMILY database, protein domain library queried by pDomTHREADER and HHsearch against HMM library of SCOP families. This approach detected evolutionary relationships for almost 20% of all the families with no known structure or function. A detailed report of the predictions for 614 DUFs, with their fold and species distributions, is provided in this chapter. These predictions are then enriched with GO terms and enzyme information wherever available. A detailed discussion is provided for a few of the interesting assignments: DUF1636, DUF1572 and DUF2092, which are functionally annotated as thioredoxin-like 2Fe-2S ferredoxin, putative metalloenzyme and lipoprotein localization factors, respectively. These 614 novel structure-function relationships, of which 193 are supported by consensus between at least two of the five methods, can be accessed from http://proline.biochem.iisc.ernet.in/RHD_DUFS/. Protein functions can be appreciated better in the light of evolutionary information from their structures. Chapter 6 describes a database of evolutionary relationships identified between Pfam families. The grouping of Pfam families is important for a better understanding of evolutionary relationships and for obtaining clues to the functions of proteins in families of as yet unknown function.
Many structural genomics initiatives have made considerable efforts in solving structures and bridging the growing gap between protein sequences and their structures. The results of such experiments suggest that a structure newly solved by X-ray crystallography or NMR often has structural similarity to a protein of already known structure. These relationships often remain undetected due to the unavailability of structural information. Therefore, the SUPFAM+ database aims to detect such distant relationships between Pfam families by mapping Pfam families to SCOP domain families. The work presented in this chapter describes the generation of the SUPFAM+ database using the sensitive AlignHUSH method to uncover hidden relationships. Firstly, Pfam families are queried against a profile database of SCOP families to derive Pfam-SCOP associations, and then Pfam families are queried against the Pfam database to derive Pfam-Pfam relationships. Pfam families that remain without a mapping to a SCOP family are mapped indirectly to a SCOP family by identifying relationships between such Pfam families and other Pfam families that are already mapped to a SCOP family. The criteria are kept stringent for these mappings to minimize the rate of false positives. In case of a Pfam family mapping to two or more SCOP superfamilies, a decision tree is implemented to assign the Pfam family to a single SCOP superfamily. Using these direct and indirect evolutionary relationships present in the SCOP database, associations between Pfam families are derived. Therefore, a relationship between two Pfam families that do not have significant sequence similarity can be identified if both are related to the same SCOP superfamily. Almost 36% of the Pfam families could be mapped to SCOP families through direct or indirect association.
These Pfam-SCOP associations are grouped into 1,646 different superfamilies and cataloguing changes that occur in the binding sites between two functions, which are analysed in this study to trace possible routes between different functions in evolutionarily related enzymes. The main conclusions of the entire thesis are summarized in Chapter 8; they contribute to remote homology detection from sequence information alone and to understanding the ‘sequence-structure-function’ paradigm from a binding site perspective. The chapter illustrates the importance of the work presented here in the post-genomic era. It highlights the development of the algorithm for the design of ‘intermediately-related sequences’ that could serve as effective linkers in remote homology detection, its subsequent large-scale assessment, and the fact that such sequences can be added to any protein sequence database and explored by any sequence-based search method. Databases in the NrichD resource are made available in the public domain along with a portal to design artificial sequences for or between protein families. This thesis also provides useful and meaningful predictions for protein families of as yet unknown structure and function using the NrichD database as well as four other state-of-the-art sequence-based remote homology detection methods. A different aspect addressed in this thesis provides a fundamental understanding of the relationships between protein structure and function. Evolutionary relationships between functional families are identified using the inherent structural information for these families, and fold-function relationships are studied from the perspective of similarities in their binding sites. Such studies help in the areas of functional annotation, polypharmacology and protein engineering.
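The residue-selection step described for Chapter 2 (combine the amino acid frequency distributions of two aligned family profiles, then pick a residue per aligned position according to the combined probability) can be sketched in Python as follows. This is an illustrative sketch only, not the thesis implementation: the function names, the dictionary-based profile format, the linear mixing with `weight_a`, and the list of aligned position pairs are all assumptions, and the subsequent RPS-BLAST filtering of candidates is not shown.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")


def combine_columns(col_a, col_b, weight_a=0.5):
    """Mix two per-position frequency dicts and renormalise.

    Assumption: a simple weighted average stands in for the scoring
    schemes and divergence levels explored in the thesis.
    """
    mixed = {
        aa: weight_a * col_a.get(aa, 0.0) + (1.0 - weight_a) * col_b.get(aa, 0.0)
        for aa in AMINO_ACIDS
    }
    total = sum(mixed.values())
    if total <= 0:
        raise ValueError("aligned columns carry no frequency information")
    return {aa: p / total for aa, p in mixed.items()}


def design_sequence(profile_a, profile_b, aligned_pairs, weight_a=0.5, rng=None):
    """Sample one candidate sequence from two aligned family profiles.

    profile_a, profile_b: lists of {amino_acid: frequency} dicts, one per column
    aligned_pairs:        list of (i, j) column indices paired by the alignment
    """
    rng = rng or random.Random()
    residues = []
    for i, j in aligned_pairs:
        probs = combine_columns(profile_a[i], profile_b[j], weight_a)
        weights = [probs[aa] for aa in AMINO_ACIDS]
        residues.append(rng.choices(AMINO_ACIDS, weights=weights, k=1)[0])
    return "".join(residues)


# Toy two-column profiles, for illustration only.
family_a = [{"A": 1.0}, {"G": 1.0}]
family_b = [{"A": 1.0}, {"W": 1.0}]
candidate = design_sequence(family_a, family_b, [(0, 0), (1, 1)],
                            rng=random.Random(0))
```

In this toy case the first position is always `A`, because both families agree, while the second is drawn from `G` or `W` with equal probability. Shifting `weight_a` towards 0 or 1 moves the sampled sequences closer to one parent family, which is one simple way to think about varying the divergence level.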
