Dissertations / Theses: 'Sequence data'

1

Chui, Chun-kit, and 崔俊傑. "OLAP on sequence data." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2010. http://hub.hku.hk/bib/B45823996.

2

Pray, Keith A. "Apriori Sets And Sequences: Mining Association Rules from Time Sequence Attributes." Link to electronic thesis, 2004. http://www.wpi.edu/Pubs/ETD/Available/etd-0506104-150831/.

Abstract:

Thesis (M.S.) -- Worcester Polytechnic Institute.
Keywords: mining complex data; temporal association rules; computer system performance; stock market analysis; sleep disorder data. Includes bibliographical references (p. 79-85).

3

Brine, A. "Direct sequence data transmission systems." Thesis, University of Kent, 1987. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.379274.

4

Zhang, Minghua, and 張明華. "Sequence mining algorithms." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B44570119.

5

Ibeh, Neke. "Inferring Viral Dynamics from Sequence Data." Thesis, Université d'Ottawa / University of Ottawa, 2016. http://hdl.handle.net/10393/35317.

Abstract:

One of the primary objectives of infectious disease research is uncovering the direct link that exists between viral population dynamics and molecular evolution. For RNA viruses in particular, evolution occurs at such a rapid pace that epidemiological processes become ingrained into gene sequences. Conceptually, this link is easy to make: as RNA viruses spread throughout a population, they evolve with each new host infection. However, developing a quantitative understanding of this connection is difficult. Thus, the emerging discipline of phylodynamics is centered on reconciling epidemiology and phylogenetics using genetic analysis. Here, we present two research studies that draw on phylodynamic principles in order to characterize the progression and evolution of the Ebola virus and the human immunodeficiency virus (HIV). In the first study, the interplay between selection and epistasis in the Ebola virus genome is elucidated through the ancestral reconstruction of a critical region in the Ebola virus glycoprotein. Hence, we provide a novel mechanistic account of the structural changes that led up to the 2014 Ebola virus outbreak. The second study applies an approximate Bayesian computation (ABC) approach to the inference of epidemiological parameters. First, we demonstrate the accuracy of this approach with simulated data. Then, we infer the dynamics of the Swiss HIV-1 epidemic, illustrating the applicability of this statistical method to the public health sector. Altogether, this thesis unravels some of the complex dynamics that shape epidemic progression, and provides potential avenues for facilitating viral surveillance efforts.

6

Parsons, Jeremy David. "Computer analysis of molecular sequences." Thesis, University of Cambridge, 1993. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.282922.

7

Hamby, Stephen Edward. "Data mining techniques for protein sequence analysis." Thesis, University of Nottingham, 2010. http://eprints.nottingham.ac.uk/11498/.

Abstract:

This thesis concerns two areas of bioinformatics related by their role in protein structure and function: protein structure prediction and post translational modification of proteins. The dihedral angles Ψ and Φ are predicted using support vector regression. For the prediction of Ψ dihedral angles the addition of structural information is examined and the normalisation of Ψ and Φ dihedral angles is examined. An application of the dihedral angles is investigated. The relationship between dihedral angles and three bond J couplings determined from NMR experiments is described by the Karplus equation. We investigate the determination of the correct solution of the Karplus equation using predicted Φ dihedral angles. Glycosylation is an important post translational modification of proteins involved in many different facets of biology. The work here investigates the prediction of N-linked and O-linked glycosylation sites using the random forest machine learning algorithm and pairwise patterns in the data. This methodology produces more accurate results when compared to state of the art prediction methods. The black box nature of random forest is addressed by using the trepan algorithm to generate a decision tree with comprehensible rules that represents the decision making process of random forest. The prediction of our program GPP does not distinguish between glycans at a given glycosylation site. We use farthest first clustering, with the idea of classifying each glycosylation site by the sugar linking the glycan to protein. This thesis demonstrates the prediction of protein backbone torsion angles and improves the current state of the art for the prediction of glycosylation sites. It also investigates potential applications and the interpretation of these methods.
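
For readers unfamiliar with the Karplus equation mentioned in this abstract, its general form is sketched below. The coefficients A, B and C are empirical and depend on the coupling type and parameterisation; no values are taken from the thesis. The offset of 60° is the convention commonly used for the HN-Hα coupling and is stated here as an assumption.

```latex
% General form of the Karplus relation (A, B, C are empirical coefficients)
{}^{3}J(\theta) = A\cos^{2}\theta + B\cos\theta + C,
\qquad \theta = \phi - 60^{\circ}

% Solving for the angle given a measured coupling J is a quadratic in \cos\theta
\cos\theta = \frac{-B \pm \sqrt{B^{2} - 4A\,(C - J)}}{2A}
```

Each admissible value of cos θ corresponds to two angles (+θ and -θ), so a single measured coupling can be consistent with up to four values of φ. This ambiguity is presumably why the abstract speaks of determining the "correct solution", with a predicted φ used to choose among the candidates.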

8

Chung, Jimmy Hok Leung. "Application of sequence prediction to data compression." Thesis, Manchester Metropolitan University, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.322411.

9

Maydt, Jochen. "Analysis of recombination in molecular sequence data." Aachen Shaker, 2008. http://d-nb.info/993318045/04.

10

Layton, Martin Ian. "Augmented statistical models for classifying sequence data." Thesis, University of Cambridge, 2007. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.613094.

11

Ragonnet-Cronin, Manon Lily. "Transmission networks inferred from HIV sequence data." Thesis, University of Edinburgh, 2015. http://hdl.handle.net/1842/16151.

Abstract:

HIV in the UK in the 1980s was concentrated within men who have sex with men (MSM) and people who inject drugs (PWID) but heterosexual sex is now the most frequently reported risk behaviour. As these risk groups are associated with different virus populations, this is reflected in the subtype diversification of the UK epidemic, which was historically dominated by subtype B. I have made use of a national database of HIV sequences collected during routine clinical care, which also contains data on age, sex, route of exposure & ethnicity. The 2014 release of the UK HIV Drug Resistance Database contained data from over 60,000 patients. In this thesis, I first describe the development of novel tools that rapidly and automatically identify HIV clusters within phylogenetic trees containing tens of thousands of sequences because they represent transmission chains within the larger infected population. I use these tools to compare the HIV subtype B epidemics in the UK and Switzerland, which had both been described separately but using different approaches. Working with Swiss colleagues, I was able to analyse the epidemics in exactly the same way without having to share sensitive data. I found clustering in the UK to be much higher at relaxed thresholds than in Switzerland (34% vs 16%) indicating that the UK database is more likely to capture transmission chains. Down sampling revealed that this pattern is driven by the larger size of the UK epidemic. At tighter cluster thresholds, the epidemics were very similar. I next use these tools to analyse the spread of emerging subtypes A1, C, D and G in the UK. I found both risk group and cluster size to be predictive of cluster growth, which I tested using simulations and a GLM. Growth of MSM and crossover clusters was significantly higher than expected for subtypes A1 and C, indicating that crossover from heterosexuals to MSM has contributed to their expansion within the UK. 
Numbers were small for subtypes D and G but the proportion of new diagnoses linking to MSM and crossover clusters was similar to A1 and C, suggesting that the same pattern may be emerging for D and G. I conclude by evaluating the accuracy of a method previously described by our group to generate transmission networks from HIV sequences. The interpretation of clustering patterns from phylogenetic trees is difficult because of the absence of a standardised statistical framework. In contrast, a body of work exists that relates disease transmission to networks. Using large simulated datasets, I developed algorithms which eliminate improbable links. I then reconstructed improved UK transmission networks for subtypes A1, B and C and compare network metrics (such as the degree distribution) between risk groups. Together with other evidence, this thesis demonstrates that the UK HIV epidemic continues to be driven by transmission among MSM. The UK epidemic is no longer compartmentalised and the crossing over of subtypes across risk groups has been facilitated by MSM also having sex with women.

12

Chan, Wing-yan Sarah, and 陳詠欣. "Emerging substrings for sequence classification." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2003. http://hub.hku.hk/bib/B2971672X.

13

Pustułka-Hunt, Elżbieta Katarzyna. "Biological sequence indexing using persistent Java." Thesis, University of Glasgow, 2001. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.270957.

14

Huang, Tzu-Kuo. "Exploiting Non-Sequence Data in Dynamic Model Learning." Research Showcase @ CMU, 2013. http://repository.cmu.edu/dissertations/561.

Abstract:

Virtually all methods of learning dynamic models from data start from the same basic assumption: that the learning algorithm will be provided with a single or multiple sequences of data generated from the dynamic model. However, in quite a few modern time series modeling tasks, the collection of reliable time series data turns out to be a major challenge, due to either slow progression of the dynamic process of interest, or inaccessibility of repetitive measurements of the same dynamic process over time. In most of those situations, however, we observe that it is easier to collect a large amount of non-sequence samples, or random snapshots of the dynamic process of interest without time information. This thesis aims to exploit such non-sequence data in learning a few widely used dynamic models, including fully observable, linear and nonlinear models as well as Hidden Markov Models (HMMs). For fully observable models, we point out several issues on model identifiability when learning from non-sequence data, and develop EM-type learning algorithms based on maximizing approximate likelihood. We also consider the setting where a small amount of sequence data are available in addition to non-sequence data, and propose a novel penalized least squares approach that uses non-sequence data to regularize the model. For HMMs, we draw inspiration from recent advances in spectral learning of latent variable models and propose spectral algorithms that provably recover the model parameters, under reasonable assumptions on the generative process of non-sequence data and the true model. To the best of our knowledge, this is the first formal guarantee on learning dynamic models from non-sequence data. We also consider the case where little sequence data are available, and propose learning algorithms that, as in the fully observable case, use non-sequence data to provide regularization, but do so in combination with spectral methods. 
Experiments on synthetic data and several real data sets, including gene expression and cell image time series, demonstrate the effectiveness of our proposed methods. In the last part of the thesis we return to the usual setting of learning from sequence data, and consider learning bi-clustered vector auto-regressive models, whose transition matrix is both sparse, revealing significant interactions among variables, and bi-clustered, identifying groups of variables that have similar interactions with other variables. Such structures may aid other learning tasks in the same domain that have abundant non-sequence data by providing better regularization in our proposed non-sequence methods.

15

Parry-Smith, David John. "Algorithms and data structures for protein sequence analysis." Thesis, University of Leeds, 1990. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.277404.

16

Pei, Shermin. "Identification of functional RNA structures in sequence data." Thesis, Boston College, 2016. http://hdl.handle.net/2345/bc-ir:107275.

Abstract:

Thesis advisor: Michelle M. Meyer
Thesis advisor: Peter Clote
Structured RNAs have many biological functions ranging from catalysis of chemical reactions to gene regulation. Many of these homologous structured RNAs display most of their conservation at the secondary or tertiary structure level. As a result, strategies for natural structured RNA discovery rely heavily on identification of sequences sharing a common stable secondary structure. However, correctly identifying the functional elements of the structure continues to be challenging. In addition to studying natural RNAs, we improve our ability to distinguish functional elements by studying sequences derived from in vitro selection experiments to select structured RNAs that bind specific proteins. In this thesis, we seek to improve methods for distinguishing functional RNA structures from arbitrarily predicted structures in sequencing data. To do so, we developed novel algorithms that prioritize the structural properties of the RNA that are under selection. In order to identify natural structured ncRNAs, we bring concepts from evolutionary biology to bear on the de novo RNA discovery process. Since there is selective pressure to maintain the structure, we apply molecular evolution concepts such as neutrality to identify functional RNA structures. We hypothesize that alignments corresponding to structured RNAs should consist of neutral sequences. During the course of this work, we developed a novel measure of neutrality, the structure ensemble neutrality (SEN), which calculates neutrality by averaging the magnitude of structure retained over all single point mutations to a given sequence. In order to analyze in vitro selection data for RNA-protein binding motifs, we developed a novel framework that identifies enriched substructures in the sequence pool. Our method accounts for both sequence and structure components by abstracting the overall secondary structure into smaller substructures composed of a single base-pair stack. 
Unlike many current tools, our algorithm is designed to deal with the large data sets coming from high-throughput sequencing. In conclusion, our algorithms have similar performance to existing programs. However, unlike previous methods, our algorithms are designed to leverage the evolutionary selective pressures in order to emphasize functional structure conservation.
Thesis (PhD) — Boston College, 2016
Submitted to: Boston College. Graduate School of Arts and Sciences
Discipline: Biology

17

Swenson, Hugo. "Detection of artefacts in FFPE-sample sequence data." Thesis, Uppsala universitet, Institutionen för biologisk grundutbildning, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-392623.

Abstract:

Next generation sequencing is increasingly used as a diagnostic tool in the clinical setting. This is driven by the vast increase in molecular targeted therapy, which requires detailed information on what genetic variants are present in patient samples. In the hospital setting, most cancer diagnostics are based on Formalin Fixed Paraffin Embedded (FFPE) samples. The FFPE routine is very beneficial for logistical purposes and for some histopathological analyses, but creates problems for molecular diagnostics based on DNA. These problems derive from sample immersion in formalin, which results in DNA fragmentation, interstrand DNA crosslinking and sequence artefacts due to hydrolytic deamination. Distinguishing such artefacts from true somatic variants can be challenging, thus affecting both research and clinical analyses. In order to distinguish FFPE-artefacts from true variants in next generation sequencing data from FFPE samples, I developed the novel program FUSAC (FFPE tissue UMI based Sequence Artefact Classifier) for the facility Clinical Genomics in Uppsala. FUSAC utilizes Unique Molecular Identifiers (UMIs) to identify and group sequencing reads based on their molecule of origin. By using UMIs to collapse duplicate paired reads into consensus reads, FFPE-artefacts are classified through comparative analysis of the positive and negative strand sequences. My findings indicate that FUSAC can successfully distinguish UMI-tagged next generation sequencing reads with FFPE-artefacts from sequencing reads with true variants. FUSAC thus presents a novel approach in bioinformatic pipelines for studying FFPE-artefacts.
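
The UMI-based idea described in this abstract can be illustrated with a minimal Python sketch. This is not FUSAC's implementation: the function names (`consensus`, `collapse_by_umi`, `strand_mismatches`), the `(umi, strand, sequence)` read representation and the simple majority vote are all assumptions made for illustration. The sketch also assumes reads are equal in length and already oriented to the same reference strand.

```python
from collections import Counter, defaultdict


def consensus(seqs):
    """Majority base at each position of equal-length sequences (hypothetical helper)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


def collapse_by_umi(reads):
    """Group (umi, strand, sequence) reads by UMI and strand, then build one consensus per group."""
    groups = defaultdict(lambda: defaultdict(list))
    for umi, strand, seq in reads:
        groups[umi][strand].append(seq)
    return {
        umi: {strand: consensus(seqs) for strand, seqs in by_strand.items()}
        for umi, by_strand in groups.items()
    }


def strand_mismatches(collapsed):
    """For UMIs seen on both strands, list positions where the two consensus reads disagree."""
    result = {}
    for umi, by_strand in collapsed.items():
        if "+" in by_strand and "-" in by_strand:
            plus, minus = by_strand["+"], by_strand["-"]
            result[umi] = [i for i, (a, b) in enumerate(zip(plus, minus)) if a != b]
    return result


# Toy data: one molecule (UMI "AACG") read on both strands, one read on a single strand only.
reads = [
    ("AACG", "+", "ACGT"),
    ("AACG", "+", "ACGT"),
    ("AACG", "+", "ACTT"),  # isolated sequencing error, outvoted in the consensus
    ("AACG", "-", "ATGT"),
    ("AACG", "-", "ATGT"),
    ("TTGC", "+", "GGCA"),
]
collapsed = collapse_by_umi(reads)
print(collapsed["AACG"])             # {'+': 'ACGT', '-': 'ATGT'}
print(strand_mismatches(collapsed))  # {'AACG': [1]}
```

A position where the two strand consensus reads of one molecule disagree (here a C versus T at index 1) is a candidate artefact, whereas a true variant would be expected on both strands. How such candidates are actually scored is specific to the thesis and is not reproduced here.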

18

Winarko, Edi. "The Discovery and Retrieval of Temporal Rules in Interval Sequence Data." Flinders University. Informatics and Engineering, 2007. http://catalogue.flinders.edu.au./local/adt/public/adt-SFU20080107.164033.

Abstract:

Data mining is increasingly becoming an important tool in extracting interesting knowledge from large databases. Many industries are now using data mining tools for analysing their large collections of databases and making business decisions. Many data mining problems involve temporal aspects, with examples ranging from engineering to scientific research, finance and medicine. Temporal data mining is an extension of data mining which deals with temporal data. Mining temporal data poses more challenges than mining static data. While the analysis of static data sets often comes down to the question of data items, with temporal data there are many additional possible relations. One of the tasks in temporal data mining is the pattern discovery task, whose objective is to discover time-dependent correlations, patterns or rules between events in large volumes of data. To date, most temporal pattern discovery research has focused on events existing at a point in time rather than over a temporal interval. In comparison to static rules, mining with respect to time points provides semantically richer rules. However, accommodating temporal intervals offers rules that are richer still. This thesis addresses several issues related to the pattern discovery from interval sequence data. Despite its importance, this area of research has received relatively little attention and there are still many issues that need to be addressed. Three main issues that this thesis considers include the definition of what constitutes an interesting pattern in interval sequence data, the efficient mining for patterns in the data, and the identification of interesting patterns from a large number of discovered patterns. In order to deal with these issues, this thesis formulates the problem of discovering rules, which we term richer temporal association rules, from interval sequence databases. 
Furthermore, this thesis develops an efficient algorithm, ARMADA, for discovering richer temporal association rules. The algorithm does not require candidate generation. It utilizes a simple index, and only requires at most two database scans. In this thesis, a retrieval system is proposed to facilitate the selection of interesting rules from a set of discovered richer temporal association rules. To this end, a high-level query language specification, TAR-QL, is proposed to specify the criteria of the rules to be retrieved from the rule sets. Three low-level methods are developed to evaluate queries involving rule format conditions. In order to improve the performance of the methods, signature file based indexes are proposed. In addition, this thesis proposes the discovery of inter-transaction relative temporal association rules from event sequence databases.

19

Harshbarger, Stuart D. "Measured noise performance of a data clock circuit derived from the local M-sequence in direct-sequence spread spectrum systems." Thesis, Monterey, California : Naval Postgraduate School, 1990. http://handle.dtic.mil/100.2/ADA238335.

Abstract:

Thesis (M.S. in Electrical Engineering)--Naval Postgraduate School, September 1990.
Thesis Advisor(s): Myers, Glen. Second Reader: Ha, Tri. "September 1990." Description based on title screen as viewed on December 21, 2009. DTIC Identifiers: Direct sequence spread spectrum, data clocks, delay lock loops, sequence generators. Author(s) subject terms: Direct-sequence spread spectrum, communications, data clock recovery, M-sequence, delay-lock loop, spread spectrum, binary sequence generation. Includes bibliographical references (p. 40). Also available in print.

20

Chen, Liangzhe. "Segmenting, Summarizing and Predicting Data Sequences." Diss., Virginia Tech, 2018. http://hdl.handle.net/10919/83573.

Abstract:

Temporal data is ubiquitous nowadays and can be easily found in many applications. Consider the extensively studied social media website Twitter. All the information can be associated with time stamps, and thus form different types of data sequences: a sequence of feature values of users who retweet a message, a sequence of tweets from a certain user, or a sequence of the evolving friendship networks. Mining these data sequences is an important task, which reveals patterns in the sequences, and it is a very challenging task as it usually requires different techniques for different sequences. The problem becomes even more complicated when the sequences are correlated. In this dissertation, we study the following two types of data sequences, and we show how to carefully exploit within-sequence and across-sequence correlations to develop more effective and scalable algorithms. 1. Multi-dimensional value sequences: We study sequences of multi-dimensional values, where each value is associated with a time stamp. Such value sequences arise in many domains such as epidemiology (medical records), social media (keyword trends), etc. Our goals are: for individual sequences, to find a segmentation of the sequence to capture where the pattern changes; for multiple correlated sequences, to use the correlations between sequences to further improve our segmentation; and to automatically find explanations of the segmentation results. 2. Social media post sequences: Driven by applications from popular social media websites such as Twitter and Weibo, we study the modeling of social media post sequences. Our goal is to understand how the posts (like tweets) are generated and how we can gain understanding of the users behind these posts. For individual social media post sequences, we study a prediction problem to find the users' latent state changes over the sequence. 
For dependent post sequences, we analyze the social influence among users, and how it affects users in generating posts and links. Our models and algorithms lead to useful discoveries, and they solve real problems in Epidemiology, Social Media and Critical Infrastructure Systems. Further, most of the algorithms and frameworks we propose can be extended to solve sequence mining problems in other domains as well.
Ph. D.

21

Maydt, Jochen [Verfasser]. "Analysis of Recombination in Molecular Sequence Data / Jochen Maydt." Aachen : Shaker, 2009. http://d-nb.info/1126378321/34.

22

Myers, Simon R. "The detection of recombination events using DNA sequence data." Thesis, University of Oxford, 2002. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.289117.

23

Henderson, Daniel Adrian. "Modelling and analysis of non-coding DNA sequence data." Thesis, University of Newcastle Upon Tyne, 1999. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.299427.

24

Frusher, Marie J. "Predicting protein-protein interactions from sequence and structure data." Thesis, University of Essex, 2005. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.412108.

25

Menke, Matthew Ewald 1978. "Predicting the beta-trefoil fold from protein sequence data." Thesis, Massachusetts Institute of Technology, 2004. http://hdl.handle.net/1721.1/30093.

Abstract:

Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004.
Includes bibliographical references (p. 45-47).
A method is presented that uses β-strand interactions at both the sequence and the atomic level, to predict the beta-structural motifs in protein sequences. A program called Wrap-and-Pack implements this method, and is shown to recognize β-trefoils, an important class of globular β-structures, in the Protein Data Bank with 92% specificity and 92.3% sensitivity in cross-validation. It is demonstrated that Wrap-and-Pack learns each of the ten known SCOP β-trefoil families, when trained primarily on β-structures that are not β-trefoils, together with 3D structures of known β-trefoils from outside the family. Wrap-and-Pack also predicts many proteins of unknown structure to be β-trefoils. The computational method used here may generalize to other β-structures for which strand topology and profiles of residue accessibility are well conserved.
by Matthew Ewald Menke.
S.M.

26

Powell, David Richard 1973. "Algorithms for sequence alignment." Monash University, School of Computer Science and Software Engineering, 2001. http://arrow.monash.edu.au/hdl/1959.1/8051.

27

Tang, Fung Michael, and 鄧峰. "Sequence classification and melody tracks selection." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2001. http://hub.hku.hk/bib/B29742973.

28

Tang, Fung Michael. "Sequence classification and melody tracks selection /." Hong Kong : University of Hong Kong, 2001. http://sunzi.lib.hku.hk/hkuto/record.jsp?B25017470.

29

Ho, Ngai-lam, and 何毅林. "Algorithms on constrained sequence alignment." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B30201949.

30

Hung, Rong-I. "Computational studies of protein sequence and structure." Thesis, University of Oxford, 1999. http://ora.ox.ac.uk/objects/uuid:9905c946-86dd-4bb3-8824-7c50df136913.

Abstract:

This thesis explores aspects of protein function, structure and sequence by computational approaches. A comparative study of definitions of protein secondary structures was performed. Disagreements in assignment resulting from three different algorithms were observed. The causes of inaccuracies in structure assignments were discussed and possibilities of projecting protein secondary structures by different structural descriptors were tested. The investigation of inconsistent assignments of protein secondary structure led to a study of a more specific issue concerning protein structure/function relationships, namely cis/trans isomerisation of a peptide bond. Surveys were carried out at the level of protein molecules to detect the occurrences of the cis peptide bond, and at the level of protein domains to explore the possible biological implications of the occurrences of the structural motif. Research was then focussed on α-helical integral membrane proteins. A detailed analysis of sequences and putative transmembrane helical structures was conducted on the ABC transporters from different organisms. Interesting relationships between protein sequences, putative α-helical structures and transporter functions were identified. Applications of molecular dynamics simulations to the transmembrane helices of a specific human ABC transporter, cystic fibrosis transmembrane conductance regulator (CFTR), explored some of these relationships at the atomic resolution. Functional and structural implications of individual residues within membrane-spanning helices were revealed by these simulation studies.

31

Liu, Kai. "Detecting stochastic motifs in network and sequence data for human behavior analysis." HKBU Institutional Repository, 2014. https://repository.hkbu.edu.hk/etd_oa/60.

Abstract:

With the recent advent of Web 2.0, mobile computing, and pervasive sensing technologies, human activities can readily be logged, leaving digital traces of different forms. For instance, human communication activities recorded in online social networks allow user interactions to be represented as “network” data. Also, human daily activities can be tracked in a smart house, where the log of sensor triggering events can be represented as “sequence” data. This thesis research aims to develop computational data mining algorithms using the generative modeling approach to extract salient patterns (motifs) embedded in such network and sequence data, and to apply them for human behavior analysis. Motifs are defined as the recurrent over-represented patterns embedded in the data, and have been known to be effective for characterizing complex networks. Many motif extraction methods found in the literature assume that a motif is either present or absent. In real practice, such salient patterns can appear partially due to their stochastic nature and/or the presence of noise. Thus, the probabilistic approach is adopted in this thesis to model motifs. For network data, we use a probability matrix to represent a network motif and propose a mixture model to extract network motifs. A component-wise EM algorithm is adopted where the optimal number of stochastic motifs is automatically determined with the help of a minimum message length criterion. Considering also the edge occurrence ordering within a motif, we model a motif as a mixture of first-order Markov chains for the extraction. Using a probabilistic approach similar to the one for network motif, an optimal set of stochastic temporal network motifs is extracted. We carried out rigorous experiments to evaluate the performance of the proposed motif extraction algorithms using both synthetic data sets and real-world social network data sets and mobile phone usage data sets, and obtained promising results. 
Also, we found that some of the results can be interpreted using the social balance and social status theories which are well-known in social network analysis. To evaluate the effectiveness of stochastic temporal network motifs beyond characterizing human behaviors, we incorporate stochastic temporal network motifs as local structural features into a factor graph model for followee recommendation prediction (essentially a link prediction problem) in online social networks. The proposed motif-based factor graph model is found to outperform significantly the existing state-of-the-art methods for the prediction task. To extract motifs from sequence data, the probabilistic framework proposed for the stochastic temporal network motif extraction is also applicable. One possible way is to make use of the edit distance in the probabilistic framework so that the subsequences with minor ordering variations can first be grouped to form the initial set of motif candidates. A mixture model can then be used to determine the optimal set of temporal motifs. We applied this approach to extract sequence motifs from a smart home data set which contains sensor triggering events corresponding to some activities performed by residents in the smart home. The unique behavior extracted for each resident based on the detected motifs is also discussed. Keywords: Stochastic network motifs, finite mixture models, expectation maximization algorithms, social networks, stochastic temporal network motifs, mixture of Markov chains, human behavior analysis, followee recommendation, signed social networks, activity of daily living, smart environments


32

Fritz, Markus Hsi-Yang. "Exploiting high throughput DNA sequencing data for genomic analysis." Thesis, University of Cambridge, 2012. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.610819.


33

Peng, Yu, and 彭煜. "Iterative de Bruijn graph assemblers for second-generation sequencing reads." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2012. http://hub.hku.hk/bib/B50534051.


Abstract:

The recent advance of second-generation sequencing technologies has made it possible to generate a vast amount of short read sequences from a DNA (cDNA) sample. Current short read assemblers make use of the de Bruijn graph, in which each vertex is a k-mer and each edge connecting vertex u and vertex v represents u and v appearing in a read consecutively, to produce contigs. There are three major problems for de Bruijn graph assemblers: (1) branch problem, due to errors and repeats; (2) gap problem, due to low or uneven sequencing depth; and (3) error problem, due to sequencing errors. A proper choice of k value is a crucial tradeoff in de Bruijn graph assemblers: a low k value leads to fewer gaps but more branches; a high k value leads to fewer branches but more gaps. In this thesis, I first analyze the fundamental genome assembly problem and then propose an iterative de Bruijn graph assembler (IDBA), which iterates from low to high k values, to construct a de Bruijn graph with fewer branches and fewer gaps than any other de Bruijn graph assembler using a fixed k value. Then, the second-generation sequencing data from metagenomic, single-cell and transcriptome samples is investigated. IDBA is then tailored with special treatments to handle the specific issues for each kind of data. For metagenomic sequencing data, a graph partition algorithm is proposed to separate de Bruijn graph into dense components, which represent similar regions in subspecies from the same species, and multiple sequence alignment is used to produce consensus of each component. For sequencing data with highly uneven depth such as single-cell and metagenomic sequencing data, a method called local assembly is designed to reconstruct missing k-mers in low-depth regions. Then, based on the observation that short and relatively low-depth contigs are more likely erroneous, progressive depth on contigs is used to remove errors in both low-depth and high-depth regions iteratively. 
For transcriptome sequencing data, a variant of the progressive depth method is adopted to decompose the de Bruijn graph into components corresponding to transcripts from the same gene, and then the transcripts are found in each component by considering the support from reads and paired-end reads. Extensive experiments on both simulated and real data show that IDBA assemblers outperform the existing assemblers by constructing longer contigs with higher completeness and similar or better accuracy. The running time of IDBA assemblers is comparable to existing algorithms, while the memory cost is usually lower.
Computer Science
Doctoral
Doctor of Philosophy
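The de Bruijn graph described in the abstract above (each vertex a k-mer, each edge joining two k-mers that appear consecutively in a read) can be illustrated with a small sketch. This is not IDBA itself: the function names and the toy reads are assumptions, and error correction, iteration over k, and contig construction are all omitted.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for u, v in zip(kmers, kmers[1:]):
            graph[u].add(v)
    return graph


def count_branches(graph):
    """Number of vertices with more than one outgoing edge."""
    return sum(1 for successors in graph.values() if len(successors) > 1)


if __name__ == "__main__":
    reads = ["AAGCT", "TTGCA"]  # toy reads sharing the short repeat "GC"
    for k in (2, 3):
        print(k, count_branches(build_de_bruijn(reads, k)))
```

On these toy reads the shared 2-mer `GC` creates a branch at k = 2 that disappears at k = 3, which is the branch side of the tradeoff the abstract describes; the gap side would only show up with reads shorter than or barely longer than k.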


34

Ozarar, Mert. "Prediction Of Protein Subcellular Localization Based On Primary Sequence Data." Master's thesis, METU, 2003. http://etd.lib.metu.edu.tr/upload/1082320/index.pdf.


Abstract:

Subcellular localization is crucial for determining the functions of proteins. A system called prediction of protein subcellular localization (P2SL) is designed that predicts the subcellular localization of proteins in eukaryotic organisms based on the amino acid content of primary sequences using amino acid order. The approach for prediction is to find the most frequent motifs for each protein in a given class based on clustering via self-organizing maps and then to use these most frequent motifs as features for classification with multilayer perceptrons. This approach allows a classification independent of the length of the sequence. In addition, a new encoding scheme for the amino acids is described that conserves biological function based on the point accepted mutation (PAM) substitution matrix. The statistical test results of the system are presented on a four-class problem. P2SL achieves slightly higher prediction accuracy than similar studies.


35

Shaolong, Chen. "Efficient data management strategies for sequence alignment on heterogeneous clusters." Doctoral thesis, Universitat Autònoma de Barcelona, 2019. http://hdl.handle.net/10803/667227.


Abstract:

Among high-performance computing systems, the Intel Xeon Phi is an accelerator that turns out to be a very attractive alternative for improving the performance of applications with intense computing needs that are traditionally executed on systems based on multicore servers. These applications can be migrated from a multicore server to an accelerator with a low coding effort because both systems are based on cores with the same basic architecture. In our study, we focused our attention on BWA, one of the most popular sequence aligners, and we have analyzed different modes of execution of BWA on various heterogeneous computing systems that incorporate an accelerator. Sequence alignment is a fundamental phase in the analysis of genomic variants and has a high computational cost. Although coding it to run on a multicore system can be simple, achieving good performance is not easy on this type of system, as our results show. We have developed and evaluated different strategies that have been applied to BWA and, of all of them, we conclude that the MDPR variant, which combines data parallelization and data replication, is the one that provides the best results on all systems evaluated. MDPR has a generic design that allows it to be used on different heterogeneous systems. On the one hand, we have applied it on a system consisting of a server with Intel Xeon multicore processors and a Xeon Phi accelerator. On the other hand, we have also evaluated it on other heterogeneous systems based on multicore servers equipped with AMD and Intel processors. In all these hardware configurations, we have tested two dynamic modes and one static mode of data distribution in MDPR. Our experimental results show that the best results for MDPR are obtained when the static mode of data distribution is applied. The dynamic strategy based on round robin achieves a similar performance without the initial overhead incurred by the static mode.
Although our proposal was applied to BWA using human genome data samples, this strategy can be easily applied to other sequence data and other alignment tools that have operating principles similar to those of the BWA aligner.


36

Raza, Atif [Verfasser]. "Metaheuristics for Pattern Mining in Big Sequence Data / Atif Raza." Mainz : Universitätsbibliothek der Johannes Gutenberg-Universität Mainz, 2021. http://d-nb.info/1231992875/34.


37

Scanlon, Eben Louis 1974. "Predicting the triple beta-spiral fold from primary sequence data." Thesis, Massachusetts Institute of Technology, 2004. http://hdl.handle.net/1721.1/16617.


Abstract:

Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science; and, (M.B.A.)--Massachusetts Institute of Technology Sloan School of Management, 2004.
Includes bibliographical references (leaves 118-125).
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
The Triple β-Spiral is a novel protein structure that plays a role in viral attachment and pathogenesis. At present, there are two Triple β-Spiral structures with solved crystallographic coordinates - one from Adenovirus and the other from Reovirus. There is evidence that the fold also occurs in Bacteriophage SF6. In this thesis, we present a computational analysis of the Triple β-Spiral fold. Our goal is to discover new instances of the fold in protein sequence databases. In Chapter 2, we present a series of sequence-based methods for the discovery of the fold. The final method in this Chapter is an iterative profile-based search that outperforms existing sequence-based algorithms. In Chapter 3, we introduce specific knowledge of the protein's structure into our prediction algorithms. Although this additional information does not improve the profile-based methods in Chapter 2, it does provide insight into the important forces that drive the Triple β-Spiral folding process. In Chapter 4, we employ logistic regression to integrate the score information from the previous Chapter into a single unified framework. This framework outperforms all previous methods in cross-validation tests. We do not discover a great number of additional instances of the Triple β-Spiral fold outside of the Adenovirus and Reovirus families. The results of our profile based templates and score integration tools, however, suggest that these methods might well succeed for other protein structures.
by Eben Louis Scanlon.
M.B.A.
S.M.


38

Simpson, Jared Thomas. "Efficient sequence assembly and variant calling using compressed data structures." Thesis, University of Cambridge, 2013. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.607828.


39

Szalay, Tamas. "Improved Analysis of Nanopore Sequence Data and Scanning Nanopore Techniques." Thesis, Harvard University, 2016. http://nrs.harvard.edu/urn-3:HUL.InstRepos:33493548.


Abstract:

The field of nanopore research has been driven by the need to inexpensively and rapidly sequence DNA. In order to help realize this goal, this thesis describes the PoreSeq algorithm that identifies and corrects errors in real-world nanopore sequencing data and improves the accuracy of de novo genome assembly with increasing coverage depth. The approach relies on modeling the possible sources of uncertainty that occur as DNA advances through the nanopore and then using this model to find the sequence that best explains multiple reads of the same region of DNA. PoreSeq increases nanopore sequencing read accuracy of M13 bacteriophage DNA from 85% to 99% at 100X coverage. We also use the algorithm to assemble E. coli with 30X coverage and the λ genome at a range of coverages from 3X to 50X. Additionally, we classify sequence variants at an order of magnitude lower coverage than is possible with existing methods. This thesis also reports preliminary progress towards controlling the motion of DNA using two nanopores instead of one. The speed at which the DNA travels through the nanopore needs to be carefully controlled to facilitate the detection of individual bases. A second nanopore in close proximity to the first could be used to slow or stop the motion of the DNA in order to enable a more accurate readout. The fabrication process for a new pyramidal nanopore geometry was developed in order to facilitate the positioning of the nanopores. This thesis demonstrates that two of them can be placed close enough to interact with a single molecule of DNA, which is a prerequisite for being able to use the driving force of the pores to exert fine control over the motion of the DNA. Another strategy for reading the DNA is to trap it completely with one pore and to move the second nanopore instead.
To that end, this thesis also shows that a single strand of immobilized DNA can be captured in a scanning nanopore and examined for a full hour, with data from many scans at many different voltages obtained in order to detect a bound protein placed partway along the molecule.
Engineering and Applied Sciences - Applied Physics


40

Di, Nardo Antonello. "Phylodynamic modelling of foot-and-mouth disease virus sequence data." Thesis, University of Glasgow, 2016. http://theses.gla.ac.uk/7558/.


Abstract:

The under-reporting of cases of infectious diseases is a substantial impediment to the control and management of infectious diseases in both epidemic and endemic contexts. Information about infectious disease dynamics can be recovered from sequence data using time-varying coalescent approaches, and phylodynamic models have been developed in order to reconstruct demographic changes of the numbers of infected hosts through time. In this study I have demonstrated the general concordance between empirically observed epidemiological incidence data and viral demography inferred through analysis of foot-and-mouth disease virus VP1 coding sequences belonging to the CATHAY topotype over large temporal and spatial scales. However a more precise and robust relationship between the effective population size (N_e) of a virus population and the number of infected hosts (or 'host units') (N) has proven elusive. The detailed epidemiological data from the exhaustively-sampled UK 2001 foot-and-mouth (FMD) epidemic combined with extensive amounts of whole genome sequence data from viral isolates from infected premises presents an excellent opportunity to study this relationship in more detail. Using a combination of real and simulated data from the outbreak I explored the relationship between N_e, as estimated through a Bayesian skyline analysis, and the empirical number of infected cases. I investigated the nature of this scaling defining prevalence according to different possible timings of FMD disease progression, and attempting to account for complex variability in the population structure. I demonstrated that the variability in the number of secondary cases per primary infection R_t and the population structure greatly impact on effective scaling of N_e. 
I further explored how the demographic signal carried by sequence data becomes weaker and less precise as the number of samples is reduced, including how the size and structure of the sampled dataset affect the accuracy of a reconstructed viral demography at any level of the transmission process. Methods drawn from phylodynamic inference combine powerful epidemiological and population genetic tools which can provide valuable insights into the dynamics of viral disease. However, the strict and sensitive dependency of the majority of these models on their assumptions makes estimates very fragile when these assumptions are violated. It is therefore essential that, for these methods to be applied as reliable tools supporting control programs, more focused theoretical research is undertaken to model the epidemiological dynamics of infected populations using sequence data.


41

Shrestha, Ram Krishna. "Management and analysis of HIV-1 ultra-deep sequence data." University of the Western Cape, 2014. http://hdl.handle.net/11394/8466.


Abstract:

Philosophiae Doctor - PhD
The continued success of antiretroviral programmes in the treatment of HIV is dependent on access to a cost-effective HIV drug resistance test (HIV-DRT). HIV-DRT involves sequencing a fragment of the HIV genome and characterising the presence/absence of mutations that confer resistance to one or more drugs. HIV-DRT using conventional DNA sequencing is prohibitively expensive (~US$150 per patient) for routine use in resource-limited settings such as many African countries. While the advent of ultra deep pyrosequencing (UDPS) approaches has considerably reduced (3-5 fold reduction) the cost of generating the sequence data, there has been an even more significant increase in the volume of data generated and the complexity involved in its analysis. In order to address this issue we have developed Seq2Res, a computational pipeline for HIV drug resistance testing from UDPS genotypic data. We have developed QTrim, software that undertakes high throughput quality trimming of UDPS sequencing data to ensure that subsequently analyzed data is of high quality. The comparison of QTrim to other widely used tools showed that it is equivalent to the next best method at trimming good quality data but outperforms all methods at trimming poor quality data. Further, we have developed, and evaluated, a computational approach for the analysis of UDPS sequence data generated using the novel Primer ID that enables the generation of a consensus sequence from all sequence reads originating from the same viral template, thus reducing the presence of PCR and sequencing induced errors in the dataset as well as reducing. We see that while the Primer ID approach does undoubtedly reduce the prevalence of PCR and sequencing induced errors, it artificially reduces the diversity of the subsequently analysed data due to the large volume of data that is discarded as a result of there being an insufficient number of sequences for consensus sequence generation.
We validated the sensitivity of the Seq2Res pipeline using two real biological datasets from the Stanford HIV Database and five simulated datasets. The Seq2Res results correlated fully with those of the Stanford database, and Seq2Res also identified a drug resistance mutation (DRM) that had been incorrectly interpreted by the Stanford approach. Further, the analysis of the simulated datasets showed that Seq2Res is capable of accurately identifying DRMs at all prevalence levels down to at least 1% of the sequence data generated from a viral population. Finally, we applied Seq2Res to UDPS resistance data generated from as many as 641 individuals as part of the CIPRA-SA study to evaluate the effectiveness of UDPS HIV drug resistance genotyping in resource-limited settings with a high burden of HIV infections. We find that, despite the FLX coverage being almost three times as much as that of the Junior platform, resistance genotyping results are directly comparable between the two approaches at a range of prevalence levels as low as 1%. Further, we find no significant difference between UDPS sequencing and the "gold standard" Sanger-based approach, thus indicating that pooling as many as 48 patients' data and sequencing using the Roche/454 Junior platform is a viable approach for HIV drug resistance genotyping. Further, we explored the presence of resistant minor variants in individuals' viral populations and find that the identification of minor resistant variants in individuals exposed to nevirapine through PMTCT correlates with the time since exposure. We conclude that HIV resistance genotyping is now a viable prospect for resource-limited settings with a high burden of HIV infections and that UDPS approaches are at least as sensitive as the currently used Sanger-based sequencing approaches. Further, the development of Seq2Res has provided a sensitive, easy to use and scalable technology that facilitates the routine use of UDPS for HIV drug resistance genotyping.
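The Primer ID idea described in the abstract above (one consensus sequence per viral template, with templates that have too few reads discarded) can be sketched as below. This is not Seq2Res code: the function name, the `min_reads` threshold, the per-position majority vote, and the assumption that reads sharing a Primer ID are already aligned to equal length are all illustrative.

```python
from collections import Counter, defaultdict


def primer_id_consensus(tagged_reads, min_reads=3):
    """Build one majority-vote consensus per Primer ID.

    tagged_reads: iterable of (primer_id, sequence) pairs. Sequences sharing
    a Primer ID are assumed to be pre-aligned and of equal length.
    Primer IDs with fewer than min_reads reads are dropped.
    """
    groups = defaultdict(list)
    for primer_id, seq in tagged_reads:
        groups[primer_id].append(seq)

    consensus = {}
    for primer_id, seqs in groups.items():
        if len(seqs) < min_reads:
            continue  # too few reads to call a reliable consensus
        consensus[primer_id] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


if __name__ == "__main__":
    reads = [("P1", "ACGT"), ("P1", "ACGT"), ("P1", "ACTT"), ("P2", "GGGG")]
    print(primer_id_consensus(reads))
```

In this toy input the single erroneous base in one `P1` read is voted out, while `P2` is discarded for having only one read, which mirrors the loss of diversity the abstract attributes to the approach.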


42

Thorell, Stina. "The transaldolase family : structure, function and evolution /." Stockholm, 2001. http://diss.kib.ki.se/2001/91-628-4923-9/.


43

Li, Yaoman, and 李耀满. "Efficient methods for improving the sensitivity and accuracy of RNA alignments and structure prediction." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2013. http://hdl.handle.net/10722/195977.


Abstract:

RNA plays an important role in molecular biology. RNA sequence comparison is an important method for analyzing gene expression. Because aligning RNA reads requires handling gaps, mutations, poly-A tails, etc., it is much more difficult than aligning other sequences. In this thesis, we study RNA-Seq alignment tools, the existing gene information database, and how to improve the accuracy of alignment and predict RNA secondary structure. The known gene information database contains a lot of reliable gene information that has been discovered. We also note that most DNA alignment tools are well developed: they can run much faster than existing RNA-Seq alignment tools and have higher sensitivity and accuracy. Combining them with the known gene information database, we present a method to align RNA-Seq data using DNA alignment tools; that is, we use the DNA alignment tools to perform the alignment and use the gene information to convert it to a genome-based alignment. Although the gene information database is updated daily, many genes and alternative splicings have not yet been discovered. If our RNA alignment tool relied only on the known gene database, many reads that come from unknown genes or alternative splicings could not be aligned. Thus, we show a combinational method that can cover potential alternative splicing junction sites. Combined with the original gene database, the new alignment tool can cover most alignments that are reported by other RNA-Seq alignment tools. Recently, a lot of RNA-Seq alignment tools have been developed. They are more powerful and faster than the older generation of tools. However, RNA read alignment is much more complicated than other sequence alignment, and the alignments reported by some RNA-Seq alignment tools have low accuracy. We present a simple and efficient filter method based on the quality score of the reads. It can filter out most low-accuracy alignments.
Finally, we present an RNA secondary structure prediction method that can predict pseudoknots (a type of RNA secondary structure) with high sensitivity and specificity.
Computer Science
Master
Master of Philosophy


44

Wang, Yi, and 王毅. "Binning and annotation for metagenomic next-generation sequencing reads." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2014. http://hdl.handle.net/10722/208040.


Abstract:

The development of next-generation sequencing technology enables us to obtain a vast number of short reads from metagenomic samples. In metagenomic samples, the reads from different species are mixed together. So, metagenomic binning has been introduced to cluster reads from the same or closely related species and metagenomic annotation is introduced to predict the taxonomic information of each read. Both metagenomic binning and annotation are critical steps in downstream analysis. This thesis discusses the difficulties of these two computational problems and proposes two algorithmic methods, MetaCluster 5.0 and MetaAnnotator, as solutions. There are six major challenges in metagenomic binning: (1) the lack of reference genomes; (2) uneven abundance ratios; (3) short read lengths; (4) a large number of species; (5) the existence of species with extremely-low-abundance; and (6) recovering low-abundance species. To solve these problems, I propose a two-round binning method, MetaCluster 5.0. The improvement achieved by MetaCluster 5.0 is based on three major observations. First, the short q-mer (length-q substring of the sequence with q = 4, 5) frequency distributions of individual sufficiently long fragments sampled from the same genome are more similar than those sampled from different genomes. Second, sufficiently long w-mers (length-w substring of the sequence with w ≈ 30) are usually unique in each individual genome. Third, the k-mer (length-k substring of the sequence with k ≈ 16) frequencies from reads of a species are usually linearly proportional to that of the species’ abundance. 
The metagenomic annotation methods in the literature often suffer from five major drawbacks: (1) inability to annotate many reads; (2) less precise annotation for reads and more incorrect annotation for contigs; (3) inability to deal well with novel clades that have limited reference genomes; (4) performance affected by variable genome sequence similarities between different clades; and (5) high time complexity. In this thesis, a novel tool, MetaAnnotator, is proposed to tackle these problems. There are four major contributions of MetaAnnotator. Firstly, instead of annotating reads/contigs independently, a cluster of reads/contigs is annotated as a whole. Secondly, multiple reference databases are integrated. Thirdly, for each individual clade, quadratic discriminant analysis is applied to capture the similarities between reference sequences in the clade. Fourthly, instead of using alignment tools, MetaAnnotator performs annotation using k-mer exact matching, which is more efficient. Experiments on both simulated datasets and real datasets show that MetaCluster 5.0 and MetaAnnotator outperform existing tools with higher accuracy as well as lower time and space cost.
Computer Science
Doctoral
Doctor of Philosophy
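The first observation in the abstract above (short q-mer frequency distributions are more similar for fragments of the same genome) relies on turning each fragment into a q-mer frequency vector. A minimal sketch follows; the function names are hypothetical, and the Euclidean distance is an assumption chosen for illustration, since the abstract does not state which similarity measure MetaCluster 5.0 uses.

```python
from collections import Counter
from itertools import product
from math import sqrt


def qmer_frequencies(seq, q=4):
    """Normalized frequency vector over all 4**q q-mers of the ACGT alphabet.

    q-mers containing other characters (e.g. N) are ignored.
    """
    all_qmers = ["".join(p) for p in product("ACGT", repeat=q)]
    valid = set(all_qmers)
    counts = Counter(
        seq[i:i + q] for i in range(len(seq) - q + 1) if seq[i:i + q] in valid
    )
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(all_qmers)
    return [counts[m] / total for m in all_qmers]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


if __name__ == "__main__":
    a = qmer_frequencies("ACGTACGTACGTACGT")
    b = qmer_frequencies("ACGTACGTTCGTACGA")
    print(len(a), euclidean(a, b))
```

Fragments would then be clustered on these vectors; with q = 4 each vector has 256 entries, so fragments need to be reasonably long for the frequencies to be stable.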


45

Stapert, R. P. "A segmental mixture model, maximising data use with time sequence information." Thesis, Swansea University, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.639099.


Abstract:

Here, time sequence information is explored as a means of increasing the amount of speaker specific information to be gained from limited data. One of the popular approaches in speaker recognition at the time of writing is called Gaussian mixture modeling which does not use time sequence information as it is implemented here. In this thesis, an attempt is made to use time sequence information without any prior linguistic knowledge or labelling of the databases. This is achieved by embedding dynamic time warping into a Gaussian mixture model structure. The story that is told here covers the main points that need to be investigated in order to create a viable foundation for the inclusion of dynamic time warping in a Gaussian mixture model. The experimental results show that temporal constraints offer better speaker discrimination than unconstrained nearest neighbour decisions. It is also shown that using speech segments shorter than the actual utterance, in combination with dynamic time warping, can provide additional error reduction. This foundation work prompts the work on Gaussian mixture models, which reveals that the combination of dynamic time warping and Gaussian mixture models can improve identification results significantly in a text independent environment. The term segmental mixture model is used to identify the combination of two techniques. It is tested on twenty speakers of the BT Millar database, which is a multi-session digit database, and on one thousand speakers of the Welsh SpeechDat database, which is a large text independent database. In both instances the segmental mixture model demonstrates its potential for enhancing the discrimination between speakers.
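Dynamic time warping, which the abstract above embeds into a Gaussian mixture model, aligns two sequences of different lengths by allowing elements to be stretched in time. The sketch below shows only the classic DTW recurrence on scalar sequences with an absolute-difference cost; real speaker recognition would use frame-level feature vectors, and the segmental mixture model itself is not reproduced here. The function name is an assumption.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats a match
                                 cost[i][j - 1],      # b[j-1] repeats a match
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


if __name__ == "__main__":
    # The second sequence is the first with one element held longer.
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because the warping path may repeat elements, a sequence and a time-stretched copy of it align at zero cost, which is the temporal constraint the abstract contrasts with unconstrained nearest-neighbour decisions.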


46

Zhang, Qi Wang Wei. "Mining emerging massive scientific sequence data using block-wise decomposition methods." Chapel Hill, N.C. : University of North Carolina at Chapel Hill, 2009. http://dc.lib.unc.edu/u?/etd,2530.


Abstract:

Thesis (Ph. D.)--University of North Carolina at Chapel Hill, 2009.
Title from electronic title page (viewed Oct. 5, 2009). "... in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer Science." Discipline: Computer Science; Department/School: Computer Science.


47

Mumpower, Eric J. P. "FITSL : a language for directed exploration and analysis of sequence data." Thesis, Massachusetts Institute of Technology, 2007. http://hdl.handle.net/1721.1/41653.


Abstract:

Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.
Includes bibliographical references (p. 81-84).
This thesis describes a sequence-data processing toolkit for analysis of Intelligent Tutoring System (ITS) log data, that unlike other tools allows directed exploration of sequence patterns. This system provides a powerful yet straightforward abstraction for sequence-data processing, and a set of high-level manipulation primitives which allow arbitrarily complex transformations of such data. Using this language, very sophisticated queries can be performed using only a few lines of code. Furthermore, queries can be constructed interactively, allowing for rapid development, refinement, and comparison of hypotheses. Importantly, this system is not limited to ITS logs, but is equally applicable to the manipulation of any form of (potentially multidimensional) sequence data.
by Eric J.P. Mumpower.
M.Eng.


48

Svärd, Karl. "Developing new methods for estimating population divergence times from sequence data." Thesis, Uppsala universitet, Institutionen för medicinsk biokemi och mikrobiologi, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-450123.


Abstract:

Methods for estimating past demographic events of populations are powerful tools for gaining insight into otherwise hidden pasts. Human genetic data is a valuable resource for these purposes, as patterns of variation can reveal the past evolutionary forces and historical events that generated them. There is, however, a lack of methods within the field that use this information to its full extent. This project therefore set out to develop a set of new alternatives for estimating demographic events. The work was based on modifying the purely sequence-based method TTo (Two-Two-outgroup) for estimating divergence times of two populations. The modifications consisted of using beta distributions to model the polymorphic diversity of the ancestral population in order to increase the maximum possible sample size. The project resulted in two implemented methods: TT-beta and a partial variant of MM. TT-beta was able to produce estimates in the same range as TTo and showed that the use of beta distributions had real potential. For MM, only a partial implementation could be completed, but it also showed promise and the ability to use varying sample sizes to estimate demographic values.

APA, Harvard, Vancouver, ISO, and other styles

49

Bajalan, Amanj. "Improved methods for virus detection and discovery in metagenomic sequence data." Thesis, Uppsala universitet, Institutionen för biologisk grundutbildning, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-412478.

Full text

APA, Harvard, Vancouver, ISO, and other styles

50

Álvarez-Carretero, Sandra. "BACTpipe : Characterization of bacterial isolates based on whole-genome sequence data." Thesis, Högskolan i Skövde, Institutionen för biovetenskap, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-15033.

Full text

Abstract:

Technological advances have led to faster and more cost-effective sequencing platforms, making it quicker and more affordable to generate genomic sequence data. For the study of bacterial genomes, two main methods can be used, whole-genome sequencing and metagenomic shotgun sequencing, of which the first has been the most widely used in recent years. As a consequence of these advances, a vast amount of data is currently available, and the need for bioinformatics tools to efficiently analyse and interpret it has dramatically increased. At present, there is a great number of tools to use in each step of bacterial genome characterization: (1) pre-processing, (2) de novo assembly, (3) annotation, and (4) taxonomic and functional comparisons. It is therefore difficult to decide which tools are best to use, and the analysis is slowed down when changing from one tool to another. To tackle this, the pipeline BACTpipe was developed. This pipeline chains together bioinformatics tools, selected on the basis of prior testing, with additional scripts to perform the whole bacterial analysis at once. The most relevant outputs generated by BACTpipe are the annotated de novo assembled genomes, the Newick file containing the phylogenetic relationships between species, and the gene presence-absence matrix, which users can then filter according to their interests. When BACTpipe was tested on a set of bacterial whole-genome sequence data, 60 of the 18195 genes found across all the Lactobacillus species analysed were classified as core genes, i.e. genes shared among all these species. Housekeeping genes or genes involved in the replication, transcription, or translation processes were identified

APA, Harvard, Vancouver, ISO, and other styles
