Dissertations / Theses on the topic 'Dati NGS'

Consult the top 50 dissertations / theses for your research on the topic 'Dati NGS.'


1

LAMONTANARA, ANTONELLA. "Sviluppo ed applicazione di pipelines bioinformatiche per l'analisi di dati NGS." Doctoral thesis, Università Cattolica del Sacro Cuore, 2015. http://hdl.handle.net/10280/6068.

Abstract:
The advance in sequencing technologies has led to the birth of sequencing platforms able to produce gigabases of sequencing data in a single run. These technologies, commonly referred to as Next Generation Sequencing or NGS, produce millions of short sequences called "reads", generating large and complex datasets that pose several challenges for bioinformatics. The analysis of large omics datasets requires the development of bioinformatics pipelines, i.e., chains of bioinformatics tools in which the output of one analysis is the input of the subsequent analysis; a certain amount of scripting is needed to chain together the existing software tools. This thesis deals with the methodological aspects of the analysis of NGS data produced with the Illumina technology. Three bioinformatics pipelines were developed and applied to the following case studies: 1) a global transcriptome profiling of Olea europaea during cold acclimation, aimed at unravelling the molecular mechanisms of cold acclimation in this species; 2) SNP profiling in the transcriptome of two cattle breeds, aimed at producing an extensive catalogue of SNPs; 3) the sequencing, assembly and annotation of the genome of a Lactobacillus plantarum strain showing potential probiotic properties.
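The "computational chain" idea described here, where each tool's output feeds the next, can be sketched in a few lines of Python. The tool names, file names and arguments below (Trimmomatic, HISAT2, samtools, featureCounts) are generic illustrations, not the specific commands used in the thesis, and the tools are assumed to be on PATH.

```python
# Minimal sketch of a bioinformatics pipeline: each step wraps an existing
# command-line tool, and the output file of one step is the input of the next.
import subprocess

def run(cmd):
    """Run one pipeline step, aborting the chain if the step fails."""
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

def rnaseq_pipeline(reads, genome_index, annotation):
    run(["trimmomatic", "SE", reads, "trimmed.fq", "SLIDINGWINDOW:4:20"])      # read trimming
    run(["hisat2", "-x", genome_index, "-U", "trimmed.fq", "-S", "aln.sam"])   # alignment
    run(["samtools", "sort", "-o", "aln.bam", "aln.sam"])                      # sort alignments
    run(["featureCounts", "-a", annotation, "-o", "counts.txt", "aln.bam"])    # per-gene counts

rnaseq_pipeline("sample.fastq", "genome_index", "genes.gtf")
```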
2

Giannini, Simone. "Strumenti statistici per elaborazione dati su sequenziamenti di genoma umano." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amslaurea.unibo.it/12059/.

Abstract:
DNA analysis is one of the keys to understanding life and how it works. Next-generation sequencing (NGS) techniques allow the parallel analysis of many sequences, which has made the sequencing of whole genomes possible and enabled the use of these data in a wide range of studies. This thesis describes the main NGS sequencing techniques. Regarding the human genome, some topics in variant analysis addressed by the 1000 Genomes Project are discussed. The final part introduces statistical definitions useful for data processing and describes some tools that allow this type of analysis to be performed.
3

DENTI, LUCA. "Algorithms for analyzing genetic variability from Next-Generation Sequencing data." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2020. http://hdl.handle.net/10281/263551.

Abstract:
DNA contains the genetic information that is essential for the correct development of any organism. Being able to investigate DNA is of utmost importance for analyzing the causes of diseases and for improving the quality of life. The development of DNA sequencing technologies has revolutionized the way this kind of investigation is performed. Due to the huge amount of sequencing data available, computer science nowadays plays a key role in their analysis. Luckily, in many applications the biological information contained in a DNA molecule can be represented as a string in which each character represents a nucleotide. Strings are a well-studied notion in computer science, and it is therefore possible to exploit the extensive literature on storing and processing strings to improve the analysis of DNA. Within this context, this thesis focuses on two specific problems arising from the analysis of sequencing data: the study of transcript variability due to alternative splicing and the investigation of genetic variability among individuals due to small variations such as Single Nucleotide Polymorphisms (SNPs) and indels. For both problems, we investigate novel computational approaches by devising original strategies, and we prove their efficacy by comparing them with the most used state-of-the-art approaches. In both areas, our focus is on the development of bioinformatics tools that combine accurate algorithms with efficient data structures. The first problem we tackle is the detection of alternative splicing events from RNA-Seq data. Alternative splicing plays an important role in many aspects of life, from the correct development of an individual to the onset of diseases. Differently from current techniques that rely on the reconstruction of transcripts or on the spliced alignment of RNA-Seq reads against a reference genome, we investigate an alternative algorithmic approach that exploits the novel notion of alignment against a splicing graph. We implemented this approach in a tool, called ASGAL, that aligns an RNA-Seq sample against the splicing graph of a gene and then detects the alternative splicing events supported by the sample by comparing the alignments with the gene annotation. ASGAL is the first tool that aligns reads against a splicing graph and that is able to detect novel alternative splicing events even when only a single transcript per gene is supported by the sample. The results of our experiments show the usefulness of aligning reads against a splicing graph and prove the ability of the proposed approach to detect alternative splicing events. The second problem we tackle is the genotyping of a set of known SNPs and indels from sequencing data. An in-depth analysis of these variants makes it possible to understand the genetic variability among individuals of a population and their genetic risk factors for diseases. Standard pipelines for variant discovery and genotyping include read alignment, a computationally expensive procedure that is too time-consuming for typical clinical applications. When variant discovery is not required, it is possible to avoid read alignment by genotyping only the set of known variants that are already established to be of medical relevance. To solve this problem, we devised a novel alignment-free algorithmic approach and implemented it in a bioinformatics tool, called MALVA. MALVA is the first alignment-free approach able to genotype SNPs, indels, and multi-allelic variants. Thanks to its alignment-free strategy, MALVA requires an order of magnitude less time than alignment-based pipelines to genotype a donor individual while achieving similar accuracy. Remarkably, on indels it provides even better results than the most widely adopted approaches.
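As an illustration of the alignment-free idea attributed to MALVA above, the toy Python sketch below genotypes one known variant by counting the k-mers that span its reference and alternate alleles in the reads. It shows the general concept only; it is not MALVA's actual algorithm or data structures, and the contexts, thresholds and genotype rule are illustrative.

```python
# Toy alignment-free genotyping: count read k-mers supporting the reference
# and alternate allele contexts of a known variant, then call the genotype.
from collections import Counter

K = 15

def kmers(seq, k=K):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def genotype(ref_context, alt_context, read_kmers, min_ratio=0.2):
    ref_support = sum(read_kmers[km] for km in kmers(ref_context))
    alt_support = sum(read_kmers[km] for km in kmers(alt_context))
    total = ref_support + alt_support
    if total == 0:
        return "./."                      # no informative k-mers observed
    alt_frac = alt_support / total
    if alt_frac < min_ratio:
        return "0/0"                      # homozygous reference
    if alt_frac > 1 - min_ratio:
        return "1/1"                      # homozygous alternate
    return "0/1"                          # heterozygous

# ref/alt contexts: the variant site plus flanking bases (toy values)
reads = ["ACGTACGTACGTTGCAACGTACGT", "ACGTACGTACGTAGCAACGTACGT"]
read_kmers = Counter(km for r in reads for km in kmers(r))
print(genotype("ACGTACGTACGTTGCAACGTACGT", "ACGTACGTACGTAGCAACGTACGT", read_kmers))
```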
4

Bombonato, Juliana Rodrigues. "Dados filogenômicos para inferência de relações evolutivas entre espécies do gênero Cereus Mill. (Cactaceae, Cereeae)." Universidade de São Paulo, 2018. http://www.teses.usp.br/teses/disponiveis/59/59139/tde-08062018-160032/.

Abstract:
Phylogenomic studies using Next Generation Sequencing (NGS) are becoming increasingly common. The use of Double Digest Restriction Site Associated DNA Sequencing (ddRADSeq) markers to this end is promising, at least considering its cost-effectiveness in large datasets of non-model groups, as well as the genome-wide representation recovered in the data. Here we used ddRADSeq to infer the species-level phylogeny of the genus Cereus (Cactaceae). This genus comprises about 25 recognized, predominantly South American species distributed into four subgenera. Our sample includes representatives of Cereus, species from the closely allied genera Cipocereus and Praecereus, and outgroups. The ddRADSeq library was prepared using the EcoRI and HPAII enzymes. After quality control (fragment size and quantification), the library was sequenced on an Illumina HiSeq 2500. The bioinformatic processing of raw FASTQ files included adapter trimming, quality filtering (FastQC, MultiQC and SeqyClean) and SNP calling (iPyRAD). Three scenarios of permissiveness to missing data were carried out in iPyRAD, recovering datasets with 333 (up to 40% missing data), 1440 (up to 60% missing data) and 6141 (up to 80% missing data) loci. For each dataset, Maximum Likelihood (ML) trees were generated using two supermatrices: linked SNPs and loci. In general, we observed some inconsistencies between ML trees generated by different programs (IQ-TREE and RAxML) or based on different matrix types (linked SNPs and loci). On the other hand, accuracy and resolution were improved using the largest dataset (up to 80% missing data). Overall, we present a phylogeny with unprecedented resolution for the genus Cereus, which was resolved as a likely monophyletic group composed of four main clades with high support for their internal relationships. Further, our data add information to the debate on increasing missing data when conducting phylogenetic analyses with RAD loci.
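The missing-data thresholds described above (40%, 60%, 80%) amount to keeping a locus only when it was recovered in enough samples. A minimal Python sketch of that filtering step, on toy data rather than iPyRAD's actual output format, is:

```python
# Keep a locus only if its fraction of missing samples is below a threshold,
# mirroring the idea of assembling datasets at different permissiveness levels.
def filter_loci(matrix, max_missing):
    """matrix: dict locus -> {sample: sequence or None (missing)}."""
    kept = {}
    for locus, calls in matrix.items():
        missing = sum(1 for seq in calls.values() if seq is None) / len(calls)
        if missing <= max_missing:
            kept[locus] = calls
    return kept

loci = {
    "L1": {"s1": "ACGT", "s2": "ACGA", "s3": None, "s4": "ACGT"},
    "L2": {"s1": None,   "s2": None,   "s3": None, "s4": "ACGT"},
}
for threshold in (0.4, 0.6, 0.8):
    print(threshold, sorted(filter_loci(loci, threshold)))
```

At the most permissive threshold both toy loci survive; at the stricter ones only the well-sampled locus remains, which is why the stricter datasets above contain fewer loci.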
5

Alic, Andrei Stefan. "Improved Error Correction of NGS Data." Doctoral thesis, Universitat Politècnica de València, 2016. http://hdl.handle.net/10251/67630.

Abstract:
The work done for this doctoral thesis focuses on error correction of Next Generation Sequencing (NGS) data in the context of High Performance Computing (HPC). Due to the reduction in sequencing cost, the increasing output of the sequencers and the advancements in the biological and medical sciences, the amount of NGS data has increased tremendously. Humans alone cannot keep pace with this explosion of information; computers must therefore assist them in handling the deluge of information generated by the sequencing machines. Since NGS is no longer just a research topic (it is used in clinical routine to detect cancer mutations, for instance), requirements on performance and accuracy are more stringent. For sequencing to be useful outside research, the analysis software must work accurately and fast. This is where HPC comes into play. NGS processing tools should leverage the full potential of multi-core and even distributed computing, as those platforms are extensively available. Moreover, as the performance of the individual core has hit a barrier, current computing trends focus on adding more cores and splitting the computation explicitly to take advantage of them. This thesis starts with a deep analysis of all these problems in a general and comprehensive way (to reach a very wide audience), in the form of an exhaustive and objective review of the NGS error correction field. We dedicate a chapter to this topic to introduce the reader gradually into the world of sequencing. It presents real problems and applications of NGS that demonstrate the impact this technology has on science. The review results in the following conclusions: the need to understand the specificities of NGS data samples (given the high variety of technologies and features) and the need for flexible, efficient and accurate tools for error correction as a preliminary step of any NGS post-processing. As a result of the explosion of NGS data, we introduce MuffinInfo, a piece of software capable of extracting information from the raw data produced by the sequencer to help the user understand the data. MuffinInfo uses HTML5 and therefore runs in almost any software and hardware environment. It supports custom statistics to mould itself to specific requirements. MuffinInfo can reload the results of a run, which are stored in JSON format for easier integration with third-party applications. Finally, our application uses threads to perform the calculations, to load the data from the disk and to handle the UI. Continuing our research, and given the single-core performance limitation, we leverage the power of multi-core computers to develop a new error correction tool. Error correction of NGS data is normally the first step of any analysis targeting NGS. As we conclude from the review performed within the frame of this thesis, many projects in different real-life applications have opted for this step before further analysis. In this sense, we propose MuffinEC, a multi-technology (Illumina, Roche 454, Ion Torrent and, experimentally, PacBio) corrector handling any type of error (mismatches, deletions, insertions and unknown values). It surpasses other similar software by providing higher accuracy (demonstrated by three types of tests) and using fewer computational resources. It follows a multi-step approach that starts by grouping all the reads using a k-mer-based metric. Next, it employs the powerful Smith-Waterman algorithm to refine the groups and generate Multiple Sequence Alignments (MSAs). These MSAs are corrected by taking each column and looking for the correct base, determined by a user-adjustable percentage. This manuscript is structured in chapters based on material previously published in journals indexed in the Journal Citation Reports and in relevant conferences.
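The column-wise correction step described here is simple to illustrate: within one group's MSA, a column is rewritten to its dominant base when that base's frequency clears a user-adjustable threshold. A minimal Python sketch of the idea (not MuffinEC's actual code), assuming the MSA is given as equal-length strings:

```python
# Correct an MSA column-by-column: if one base dominates a column above the
# threshold, overwrite outlier bases in that column with the consensus base.
from collections import Counter

def correct_msa(msa, threshold=0.7):
    corrected = [list(read) for read in msa]
    for col in range(len(msa[0])):
        counts = Counter(read[col] for read in msa)
        base, n = counts.most_common(1)[0]
        if n / len(msa) >= threshold:      # consensus is strong enough
            for row in corrected:
                row[col] = base            # replace likely sequencing errors
    return ["".join(row) for row in corrected]

msa = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAC"]
print(correct_msa(msa))   # the lone 'C' in column 2 is corrected to 'G'
```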
Alic, AS. (2016). Improved Error Correction of NGS Data [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/67630
6

Spáčil, Michael. "Zálohování dat a datová úložiště." Master's thesis, Vysoké učení technické v Brně. Fakulta podnikatelská, 2021. http://www.nusl.cz/ntk/nusl-444686.

Abstract:
The diploma thesis focuses on the design of a backup system that increases both the efficiency of working with stored data and the security of that data. The analysis of the current state describes the company and audits its existing backup system using the Zefis.cz portal. The following part presents the design of a new backup system focused on a comprehensive solution using the cloud, magnetic tapes, and high server availability.
7

Hriadeľ, Ondřej. "Návrh a implementace plánu zálohování dat společnosti." Master's thesis, Vysoké učení technické v Brně. Fakulta podnikatelská, 2019. http://www.nusl.cz/ntk/nusl-399540.

Abstract:
This diploma thesis focuses on the development of a new backup plan and its implementation. The introductory part explores the theoretical background of data backup and data management. The next part is dedicated to an analysis of the current state and of the investor's requirements. The last part covers the implementation of the new backup plan from an economic and quality point of view. Besides the design and realization of the backup plan, a backup directive is also drafted.
8

Janíček, Libor. "Zálohování dat a datová úložiště." Master's thesis, Vysoké učení technické v Brně. Fakulta podnikatelská, 2020. http://www.nusl.cz/ntk/nusl-417707.

Abstract:
The master's thesis focuses on issues associated with data backup and data storage. It deals with a real data backup problem at a municipal office. Part of the work is a thorough analysis of the current state, together with suggestions for improvement.
9

Chen, Dao-Peng. "Statistical power for RNA-seq data to detect two epigenetic phenomena." The Ohio State University, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=osu1357248975.

10

SAGGESE, IGOR. "NGS data analysis approaches for clinical applications." Doctoral thesis, Università del Piemonte Orientale, 2017. http://hdl.handle.net/11579/86924.

11

Otto, Raik. "Distance-based methods for the analysis of Next-Generation sequencing data." Doctoral thesis, Humboldt-Universität zu Berlin, 2021. http://dx.doi.org/10.18452/23267.

Abstract:
The analysis of NGS data is a central aspect of modern molecular genetics and oncology. The first scientific contribution is the development of a method which identifies whole-exome-sequenced cancer cell lines (CCLs) via the quantification of a distance between their sets of small genomic variants. A distinguishing aspect of the method is that it was designed for the computer-based identification of NGS-sequenced CCLs. An unknown CCL is identified when its abstract distance to a known CCL is smaller than would be expected by chance. The method performed favorably in benchmarks but only supported the whole-exome sequencing technology. The second contribution therefore extended the identification method to additionally support the bulk mRNA-sequencing technology and the panel-sequencing format. However, the technological extension incurred predictive biases which detrimentally affected the quantification of the abstract distances. Hence, statistical methods were introduced to quantify and compensate for these confounding factors. The extended method showed a heterogeneity-robust benchmark performance at the trade-off of a slightly reduced sensitivity compared to the whole-exome-sequencing method. The third contribution is a method which trains machine-learning models for rare and diverse cancer types: an abstract distance between neoplastic entities and entities of healthy origin is established via transcriptomic deconvolution, and machine-learning models are subsequently trained on these distances to predict clinically relevant characteristics. The performance of such-trained models was comparable to that of models trained on both the substituted neoplastic data and the gold-standard biomarker Ki-67. No proliferation-rate-indicative features were utilized to predict clinical characteristics, which is why the method can complement the proliferation-rate-oriented pathological assessment of biopsies. The thesis shows that the quantification of an abstract distance can address sources of erroneous NGS data analysis.
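As a rough illustration of identifying a cell line by a distance between variant sets, the sketch below uses a plain Jaccard distance over (chromosome, position, alternate-allele) tuples and a fixed cut-off. The thesis defines its own distance and significance criterion, so both the metric and the threshold here are stand-ins, and the reference sets are made up.

```python
# Identify a query sample by its nearest reference cell line in variant-set space.
def jaccard_distance(a, b):
    return 1 - len(a & b) / len(a | b)

reference_ccls = {
    "LINE_A": {("1", 1234, "T"), ("2", 999, "G"), ("7", 4242, "A")},
    "LINE_B": {("1", 1234, "T"), ("3", 777, "C")},
}

def identify(query, references, max_dist=0.6):
    best = min(references, key=lambda name: jaccard_distance(query, references[name]))
    d = jaccard_distance(query, references[best])
    return (best, d) if d <= max_dist else (None, d)   # None: no confident match

query = {("1", 1234, "T"), ("2", 999, "G")}
print(identify(query, reference_ccls))   # ('LINE_A', 0.333...)
```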
12

Qiao, Dandi. "Statistical Approaches for Next-Generation Sequencing Data." Thesis, Harvard University, 2012. http://dissertations.umi.com/gsas.harvard:10689.

Abstract:
During the last two decades, genotyping technology has advanced rapidly, enabling the tremendous success of genome-wide association studies (GWAS) in the search for disease susceptibility loci (DSLs). However, only a small fraction of the overall predicted heritability can be explained by the DSLs discovered. One possible explanation for this "missing heritability" phenomenon is that many causal variants are rare. The recent development of high-throughput next-generation sequencing (NGS) technology provides the instrument to look closely at these rare variants with precision and efficiency. However, new approaches for both the storage and the analysis of sequencing data are in imminent need. In this thesis, we introduce three methods that can be utilized in the management and analysis of sequencing data. In Chapter 1, we propose a novel and simple algorithm for compressing sequencing data that leverages the sparsity of rare-variant data, enabling the storage and analysis of sequencing data efficiently in current hardware environments. We also provide a C++ implementation that supports direct and parallel loading of the compressed format without requiring extra time for decompression. Chapters 2 and 3 focus on the association analysis of sequencing data in population-based designs. In Chapter 2, we present a statistical methodology that allows the identification of genetic outliers in order to obtain a genetically homogeneous subpopulation, which reduces the false positives due to population substructure. Our approach is computationally efficient, can be applied to all the genetic loci in the data, and does not require pruning of variants in linkage disequilibrium (LD). In Chapter 3, we propose a general analysis framework in which thousands of genetic loci can be tested simultaneously for association with complex phenotypes. The approach is built on spatial-clustering methodology, assuming that the genetic loci associated with the target phenotype cluster in certain genomic regions. In contrast to standard methodology for multi-locus analysis, which has focused on dimension reduction of the data, the proposed approach profits from the availability of large numbers of genetic loci. Thus it will be especially relevant for whole-genome sequencing studies, which commonly record several thousand loci per gene.
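The compression idea of Chapter 1 can be illustrated as follows: because most samples carry the reference allele at any rare variant, storing only the non-reference entries of the genotype matrix is enough. A toy Python sketch of the principle (the thesis provides an actual C++ implementation; this shows only the idea, not its format):

```python
# Sparse storage of a genotype matrix: keep only non-reference entries per variant.
def compress(genotypes):
    """genotypes: list of per-variant lists of 0/1/2 allele counts across samples."""
    return [
        {i: g for i, g in enumerate(row) if g != 0}   # keep only carriers
        for row in genotypes
    ]

def lookup(sparse, variant, sample):
    return sparse[variant].get(sample, 0)             # absent sample => reference (0)

dense = [[0, 0, 1, 0, 0, 0], [0, 2, 0, 0, 0, 0]]
sparse = compress(dense)
print(sparse)                                         # [{2: 1}, {1: 2}]
print(lookup(sparse, 0, 2), lookup(sparse, 0, 0))     # 1 0
```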
13

Prieto, Barja Pablo 1986. "NGS applications in genome evolution and adaptation : A reproducible approach to NGS data analysis and integration." Doctoral thesis, Universitat Pompeu Fabra, 2017. http://hdl.handle.net/10803/565601.

Abstract:
In this PhD I have used NGS technologies in different organisms and scenarios, such as in ENCODE, comparing the conservation and evolution of long non-coding RNA sequences between human and mouse using experimental evidence from the genome, transcriptome and chromatin. A similar approach was followed in other organisms such as the Mesoamerican common bean and chicken. Other analyses carried out with NGS data involved the well-known parasite Leishmania donovani, the causative agent of leishmaniasis. I used NGS data obtained from the genome and transcriptome to study the fate of its genome in survival strategies for adaptation and long-term evolution. All this work was carried out while working on tools and strategies to efficiently design and implement bioinformatics analyses, also known as pipelines or workflows, in order to make them easy to use, easily deployable, accessible and highly performing. This work has provided several strategies to avoid the lack of reproducibility and the inconsistency in scientific research, with real biological applications in sequence analysis and genome evolution.
14

Ranciati, Saverio <1988>. "Statistical modelling of spatio-temporal dependencies in NGS data." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2016. http://amsdottorato.unibo.it/7680/1/thesis_ranciati_saverio.pdf.

Abstract:
Next-generation sequencing (NGS) has rapidly become the current standard in genetics-related analyses. This switch from microarrays to NGS required new statistical strategies to address the research questions inherent to the considered phenomena. First and foremost, NGS datasets usually consist of discrete observations characterized by overdispersion - that is, a discrepancy between expected and observed variability - and an abundance of zeros, measured across a huge number of regions of the genome. With respect to chromatin immunoprecipitation sequencing (ChIP-Seq), a class of NGS data, the primary focus is to discover the underlying (unobserved) pattern of 'enrichment': more particularly, there is interest in the interactions between genes (or broader regions of the genome) and proteins, as they describe the mechanism of regulation under different conditions such as healthy or damaged tissue. Another interesting research question involves the clustering of these observations into groups that have practical relevance and interpretability, considering in particular that a single unit could potentially be allocated to more than one of these clusters, as it is reasonable to assume that its participation is not exclusive to a single biological function and/or mechanism. Many of these complex processes could also be described by sets of ordinary differential equations (ODEs), which are mathematical representations of the changes of a system through time, following a dynamic governed by some parameters of interest. In this thesis, we address the aforementioned tasks and research questions employing different statistical strategies, such as model-based clustering, graphical models, penalized smoothing and regression. We propose extensions of existing approaches to better fit the problem at hand, and we elaborate the methodology in a Bayesian environment, focusing on incorporating the structural dependencies - both spatial and temporal - of the data at our disposal.
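As an illustration of the kind of count model such overdispersed, zero-abundant data call for (not necessarily the exact models used in the thesis), a zero-inflated negative binomial with mixing weight \(\pi\), mean \(\mu\) and dispersion \(r\) has probability mass function

```latex
P(Y = 0) = \pi + (1 - \pi)\left(\frac{r}{r + \mu}\right)^{r}, \qquad
P(Y = y) = (1 - \pi)\,\frac{\Gamma(y + r)}{y!\,\Gamma(r)}
           \left(\frac{r}{r + \mu}\right)^{r}\left(\frac{\mu}{r + \mu}\right)^{y},
           \quad y \ge 1 .
```

The count component has mean \(\mu\) and variance \(\mu + \mu^2/r\), which exceeds the mean (overdispersion), while \(\pi\) supplies the excess zeros.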
16

Kavánková, Iva. "Zálohování dat a datová úložiště." Master's thesis, Vysoké učení technické v Brně. Fakulta podnikatelská, 2021. http://www.nusl.cz/ntk/nusl-444687.

Abstract:
This diploma thesis deals with data backup and subsequent data archiving in the real environment of an IT company engaged in software development. It describes theoretical knowledge concerning data backup and data storage, the current backup situation, and the problems with the existing solution. Suggestions for improving the current situation, including an economic evaluation, are presented in order to achieve efficient and, above all, secure data backup.
16

Russo, A. "DIET-SPECIFIC EPIGENETIC SIGNATURE REVEALED BY H3K4ME3 AND H3K27ME3 DATA ANALYSIS IN C57BL6 MICE." Doctoral thesis, Università degli Studi di Milano, 2016. http://hdl.handle.net/2434/365343.

Abstract:
Increasing evidence demonstrates that adaptation to different environmental conditions is mediated by epigenetic changes, which can participate in cellular processes. In particular, the adaptation to different caloric intakes is of great relevance, as it is crucial for the organism's fitness. Moreover, the phenotypic remodeling induced by different diets determines the susceptibility to life-threatening diseases. For example, the diet enriched in refined sugar, fat and meat that is typical of Western countries is thought to be responsible for about 30-35% of cancer cases, in addition to an increased incidence of type 2 diabetes and cardiovascular diseases. On the other hand, caloric restriction has been shown to be the most powerful way to prolong lifespan and reduce cancer incidence in different experimental models. Based on the hypothesis that epigenetic changes represent the mechanistic link between diet and disease risk, the aim of this work is to investigate chromatin modifications induced by different diets in murine models, in order to identify specific epigenetic profiles associated with fat-enriched diets and caloric restriction. For this purpose, 8-week-old C57BL/6 female mice were divided into three groups and fed for 10 months with three different diets: standard laboratory mouse diet (SD), calorie restriction without malnutrition (CR), and high-fat diet (HF). Livers were then extracted and investigated by chromatin immunoprecipitation (anti-H3K4me3, anti-H3K27me3) and by a transcriptomic approach for gene expression analysis. Despite moderate technical and biological variability, data analysis demonstrated that specific epigenetic profiles were associated with the different diets. In particular, the distribution and frequency of H3K4me3 enabled the clustering of samples by diet group. Moreover, functional annotation of genes showing an increased H3K4me3 signal on their promoter regions for HF or CR with respect to SD revealed significant enrichment of the "Type II diabetes mellitus" pathway, for which obesity represents a critical risk factor, and of the "Circadian rhythm" pathway, which is known to affect longevity. At the mechanistic level, two DNA motifs related to the transcription and chromatin regulators ZSCAN4 and REST/NRSF were found to be enriched within the regulatory regions of the genes of the aforementioned pathways, suggesting that these factors mediate the effects of diet on chromatin and gene expression.
17

Robitaille, Alexis. "Detection and identification of papillomavirus sequences in NGS data of human DNA samples : a bioinformatic approach." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSE1358.

Abstract:
Human papillomaviruses (HPV) are a family of small double-stranded DNA viruses that have a tropism for the mucosal and cutaneous epithelia. More than 200 types of HPV have been discovered so far, and they are classified into several genera based on their DNA sequence. Due to the role of some HPV types in human disease, ranging from benign anogenital warts to cancer, methods to detect and characterize the HPV population in a DNA sample have been developed. These detection methods are needed to clarify the implications of HPV at the various stages of disease. The detection of HPV with targeted wet-lab approaches has traditionally used PCR-based methods coupled with cloning and Sanger sequencing. With the introduction of next-generation sequencing (NGS), these approaches can be improved by integrating the sequencing power of NGS. While computational tools have been developed for metagenomic approaches that search for known or novel viruses in NGS data, no appropriate bioinformatic tool has been available for the classification and identification of novel viral sequences from data produced by amplicon-based methods. In this thesis, we initially describe five fully reconstructed novel HPV genomes detected in skin samples after amplification using degenerate L1 primers. Then, in the second part, we present PVAmpliconFinder, a data analysis workflow designed to rapidly identify and classify known and potentially new Papillomaviridae sequences from NGS amplicon sequencing with degenerate PV primers. This thesis describes the features of PVAmpliconFinder and presents several applications using biological data obtained from amplicon sequencing of human specimens, leading to the identification of new HPV types.
18

Chen, Xi. "Bayesian Integration and Modeling for Next-generation Sequencing Data Analysis." Diss., Virginia Tech, 2016. http://hdl.handle.net/10919/71706.

Abstract:
Computational biology currently faces challenges in a big-data world, with thousands of data samples across multiple disease types including cancer. The challenge is how to extract biologically meaningful information from large-scale genomic data. Next-generation sequencing (NGS) can now produce high-quality data at the DNA and RNA levels. However, in cells there exist many non-specific (background) signals that affect the detection accuracy of true (foreground) signals. In this dissertation work, under a Bayesian framework, we aim to develop and apply approaches that learn the distribution of genomic signals in each type of NGS data for reliable identification of specific foreground signals. We propose a novel Bayesian approach (ChIP-BIT) to reliably detect transcription factor (TF) binding sites (TFBSs) within promoter or enhancer regions by jointly analyzing the sample and input ChIP-seq data for one specific TF. Specifically, a Gaussian mixture model is used to capture both binding and background signals in the sample data, and background signals are modeled by a local Gaussian distribution that is accurately estimated from the input data. An Expectation-Maximization algorithm is used to learn the model parameters according to the distributions of binding signal intensity and binding locations. Extensive simulation studies and experimental validation both demonstrate that ChIP-BIT has significantly improved performance on TFBS detection over conventional methods, particularly on weak binding signal detection. To infer cis-regulatory modules (CRMs) of multiple TFs, we propose a Bayesian integration approach, namely BICORN, to integrate ChIP-seq and RNA-seq data from the same tissue. Each TFBS identified from ChIP-seq data can be either a functional binding event mediating target gene transcription or a non-functional binding. The functional bindings of a set of TFs usually work together as a CRM to regulate the transcription of a group of genes. We develop a Gibbs sampling approach to learn the distribution of CRMs (a joint distribution of multiple TFs) based on their functional bindings and target gene expression. The robustness of BICORN has been validated on simulated regulatory networks and gene expression data under different noise settings. BICORN is further applied to breast cancer MCF-7 ChIP-seq and RNA-seq data to identify CRMs functional in promoter or enhancer regions. In tumor cells, the normal regulatory mechanism may be interrupted by genome mutations, especially somatic mutations that occur uniquely in tumor cells. Focusing on a specific type of genome mutation, structural variation (SV), we develop a novel pattern-based probabilistic approach, namely PSSV, to identify somatic SVs from whole-genome sequencing (WGS) data. PSSV features a mixture model with hidden states representing different mutation patterns; PSSV can thus differentiate heterozygous and homozygous SVs in each sample, enabling the identification of somatic SVs with a heterozygous status in the normal sample and a homozygous status in the tumor sample. Simulation studies demonstrate that PSSV outperforms existing tools. PSSV has been successfully applied to breast cancer patient WGS data to identify somatic SVs of key factors associated with breast cancer development. In this dissertation research, we demonstrate the advantage of the proposed distributional-learning-based approaches over conventional methods for NGS data analysis. Distributional learning is a very powerful approach to gain biological insights from high-quality NGS data. Successful applications of the proposed Bayesian methods to breast cancer NGS data shed light on the underlying molecular mechanisms of breast cancer, enabling biologists and clinicians to identify major cancer drivers and to develop new therapeutics for cancer treatment.
Ph. D.
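The two-component Gaussian mixture at the heart of the ChIP-BIT model lends itself to a compact illustration. The sketch below is not the author's implementation: it is a minimal NumPy EM loop that separates a background from a binding component in toy intensity values, with all data, initialisation and iteration choices invented for the example.

```python
import numpy as np

def em_two_gaussians(x, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture with EM.
    Component 0 ~ background, component 1 ~ binding signal."""
    # crude initialisation from the data quantiles
    mu = np.percentile(x, [25, 75]).astype(float)
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point
        dens = np.stack([
            pi[k] / (sigma[k] * np.sqrt(2 * np.pi))
            * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2)
            for k in range(2)
        ])
        resp = dens / dens.sum(axis=0)
        # M-step: re-estimate weights, means and standard deviations
        nk = resp.sum(axis=1)
        pi = nk / len(x)
        mu = (resp * x).sum(axis=1) / nk
        sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
    return pi, mu, sigma

# toy data: abundant background noise plus a smaller, stronger binding component
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(1, 0.5, 800), rng.normal(4, 1.0, 200)])
print(em_two_gaussians(x))
```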
APA, Harvard, Vancouver, ISO, and other styles
20

Favero, Francesco. "Development of two new approaches for NGS data analysis of DNA and RNA molecules and their application in clinical and research fields." Doctoral thesis, Università del Piemonte Orientale, 2019. http://hdl.handle.net/11579/102446.

Full text
Abstract:
The aim of this study is focused on two main areas of NGS data analysis: RNA-seq (with a specific interest in meta-transcriptomics) and the detection of DNA somatic mutations. We developed a simple and efficient pipeline for the analysis of NGS data derived from gene panels to identify DNA somatic point mutations. In particular, we optimized a somatic variant calling procedure that was tested on simulated datasets and on real data. The performance of our system was compared with currently available variant calling tools reviewed in the literature. For RNA-seq analysis, in this work we tested and optimized STAble, an algorithm originally developed in our laboratory for the de novo reconstruction of transcripts from non-reference-based RNA-seq data. At the beginning of this study, the first module of STAble, which reconstructs a list of transcripts starting from RNA-seq data, had already been written. The aim of this study, in particular, was to add a new module to STAble, developed in collaboration with Cambridge University and based on flux-balance analysis, in order to link metatranscriptomic analysis to a metabolic approach. This goal was achieved in order to study the metabolic fluxes of microbiota starting from metatranscriptomic data.
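As a rough illustration of the kind of tumour-versus-normal test a somatic variant caller applies at each candidate site (the thesis does not detail its actual filters, so this is a hedged stand-in, not its pipeline), the sketch below flags a candidate somatic SNV from hypothetical allele counts with Fisher's exact test.

```python
from scipy.stats import fisher_exact

def somatic_call(ref_t, alt_t, ref_n, alt_n, max_p=0.01, min_vaf=0.05):
    """Flag a candidate somatic SNV by comparing tumour and normal
    allele counts at one position with Fisher's exact test."""
    vaf_tumour = alt_t / max(ref_t + alt_t, 1)
    vaf_normal = alt_n / max(ref_n + alt_n, 1)
    _, p = fisher_exact([[ref_t, alt_t], [ref_n, alt_n]])
    # somatic if the counts differ significantly, the tumour VAF is high
    # enough, and the normal sample is essentially free of the allele
    return p < max_p and vaf_tumour >= min_vaf and vaf_normal < 0.02

# toy counts: 60 ref / 15 alt reads in tumour, 80 ref / 0 alt in normal
print(somatic_call(60, 15, 80, 0))   # -> True
```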
APA, Harvard, Vancouver, ISO, and other styles
21

Wan, Mohamad Nazarie Wan Fahmi Bin. "Network-based visualisation and analysis of next-generation sequencing (NGS) data." Thesis, University of Edinburgh, 2017. http://hdl.handle.net/1842/28923.

Full text
Abstract:
Next-generation sequencing (NGS) technologies have revolutionised research into the nature and diversity of genomes and transcriptomes. Since the initial description of these technology platforms over a decade ago, massively parallel RNA sequencing (RNA-seq) has driven many advances in the characterisation and quantification of transcriptomes. RNA-seq is a powerful gene expression profiling technology enabling transcript discovery and providing a far more precise measure of the levels of transcripts and their isoforms than other methods, e.g. microarrays. However, the analysis of RNA-seq data remains a significant challenge for many biologists. The data generated are large, and the tools for their assembly, analysis and visualisation are still under development. Assemblies of reads can be inspected using tools such as the Integrative Genomics Viewer (IGV), where visualisation of results involves 'stacking' the reads onto a reference genome. Whilst sufficient for many needs, when the underlying variance of the genome or transcript assemblies is complex this visualisation method can be limiting; errors in assembly can be difficult to spot and visualisation of splicing events may be challenging. Data visualisation is increasingly recognised as an essential component of genomic and transcriptomic data analysis, enabling large and complex datasets to be better understood. An approach that has been gaining traction in biological research is the application of network visualisation and analysis methods. Networks consist of nodes connected by edges (lines), where nodes usually represent an entity and edges a relationship between them. These are now widely used for plotting experimentally or computationally derived relationships between genes and proteins. The overall aim of this PhD project was to explore the use of network-based visualisation in the analysis and interpretation of RNA-seq data. In chapter 2, I describe the development of a data pipeline designed to go from 'raw' RNA-seq data to a file format that supports data visualisation as a 'DNA assembly graph'. In DNA assembly graphs, nodes represent sequence reads and edges denote a homology between reads above a defined threshold. Following the mapping of reads to a reference sequence and the determination of which reads map to a given locus, pairwise sequence alignments are performed between reads using MegaBLAST. This provides a weighted similarity score that is used to define edges between reads. Visualisation of the resulting networks is then carried out using BioLayout Express3D, which can render large networks in 3-D, thereby allowing a better appreciation of the often-complex network structure. This pipeline has formed the basis for my subsequent work on exploring and analysing alternative splicing in human RNA-seq data. In the second half of this chapter, I provide a series of tutorials aimed at different types of users, allowing them to perform such analyses. The first tutorial is aimed at computational novices who might want to generate networks using a web browser and pre-prepared data. The other tutorials are designed for more advanced users, who can access the code for the pipeline through GitHub or via an Amazon Machine Image (AMI). In chapter 3, the utility of network-based visualisations of RNA-seq data is explored using data processed through the pipeline described in chapter 2.
The aim of the work described in this chapter was to better understand the basic principles and challenges associated with network visualisation of RNA-seq data, in particular how it could be used to visualise transcript structure and splice variation. These analyses were performed on data generated from four samples of human fibroblasts taken at different time points during their entry into cell division. One of the first challenges encountered was that the existing network layout algorithm (Fruchterman-Reingold) implemented within BioLayout Express3D did not result in an optimal layout of the unusual graph structures produced by these analyses. Following the implementation of the more advanced layout algorithm FMMM within the tool, network structure could be far better appreciated. Using this layout method, the majority of genes sequenced to an adequate depth assemble into networks with a linear 'corkscrew' appearance and, when representing single-isoform transcripts, add little to existing views of these data. However, in a small number of cases (~5%), the networks generated from transcripts expressed in human fibroblasts possess more complex structures, with 'loops', 'knots' and multiple ends being observed. In the majority of cases examined, these loops were associated with alternative splicing events, a fact confirmed by RT-PCR analyses. Other DNA assembly networks, such as those representing the mRNA of MKI67, showed knot-like structures, which were found to be due to the presence of repetitive sequence within an exon of the gene. In another case, CENPO, the unusual structure observed was due to reads derived from the overlapping ADCY3 gene on the opposite strand being wrongly mapped to CENPO. Finally, I explored the use of a network reduction strategy as an approach to visualising highly expressed genes such as GAPDH and TUBA1C. Having successfully demonstrated the utility of networks in analysing transcript isoforms in data derived from a single cell type, I set out to explore their utility in analysing transcript variation in tissue data, where multiple isoforms expressed by different cells within the tissue might be present in a given sample. In chapter 4, I explore the analysis of transcript variation in an RNA-seq dataset derived from human tissue. The first half of this chapter describes the quality control of these data, again using a network-based approach but this time based on the correlation in expression between genes and samples. Of the 95 samples derived from 27 human tissues, 77 passed the quality control. A network was constructed using a correlation threshold of r ≥ 0.9, comprising 6,109 nodes (genes) and 1,091,477 edges (correlations), and was then clustered. Subsequently, the profile and gene content of each cluster were examined and the enrichment of GO terms analysed. In the second half of this chapter, the aim was to detect and analyse alternative splicing events between different tissues using the rMATS tool. Using a false-discovery rate (FDR) cut-off of < 0.01, comparisons of brain vs. heart, brain vs. liver and heart vs. liver reported 4,992, 4,804 and 3,990 splicing events, respectively. Of these, only 78 splicing events (in 52 genes) showed an exon inclusion level above 50% and an expression level above 30 FPKM. To further explore the sometimes complex structure of transcript diversity derived from tissue, RNA-seq assembly networks for KLC1, SORBS2, GUK1 and TPM1 were examined.
Each of these networks showed a different type of alternative splicing event, and some of these would have been difficult to resolve between tissues using other approaches. For instance, visualising the read assembly of long genes such as KLC1 and SORBS2 with Sashimi plots, or even Vials, is problematic simply because of the number of exons and the size of their genomic loci. In the case of GUK1, tissue-specific isoform expression was observed when the networks of three tissues were combined. Arguably the most complex analysis was the network of TPM1, where a uniquification step was employed for this highly expressed gene. In chapter 5, I performed usability testing of the NGS Graph Generator web application and of visualising RNA-seq assemblies as networks using BioLayout Express3D. This testing was important to ensure that the application would be well received and used.
Almost all participants in this usability test agreed that the application would encourage biologists to visualise and understand alternative splicing alongside existing tools. The participants also agreed that Sashimi plots are rather difficult to read and that interesting features might be missed. However, the reviews also identified needed improvements, such as the capability to analyse large networks in a short time and side-by-side analysis of networks with Sashimi plots and Ensembl; additional annotation of the networks would also be necessary to improve the understanding of alternative splicing. In conclusion, this work demonstrates the utility of network visualisation of RNA-seq data, where the unusual structure of these networks can be used to identify issues in assembly, repetitive sequences within transcripts and splice variation. As such, this approach has the potential to significantly improve our understanding of transcript complexity. Overall, this thesis demonstrates that network-based visualisation provides a new and complementary approach to characterising alternative splicing from RNA-seq data and has the potential to be useful for the analysis and interpretation of other kinds of sequencing data.
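The core construction of the thesis (nodes are reads, edges are above-threshold pairwise similarities) can be sketched in a few lines. The example below is illustrative only: it replaces MegaBLAST with a naive percent-identity function over toy reads and uses networkx in place of BioLayout Express3D.

```python
import itertools
import networkx as nx

def identity(a, b):
    """Crude percent identity of two reads, a stand-in for a
    MegaBLAST similarity score."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / max(len(a), len(b))

reads = {
    "r1": "ACGTACGTACGT",
    "r2": "ACGTACGAACGT",   # one mismatch against r1
    "r3": "TTTTGGGGCCCC",   # unrelated read
}

g = nx.Graph()
g.add_nodes_from(reads)
for (n1, s1), (n2, s2) in itertools.combinations(reads.items(), 2):
    score = identity(s1, s2)
    if score >= 90.0:            # keep only strong homologies as edges
        g.add_edge(n1, n2, weight=score)

print(g.edges(data=True))        # r1-r2 survives the threshold, r3 stays isolated
```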
APA, Harvard, Vancouver, ISO, and other styles
22

Dwivedi, Ankit. "Functional analysis of genomic variations associated with emerging artemisinin resistant P. falciparum parasite populations and human infecting piroplasmida B. microti." Thesis, Montpellier, 2016. http://www.theses.fr/2016MONTT073/document.

Full text
Abstract:
The WHO malaria elimination programme is threatened by the emergence and potential spread of Plasmodium falciparum parasites resistant to artemisinin. It has recently been shown that (a) SNPs in a region of chromosome 13 have been under strong recent positive selection in Cambodia, (b) several artemisinin-resistant and artemisinin-sensitive P. falciparum subpopulations are present in Cambodia, (c) mutations in the Kelch domain of the k13 gene are major determinants of artemisinin resistance in the Cambodian parasite population, and (d) parasite subpopulations in northern Cambodia near Thailand and Laos are resistant to mefloquine and carry the R539T allele of the k13 gene. It is therefore necessary to identify the genetic basis of resistance in order to monitor and control the transmission of resistant parasites to the rest of the world, to understand parasite metabolism and to support the development of new drugs. This work focused on characterizing the structure of the P. falciparum population in Cambodia and describing the metabolic properties of the subpopulations present, as well as the gene flow between them. The aim is to identify the genetic bases associated with the transmission and acquisition of artemisinin resistance in the country. The first approach, based on a barcode, was developed to identify subpopulations using a small number of loci. A multiplexed PCR-LDR-FMA molecular approach based on LUMINEX technology was developed to identify SNPs in 537 blood samples (2010-2011) from 16 health centres in Cambodia. The presence of subpopulations along the country's borders was established through the analysis of 282 samples. Gene flow was described from the 11 loci of the barcode. The barcode makes it possible to identify the recently emerged parasite subpopulations associated with artemisinin and mefloquine resistance. The second approach, characterizing the structure of the P. falciparum population in Cambodia, was based on the analysis of 167 parasite genomes (NGS data from 2008 to 2011) originating from four locations in Cambodia and retrieved from the ENA database. Eight parasite subpopulations were described from a set of 21,257 SNPs characterized in this study. The presence of admixed parasite subpopulations appears to be a major risk for the transmission of artemisinin resistance. Functional analysis showed a common genetic background among the isolates of the resistant populations and confirmed the importance of the PI3K pathway in the acquisition of resistance, by helping the parasite to remain in the ring stage. Our results question the origin and persistence of P. falciparum subpopulations in Cambodia, provide evidence of gene flow between subpopulations and describe a model of artemisinin resistance acquisition. The reliable SNP identification process was then applied to the genome of Babesia microti. This parasite is responsible for human babesiosis (a malaria-like syndrome) and is endemic in the north-eastern United States. The objective was to validate the taxonomic position of B. microti as an outgroup to the piroplasms and to improve the functional annotation of the genome by including genetic variability, gene expression and the antigenic capacity of proteins. We thereby identified new proteins involved in host-parasite interactions.
The ongoing WHO malaria elimination program is threatened by the emergence and potential spread of artemisinin-resistant Plasmodium falciparum parasites. Recent reports have shown (a) SNPs in a region of chromosome 13 to be under strong recent positive selection in Cambodia, (b) the presence of artemisinin-resistant and -sensitive P. falciparum subpopulations in Cambodia, (c) evidence that mutations in the Kelch propeller domain of the k13 gene are major determinants of artemisinin resistance in the Cambodian parasite population, and (d) parasite subpopulations in northern Cambodia near Thailand and Laos with mefloquine resistance and carrying the R539T allele of the k13 gene. Identifying the genetic basis of resistance is important to monitor and control the transmission of resistant parasites and to understand parasite metabolism for the development of new drugs. This thesis focuses on the analysis of P. falciparum population structure in Cambodia and the description of the metabolic properties of these subpopulations and the gene flow among them. This could help in identifying the genetic evidence associated with the transmission and acquisition of artemisinin resistance across the country. First, a barcode approach was used to identify parasite subpopulations using a small number of loci. A mid-throughput PCR-LDR-FMA approach based on LUMINEX technology was used to screen for SNPs in 537 blood samples (2010-2011) from 16 health centres in Cambodia. Based on the successful typing of 282 samples, subpopulations were characterized along the borders of the country. Gene flow was described based on the gradient of alleles at the 11 loci in the barcode. The barcode successfully identifies recently emerging parasite subpopulations associated with artemisinin and mefloquine resistance. In the second approach, the parasite population structure was defined based on 167 parasite NGS genomes (2008-2011) originating from four locations in Cambodia, recovered from the ENA database. Based on the calling of 21,257 SNPs, eight parasite subpopulations were described. The presence of an admixed parasite subpopulation could be supporting artemisinin resistance transmission. Functional analysis based on significant genes validated a similar background for resistant isolates and revealed the PI3K pathway in resistant populations, supporting acquisition of resistance by assisting the parasite in the ring-stage form. Our findings question the origin and persistence of the P. falciparum subpopulations in Cambodia, provide evidence of gene flow among subpopulations and describe a model of artemisinin resistance acquisition. The variant calling approach was also implemented on the Babesia microti genome. This parasite causes a malaria-like syndrome and is endemic in the north-eastern USA. The objective was to validate the taxonomic position of B. microti as an outgroup among piroplasmida and to improve the functional genome annotation based on genetic variation, gene expression and protein antigenicity. We identified new proteins involved in parasite-host interactions.
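A minimal sketch of how an 11-locus SNP barcode can be used to assign a sample to a subpopulation: the allele frequencies and the barcode below are invented, and the likelihood model (independent loci) is an assumption for illustration, not the method used in the thesis.

```python
import numpy as np

# hypothetical reference allele frequencies of the 11 barcode loci
# in two candidate subpopulations (rows); illustrative values only
freqs = np.array([
    [0.9, 0.8, 0.1, 0.2, 0.9, 0.7, 0.1, 0.3, 0.8, 0.9, 0.2],
    [0.1, 0.2, 0.9, 0.8, 0.1, 0.3, 0.9, 0.7, 0.2, 0.1, 0.8],
])

def assign(barcode):
    """Assign a 0/1 barcode to the subpopulation with the highest
    log-likelihood, assuming independent loci."""
    eps = 1e-6
    p = np.clip(freqs, eps, 1 - eps)
    ll = (barcode * np.log(p) + (1 - barcode) * np.log(1 - p)).sum(axis=1)
    return int(np.argmax(ll)), ll

barcode = np.array([1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0])
print(assign(barcode))   # expected: subpopulation 0
```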
APA, Harvard, Vancouver, ISO, and other styles
23

Batra, Rajbir Nath. "Decoding the regulatory role and epiclonal dynamics of DNA methylation in 1482 breast tumours." Thesis, University of Cambridge, 2018. https://www.repository.cam.ac.uk/handle/1810/274923.

Full text
Abstract:
Breast cancer is a clinically and molecularly heterogeneous disease displaying distinct therapeutic responses. Although recent studies have explored the genomic and transcriptomic landscapes of breast cancer, the epigenetic architecture has received less attention. To address this, an optimised Reduced Representation Bisulfite Sequencing protocol was applied to 1482 primary breast tumours (and 237 matched adjacent normal tissues). This constitutes the largest breast cancer methylome study yet, and this thesis describes its bioinformatic and statistical analysis. Noticeable epigenetic drift (both gain and loss of homogeneous DNA methylation patterns) was observed in breast tumours when compared to normal tissues, with markedly higher differences in late-replicating genomic regions. The extent of epigenetic drift was also found to be highly heterogeneous between the breast tumours and was sharply correlated with the tumour's mitotic index, indicating that epigenetic drift is largely a consequence of the accumulation of passive cell-division-related errors. A novel algorithm called DMARC (Directed Methylation Altered Regions in Cancer) was developed that utilises the tumour-specific drift rates to discriminate between methylation alterations attained as a consequence of stochastic cell division errors (background) and those reflecting a more instructive biological process (directed). Directed methylation alterations were significantly enriched for gene expression changes in breast cancer compared to background alterations. Characterising these methylation aberrations together with gene expression led to the identification of breast cancer subtype-specific epigenetic genes with consequences for transcription and prognosis. Cancer genes may be deregulated by multiple mechanisms. By integrating with existing copy number and gene expression profiles for these tumours, DNA methylation alterations were revealed as the predominant mechanism correlated with differentially expressed genes in breast cancer. The crucial role of DNA methylation as a mechanism targeting the silencing of specific genes within copy number amplifications is also explored, which led to the identification of a putative tumour suppressor gene, TSHZ2. Finally, the first genome-wide assessment of epigenomic evolution in breast cancer is conducted. Both the level of intratumoural heterogeneity and the extent of epiallelic burden were found to be prognostic, revealing an extraordinary distinction in the role of epiclonal dynamics in different breast cancer subtypes. Collectively, the results presented in this thesis shed light on the somatic DNA methylation basis of inter-patient as well as intra-tumour heterogeneity in breast cancer. This complements our genetic knowledge of the disease and will help move us towards tailoring treatments to the patient's molecular profile.
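The background-versus-directed distinction drawn by DMARC can be caricatured with a one-test sketch: given a tumour-specific drift (background) rate, a region whose alteration count significantly exceeds it is called directed. This is a hypothetical simplification, not the published algorithm; the drift rate, region size and threshold are all invented.

```python
from scipy.stats import binomtest

def classify_region(altered_cpgs, total_cpgs, drift_rate, alpha=0.01):
    """Call a region 'directed' if its methylation alterations exceed
    what the tumour-specific background drift rate would produce."""
    test = binomtest(altered_cpgs, total_cpgs, drift_rate,
                     alternative="greater")
    return "directed" if test.pvalue < alpha else "background"

# toy region: 18 of 25 CpGs altered in a tumour drifting at a 20% rate
print(classify_region(18, 25, 0.20))   # -> directed
```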
APA, Harvard, Vancouver, ISO, and other styles
24

Schipani, Angela <1994>. "Comprehensive characterization of SDH-deficient GIST using NGS data and iPSC models." Doctoral thesis, Alma Mater Studiorum - Università di Bologna, 2022. http://amsdottorato.unibo.it/10190/1/Schipani_Angela_thesis.pdf.

Full text
Abstract:
Gastrointestinal stromal tumors (GIST) are the most common mesenchymal tumors of the gastrointestinal tract, arising from the interstitial cells of Cajal (ICCs) or their precursors. The vast majority of GISTs (75-85%) harbor KIT or PDGFRA mutations. A small percentage of GISTs (about 10-15%) do not harbor any of these driver mutations and have historically been called wild-type (WT). Among them, from 20% to 40% show loss of function of the succinate dehydrogenase complex (SDH) and are also defined as SDH-deficient GISTs. SDH-deficient GISTs display distinctive clinical and pathological features, and can be sporadic or associated with the Carney triad or the Carney-Stratakis syndrome. These tumors arise most frequently in the stomach, with a predilection for the distal stomach and antrum, have a multi-nodular growth, display a histological epithelioid phenotype, and present frequent lympho-vascular invasion. The occurrence of lymph node metastases and an indolent course are representative features of SDH-deficient GISTs. This subset of GIST is known for the immunohistochemical loss of succinate dehydrogenase subunit B (SDHB), which signals the loss of function of the entire SDH complex. The overall aim of my PhD project is the comprehensive characterization of SDH-deficient GIST. Throughout the project, clinical, molecular and cellular characterizations were performed using next-generation sequencing (NGS) technologies, which have the potential to allow the identification of molecular patterns useful for diagnosis and for the development of novel treatments. Moreover, while there are many different cell lines and preclinical models of KIT/PDGFRA-mutant GIST, no reliable cell model of SDH-deficient GIST has been developed so far that could be used for studies of tumor evolution and for in vitro assessment of drug response. Therefore, another aim of this project was to develop a preclinical model of SDH-deficient GIST using the novel technology of induced pluripotent stem cells (iPSCs).
APA, Harvard, Vancouver, ISO, and other styles
25

Caniato, Elisa. "Development and Application of New Strategies for Genome Scaffolding and Gene Predictio applied to NGS data." Doctoral thesis, Università degli studi di Padova, 2011. http://hdl.handle.net/11577/3422022.

Full text
Abstract:
Next Generation Sequencing (NGS) technologies have had a great impact at both the economic and the research level, increasing data production and reducing costs. These techniques allow the sequencing of thousands of genomes, from humans to microbes, and they open entirely new areas of biological inquiry, including the investigation of ancient genomes and human disease, the characterization of ecological diversity, and the identification of unknown etiological agents. The application field can be divided into three main areas: genomic tasks (genome assembly, SNPs and structural variations), transcriptome analysis (gene prediction and annotation, alternative splicing discovery) and epigenetic problems. The new technologies also pose challenges in experimental design, data management and analysis. In particular, it is desirable for analysis to keep pace with data production, and thus new bioinformatics tools are being developed. Three platforms for DNA sequencing read production are in reasonably widespread use: the Roche/454, the Illumina/Solexa Genome Analyzer and the Applied Biosystems SOLiD System. The Roche/454 was the first to achieve commercial introduction (in 2005) and it uses an innovative sequencing technology known as pyrosequencing. It produces sequences of 300-400 bases, longer than Illumina/Solexa (about 70 bases) and SOLiD/Applied Biosystems (about 50 bases), but with a lower throughput. During my Ph.D., Next Generation Sequencing became widespread practice, and the aim of my research was the development of new, appropriate tools: useful programs able to quickly transform the large amount of raw data produced into information useful for biological tasks. With my research and my algorithms, I contributed to the development and solution of two of the most challenging and studied applications: genome assembly and gene prediction. De novo sequencing is the starting point of any possible genetic analysis, with the creation of the original genomic sequence. This explains why de novo sequencing and genome assembly are very important and widely studied problems. With Next Generation Sequencing the task has become even more challenging: the reduced time and cost allow even large genomes to be sequenced. When I started my Ph.D., there were only a few programs able to perform de novo assembly, and among these the most used were Newbler, Velvet and Cabog. My aim was to improve the current state of the art by developing a new assembly tool that uses the strengths and overcomes the weaknesses of the cited programs. Moreover, the new program should be able to work with any kind of data (Next Generation Sequencing and other available evidence) and to produce a well-defined genome assembly. Many efficient assemblers have already been implemented, but almost all of them produce only unconnected fragments (contigs) of the original genome. In many cases, they are not able to carry out the final scaffolding: a set of well-ordered and oriented contigs. Only a few of them perform this task, which is needed to move towards the finishing of the assembly. The idea is to work in this direction: the development of a platform able to correctly order and orient a set of contigs, connected among themselves through mate-pairs, into scaffolds. The tool should be able to check the consistency of the starting assembly and correct link errors, to produce a genome sequence and reduce the background noise.
The best strategy is to create contigs using Roche/454 reads and the Newbler assembler, and mate-pair reads with Illumina or SOLiD. Gene prediction is a well-studied and well-known problem. Over the years, many programs have been developed (Jigsaw, GeneID, GeneSplicer, Genscan, Glimmer, SNAP, TigrScan, Twinscan, ...), and the results achieved allow nearly all genes to be predicted with high specificity and sensitivity. After an accurate analysis, I found that a common weak point of all these programs is the requirement of a starting training set from which to learn the rules of the organism's gene structure, used for the subsequent prediction. Unfortunately, very often this set is not available and it is necessary to create a new one, using information coming from similar organisms or from other sources of evidence (ESTs, proteins, ...). My idea was to use Next Generation Sequencing data to create a starting set of reliable genes: sequencing the transcriptome, aligning the produced reads on the genome sequence and identifying the exons and introns to reconstruct the gene structure.
The commercialization of the new sequencing technologies (NGS, Next Generation Sequencing) has had a great impact both economically and biologically, thanks to the significant reduction in production times and costs and to the increase in the amount of data obtained. The new sequencing techniques have made it possible to reconstruct the genomes of thousands of organisms, from small ones such as microbes to large ones such as the human genome, opening new areas of research. For example, it is now possible to study ancient DNA, investigate genetic diseases, and study evolutionary features and differences between organisms. The new methods can be applied to three main fields: genomics (such as genome assembly and the search for SNPs and structural variations), transcriptomics (gene prediction, annotation and the study of alternative splicing) and epigenetics. Next-generation sequencers have also brought changes at the bioinformatics level. With the acquisition of ever larger amounts of data, it became necessary to address the problem of their management, both in terms of the computational time needed to analyse them and of the memory required to store them. It also became necessary to implement tools capable of processing the raw data obtained, turning them into useful information for biological analyses. Currently the three most used sequencing platforms are the Roche/454, the Illumina/Solexa Genome Analyzer and the Applied Biosystems SOLiD. The first sequencer to be commercialized, in 2005, was the 454. It is based on an innovative sequencing technique (pyrosequencing) and produces sequences 300-400 bases long with good average quality. However, the 454 does not reach the production levels of other sequencers, such as SOLiD and Illumina, which in a short time can produce millions of sequences, though of smaller size (about 50 and 70 bases for SOLiD and Illumina, respectively). The idea of my PhD was to apply mathematical and computational knowledge to the study of new techniques for using next-generation data in biological problems, with the goal of developing programs capable of processing large amounts of data in a short time. With my research I contributed to the implementation of methods for solving assembly and gene prediction problems. De novo sequencing and the subsequent assembly are a fundamental starting point for the analysis of an organism's genome. The assembly problem is still open and widely studied: no program is yet able to reconstruct a complete genome starting from reads produced by next-generation sequencing. Software such as Newbler, Velvet and Cabog produce long sequence fragments (contigs), but these are disjoint and their correct position within the original genome is unknown. Most programs lack a scaffolding and finishing phase, in which all the fragments produced by the assembly are ordered and oriented, creating scaffolds. My goal was to realize a scaffolding and analysis method, Consort, for the improvement of the obtained assembly. The program takes as input a set of contigs produced by assembling 454 reads with Newbler, and a set of mate-pairs generated with SOLiD.
Gene prediction was my second research area. It is a well-studied problem, and over the years many programs have been developed to efficiently predict the genes contained in a genome; among the most used and best known are Jigsaw, GeneID, GeneSplicer, Genscan, Glimmer, SNAP, TigrScan and Twinscan. Most of these tools require a training set from which to learn the features used for the subsequent prediction, and such sets are very often unavailable and must be created from similar genomes. This solution is not always applicable, even though it often works well and yields good results: if the organism under study is new and no sufficiently close organisms are known, the required data may simply not exist. My research in this area concerned the development of a method to create a training set starting from transcriptome sequences of the same organism. The idea is to align the reads produced to the genome and to extract all the identified regions, which are potential genes. The implemented algorithm showed that reliable datasets can be obtained with this technique. However, the method is prone to predicting many false positives because of the high background noise. To avoid creating an unreliable training set, it is preferable to be very stringent in the gene selection criteria.
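A minimal sketch of the orientation step of a scaffolder such as the one described above: mate-pair links vote on the relative orientation of contig pairs, and a BFS propagates orientations and flags inconsistent links. The link list is invented, and the real Consort procedure is certainly more elaborate.

```python
from collections import deque

# hypothetical mate-pair links: (contig_a, contig_b, same_orientation)
links = [("c1", "c2", True), ("c2", "c3", False), ("c3", "c4", True)]

def orient_contigs(links):
    """Propagate contig orientations over mate-pair links with a BFS:
    True = forward, False = reverse relative to the seed contig."""
    graph = {}
    for a, b, same in links:
        graph.setdefault(a, []).append((b, same))
        graph.setdefault(b, []).append((a, same))
    orientation, queue = {}, deque()
    for seed in graph:
        if seed in orientation:
            continue
        orientation[seed] = True
        queue.append(seed)
        while queue:
            node = queue.popleft()
            for nbr, same in graph[node]:
                expected = orientation[node] if same else not orientation[node]
                if nbr not in orientation:
                    orientation[nbr] = expected
                    queue.append(nbr)
                elif orientation[nbr] != expected:
                    # conflicting evidence: a candidate mis-join to inspect
                    print("inconsistent link:", node, nbr)
    return orientation

print(orient_contigs(links))
# {'c1': True, 'c2': True, 'c3': False, 'c4': False}
```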
APA, Harvard, Vancouver, ISO, and other styles
26

SANDIONIGI, ANNA. "Biodiversity in the era of big data. On the problem of taxonomy assignment and the distribution of diversity in complex biological systems." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2014. http://hdl.handle.net/10281/81694.

Full text
Abstract:
The study of complex biological matrices is a remarkably hot topic in biology. Soil, water and gut content are some of these matrices, characterized by a prominent number of organisms living in tight connection. Hundreds or thousands of species and/or strains can be present in the same sample, coming from different habitats (e.g. the soil ecosystem) and showing inter-relationships, mainly energetic, that guarantee the healthy functioning of their ecosystem. The discrimination and/or identification of the different biological entities, at least for the eukaryotic components, using traditional morphological approaches is relatively complicated, requiring a specialist in each taxonomic group and generally a long time to achieve a correct identification and classification. To overcome these limitations, molecular approaches have proven to be valid alternatives, with PCR, cloning, DNA sequencing and bioinformatics analysis of sequence differences used as the standard protocol. Nowadays, the massive genomic sequencing revolution, generated by the heterogeneous techniques collectively known as Next Generation Sequencing (NGS), has become the new gold standard. The present thesis consists of four sections which cover detailed aspects of the analysis of biodiversity and the issues associated with it, and elaborate on both the promises and pitfalls of coupling the DNA barcoding approach with high-throughput pyrosequencing in two different cases of biodiversity assessment. In the following, a brief description of each section is provided. Section 1: Introduction to the biodiversity analysis problem. In this section the main methods used to investigate biodiversity and their related problems are addressed. Emphasis has been put on the integrated approach combining classic DNA barcoding with the high processivity guaranteed by next generation sequencing technologies. Furthermore, the state of the art regarding bioinformatics methods for species assignment and the elaboration of biodiversity patterns, including phylogenetic diversity analysis, is described. Section 2: Targeted sequencing of metazoan communities. In this section, the precision and accuracy of the denoising procedure and the candidate parameters able to reduce the sequence error rate are investigated. This work also proposes an innovative taxon assignment pipeline. In addition, a novel library preparation method allowing the sequencing of the entire coxI barcode region (approximately 700 bp) on the 454 pyrosequencing platform (Roche Life Science) is proposed. To address these objectives, metazoan communities coming from a complex environmental matrix (soil) were considered. Section 3: Microbiota invasion mediated by Varroa destructor in Apis mellifera. The starting hypothesis of this section is that varroa mites play a fundamental role in the alteration of the bacterial composition of honey bee larvae, acting not only as a vector but also as a sort of open "door" through which exogenous bacteria alter the mechanisms of primary succession in the "simple" honey bee larval microbiome. To explore these dynamics, a classical microbial community analysis approach and a new approach considering phylogenetic entropy as a measure of biodiversity were tested. The varroa and honey bee bacterial communities were studied through a barcoded amplicon pyrosequencing method, taking advantage of NGS methods and the opportunity they offer to detect uncultured and uncultivable bacteria.
Section 4: General conclusions and perspectives. The general conclusions and future promise highlighted by the above-mentioned experiments are illustrated in this section.
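Section 3 uses phylogenetic entropy as a biodiversity measure; as a minimal, non-phylogenetic counterpart of that idea, the sketch below computes the plain Shannon entropy of read counts assigned to taxa. The OTU counts are invented for the example.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in nats) of read counts per taxon: the simple,
    non-phylogenetic end of the diversity measures discussed above."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

# toy OTU table: reads assigned to four taxa in one soil sample
print(shannon_entropy([500, 300, 150, 50]))   # ~1.14 nats
```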
APA, Harvard, Vancouver, ISO, and other styles
27

Trebulová, Debora. "Zálohování dat a datová úložiště." Master's thesis, Vysoké učení technické v Brně. Fakulta podnikatelská, 2017. http://www.nusl.cz/ntk/nusl-318599.

Full text
Abstract:
This diploma thesis focuses on ways of backing up data and their practical use in a specific proposal for Transroute Group s.r.o. The introductory part presents the theoretical knowledge on this issue. The next part of the thesis analyses the current state of backup in the company. This is followed by a chapter in which several solutions are presented, each with its financial evaluation. The final part covers the choice of a specific solution and a time estimate for its implementation.
APA, Harvard, Vancouver, ISO, and other styles
28

BERETTA, STEFANO. "Algorithms for next generation sequencing data analysis." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2013. http://hdl.handle.net/10281/42355.

Full text
Abstract:
Two of the main bioinformatics fields that have been influenced by the introduction of Next-Generation Sequencing (NGS) techniques are transcriptomics and metagenomics. The adoption of these new methods to sequence DNA/RNA molecules has drastically changed both the kind and the amount of data produced, with the effect that algorithms and tools developed for traditional data cannot be applied to NGS data. For this reason, in this thesis we face two central problems in these two fields. The first regards the characterization of Alternative Splicing (AS) events starting from NGS sequences coming from transcripts (called RNA-Seq reads). To this aim we have modeled the structure of a gene, with respect to the AS variations occurring in it, using a graph representation (the splicing graph). More specifically, we have identified the conditions for the correct reconstruction of the splicing graph starting from RNA-Seq data, and we have realized an algorithm for its construction. Moreover, our method is able to correctly reconstruct the splicing graph even when the input RNA-Seq reads do not respect the identified conditions. Finally, we have performed an experimental analysis of our procedure in order to validate the obtained results. The second problem we face in this thesis is the assignment of NGS reads, coming from a metagenomic sample, to a reference taxonomic tree, in order to assess the composition of the sample and classify the unknown micro-organisms in it. This is done by aligning the reads to the taxonomic tree and then choosing (when there are multiple valid matches) the node that best represents the read. This choice is based on the computation of a Penalty Score (PS) function for all the nodes descending from the lowest common ancestor of the valid matches in the tree. We have realized an optimal algorithm for the computation of the PS function, based on the so-called skeleton tree, which improves the performance of the taxonomic assignment procedure. We have also implemented the method using more efficient data structures than those used in the previous version of the procedure. Finally, we have offered the possibility to switch among different taxonomies by developing a method to map trees and translate the input alignments.
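The penalty-score idea for taxonomic assignment can be sketched naively: among the ancestors of a read's valid matches, pick the node that best trades off missed matches against over-generality. The toy taxonomy, the penalty weights and the quadratic implementation below are assumptions for illustration; the thesis computes this optimally via the skeleton tree.

```python
# hypothetical toy taxonomy, child -> parent (None marks the root):
# family F with genera G1 (species B1, B2) and G2 (species B3)
parent = {"B1": "G1", "B2": "G1", "B3": "G2", "G1": "F", "G2": "F", "F": None}

def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path

def assign_read(matches):
    """Choose the node with the lowest toy penalty score: missed
    matches (too specific) plus a weighted term for being too generic."""
    candidates = {n for m in matches for n in path_to_root(m)}
    max_depth = max(len(path_to_root(m)) for m in matches)
    def penalty(node):
        covered = sum(node in path_to_root(m) for m in matches)
        missed = len(matches) - covered
        generality = max_depth - len(path_to_root(node))
        return missed + 0.5 * generality
    return min(candidates, key=penalty)

# a read matching two species of the same genus is assigned to the genus
print(assign_read(["B1", "B2"]))   # -> G1
```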
APA, Harvard, Vancouver, ISO, and other styles
29

Gomes, Ana Rita Silva. "Inovação e exportação nas PME's e nas grandes empresas." Master's thesis, Instituto Superior de Economia e Gestão, 2010. http://hdl.handle.net/10400.5/3387.

Full text
Abstract:
Master's degree in Economic Sciences
This study analyses the main factors explaining the exports and the research and development (R&D) expenditure of SMEs and large companies operating in Portugal over the period 2004-2008. Using a sample of 200 SMEs and 30 large exporting companies, the study applies panel data with fixed-effects and random-effects estimators to estimate the effects on exports and on R&D expenditure. Regarding exports, the study finds a positive effect of increased productivity and R&D expenditure in both SMEs and large companies, and that foreign-owned SMEs export more than national SMEs. As for the determinants of R&D expenditure, the study concludes that increases in equity and net income have a positive effect on R&D spending in large companies, whereas in SMEs it is the growth of exports that drives R&D spending, with increases in net income having a negative effect.
This study analyses the main determinants of the exports and the research and development (R&D) expenses of small and medium enterprises (SMEs) and large companies operating in Portugal during the period 2004-2008. From a sample of 200 SMEs and 30 major exporting companies, the study uses panel data analysis with fixed-effects and random-effects estimators to estimate the effects on exports and on R&D. Regarding exports, the study found a positive effect of increased productivity and R&D in both SMEs and large companies. The results also suggest that SMEs owned by foreign enterprises export more than national SMEs. In relation to the determinants of spending on R&D, the study concludes that increases in equity and net income have a positive effect on R&D spending in large companies, while in SMEs increased expenditure on R&D is a consequence of increasing exports, whereas increases in net income have a negative effect on R&D.
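A minimal sketch of the fixed-effects (within) estimator the study relies on: demean each variable by firm and regress the demeaned exports on the demeaned R&D spending. The firms, years and values below are invented, not the study's data.

```python
import pandas as pd

# toy balanced panel: 3 firms observed for 4 years (illustrative values)
df = pd.DataFrame({
    "firm":    ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "exports": [10, 12, 13, 15, 5, 6, 6, 7, 20, 22, 25, 27],
    "rnd":     [1.0, 1.2, 1.3, 1.5, 0.2, 0.3, 0.3, 0.4, 2.0, 2.2, 2.5, 2.8],
})

# within (fixed-effects) estimator: demean by firm, then run OLS
demeaned = df.groupby("firm")[["exports", "rnd"]].transform(lambda s: s - s.mean())
x = demeaned["rnd"].to_numpy()
y = demeaned["exports"].to_numpy()
beta = (x @ y) / (x @ x)   # slope of exports on R&D net of firm effects
print(round(beta, 3))
```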
APA, Harvard, Vancouver, ISO, and other styles
30

Carraro, Marco. "Development of bioinformatics tools to predict disease predisposition from Next Generation Sequencing (NGS) data." Doctoral thesis, Università degli studi di Padova, 2018. http://hdl.handle.net/11577/3426807.

Full text
Abstract:
The sequencing of the human genome has opened up completely new avenues in research, and the notion of personalized medicine has become common. DNA sequencing technology has evolved by several orders of magnitude, coming into the range of $1,000 for a complete human genome. The promise of identifying genetic variants that influence our lifestyles and make us susceptible to diseases is now becoming reality. However, genome interpretation remains one of the most challenging problems of modern biology. The focus of my PhD project is the development of bioinformatics tools to predict disease predisposition from sequencing data. Several of these methods have been tested in the context of the Critical Assessment of Genome Interpretation (CAGI), always achieving good prediction performance. During my PhD project I faced the complete spectrum of challenges to be addressed in order to translate the sequencing revolution into clinical practice. One of the biggest problems when dealing with sequencing data is the interpretation of the pathogenic effect of variants. Dozens of bioinformatics tools have been created to separate mutations that could be involved in a pathogenic phenotype from neutral variants. In this context the problem of benchmarking is critical, as prediction performance is usually tested on different sets of variants, making comparison among these tools impossible. To address this problem I performed a blinded comparison of pathogenicity predictors in the context of CAGI, realizing the most complete performance assessment among all the iterations of this collaborative experiment. Another challenge that needs to be addressed to realize the personalized medicine revolution is phenotype prediction. During my PhD I had the opportunity to develop several methods for complex phenotype prediction from targeted enrichment and exome sequencing data. In this context, challenges such as misinterpretation or over-interpretation of variant pathogenicity have emerged, as in the case of phenotype prediction from the Hopkins Clinical Panel. In addition, other complementary issues of phenotype prediction, such as the possible presence of incidental findings, have to be considered. Ad hoc prediction strategies have been defined while dealing with different kinds of sequencing data. A clear example is the case of Crohn's disease risk prediction. Again in the context of the CAGI experiment, three iterations of this prediction challenge have been run so far. Analysis of the datasets revealed how population structure and bias in data preparation and sequencing can affect prediction performance, leading to inflated results. For this reason a completely new prediction strategy was defined for the last edition of the Crohn's disease challenge, exploiting data from Genome-Wide Association Studies and Protein-Protein Interaction networks to address the problem of missing heritability. Good prediction performance was achieved, especially for individuals with an extreme predicted risk score. Last, my work focused on the prediction of a health-related trait: the blood group phenotype. The accuracy of serological tests is very poor for minor blood groups or weak phenotypes, and blood group incompatibilities can be harmful for critical individuals such as onco-haematological patients. BOOGIE exploits haplotype tables and the nearest neighbor algorithm to identify the correct phenotype of a patient. The accuracy of our method has been tested on the ABO and RhD systems, achieving good results. In addition, our analyses paved the way for a further increase in performance, moving towards a prediction system that in the future could become a real alternative to wet-lab experiments.
The completion of the human genome project has opened numerous new research horizons. Among these, the possibility of knowing the genetic basis that makes each individual susceptible to different diseases has paved the way for a new revolution: the advent of personalized medicine. DNA sequencing technologies have evolved considerably, and today the price of sequencing a genome is close to the psychological threshold of $1,000. The promise of identifying genetic variants that influence our lifestyle and make us susceptible to disease is therefore becoming reality. However, much work is still needed for this new kind of medicine to become reality: the challenge today is no longer the generation of sequencing data but their interpretation. The goal of my PhD project is the development of bioinformatics methods to predict predisposition to disease starting from sequencing data. Many of these methods have been tested in the context of the Critical Assessment of Genome Interpretation (CAGI), an international competition focused on defining the state of the art in genome interpretation, always obtaining good results. During my PhD project I had the opportunity to face the whole spectrum of challenges that must be handled to translate the new genome sequencing capabilities into clinical practice. One of the main problems when dealing with sequencing data is the interpretation of the pathogenicity of mutations. Dozens of predictors have been created to separate neutral variants from mutations that may cause a pathological phenotype. In this context the benchmarking problem is fundamental, since the performance of these tools is usually tested on different variant datasets, making a comparison of performance impossible. To address this problem, a comparison of the accuracy of these predictors was carried out on a set of mutations with unknown phenotype in the context of CAGI, realizing the most complete assessment of pathogenicity predictors among all the editions of this collaborative experiment. Phenotype prediction from sequencing data is another challenge that must be faced to fulfil the promises of personalized medicine. During my PhD I had the opportunity to develop several predictors for complex phenotypes using data from gene panels and exomes. In this context, problems such as misinterpretation or over-interpretation of variant pathogenicity were faced, as in the case of the challenge focused on phenotype prediction from the Hopkins Clinical Panel. Other problems complementary to phenotype prediction also emerged, such as the possible presence of incidental findings. Specific prediction strategies were defined while working with different kinds of sequencing data. One example is Crohn's disease: three editions of CAGI have proposed the challenge of identifying healthy individuals and those affected by this inflammatory disease using only exome sequencing data.
Analysis of the datasets revealed how the presence of population structure and problems in the preparation and sequencing of the exomes compromised the predictions for this phenotype, generating an overestimation of prediction performance. Taking this into account, a completely new prediction strategy was defined for this phenotype and tested in the latest edition of CAGI. Data from GWAS association studies and the analysis of protein-protein interaction networks were used to define lists of genes involved in the onset of the disease. Good prediction performance was obtained, particularly for individuals assigned a high probability of being affected. Finally, my work focused on the prediction of blood groups, again from sequencing data. The accuracy of serological tests is indeed reduced for minor blood groups or weak phenotypes, and incompatibilities in these blood groups can be critical for some classes of individuals, as in the case of onco-haematological patients. Our prediction strategy exploited genotype data for genes encoding blood groups, available in dedicated databases, and the nearest neighbour principle to make predictions. The accuracy of our method was tested on the ABO and RhD systems, obtaining good prediction performance. Moreover, our analyses paved the way for a further increase in the performance of this tool.
APA, Harvard, Vancouver, ISO, and other styles
31

Tominaga, Sacomoto Gustavo Akio. "Efficient algorithms for de novo assembly of alternative splicing events from RNA-seq data." Phd thesis, Université Claude Bernard - Lyon I, 2014. http://tel.archives-ouvertes.fr/tel-01015506.

Full text
Abstract:
In this thesis, we address the problem of identifying and quantifying variants (alternative splicing and genomic polymorphism) in RNA-seq data when no reference genome is available, without assembling the full transcripts. Based on the idea that each variant corresponds to a recognizable pattern, a bubble, in a de Bruijn graph constructed from the RNA-seq reads, we propose a general model for all variants in such graphs. We then introduce an exact method, called KisSplice, to extract alternative splicing events and show that it outperforms general-purpose transcriptome assemblers. We put extra effort into making KisSplice as scalable as possible. In order to improve the running time, we propose a new polynomial-delay algorithm to enumerate bubbles, and show that it is several orders of magnitude faster than previous approaches. In order to reduce memory consumption, we propose a new compact way to build and represent a de Bruijn graph, using 30% to 40% less memory than the state of the art, with an insignificant impact on construction time. Additionally, we apply the techniques developed to list bubbles to two classical problems: cycle enumeration and the K-shortest paths problem. We give the first optimal algorithm to list cycles in undirected graphs, improving over Johnson's algorithm. This is the first improvement to this problem in almost 40 years. We then consider a different parameterization of the K-shortest (simple) paths problem: instead of bounding the number of st-paths, we bound the weight of the st-paths. We present new algorithms using exponentially less memory than previous approaches.
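The bubble motif that KisSplice enumerates can be illustrated on a toy de Bruijn graph: two reads differing by one base create two paths that diverge and reconverge. The sketch below is a naive finder for such motifs, not the polynomial-delay algorithm of the thesis.

```python
from collections import defaultdict
import itertools

def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    g = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            g[kmer[:-1]].add(kmer[1:])
    return g

def walk(g, start, limit=20):
    """Follow unique out-edges from `start`; stop at branches or dead ends."""
    chain = [start]
    while len(chain) < limit and len(g.get(chain[-1], ())) == 1:
        chain.append(next(iter(g[chain[-1]])))
    return chain

def bubbles(g):
    """Yield (source, sink) pairs where two branches reconverge: the
    motif that a SNP or a local splicing variant leaves in the graph."""
    for src, outs in g.items():
        for a, b in itertools.combinations(sorted(outs), 2):
            wa, wb = walk(g, a), walk(g, b)
            common = set(wa) & set(wb)
            if common:
                yield src, next(n for n in wa if n in common)

# two reads differing by one base leave a bubble between TG and GT
g = de_bruijn(["ATGCGTGA", "ATGAGTGA"], k=3)
print(list(bubbles(g)))   # -> [('TG', 'GT')]
```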
APA, Harvard, Vancouver, ISO, and other styles
32

Barcelona, Cabeza Rosa. "Genomics tools in the cloud: the new frontier in omics data analysis." Doctoral thesis, Universitat Politècnica de Catalunya, 2021. http://hdl.handle.net/10803/672757.

Full text
Abstract:
Substantial technological advances in next generation sequencing (NGS) have revolutionized the genomic field. In recent years, the speed and throughput of NGS technologies have increased while their costs have decreased, allowing base-by-base interrogation of the human genome in an efficient and affordable way. These advances have led to a growing application of NGS technologies in clinical practice to identify genomic variants and their relationship with disease. However, data accessibility, processing and interpretation still need to improve, owing both to the huge amount of data generated by these sequencing technologies and to the large number of tools available to process them. Beyond the sheer number of variant discovery algorithms, each type of variation and data requires a specific algorithm, so a solid background in bioinformatics is needed both to select the most suitable algorithm in each case and to execute it successfully. On that basis, the aim of this project is to facilitate the processing of sequencing data for variant identification and interpretation by non-bioinformaticians, by creating high-performance workflows with a strong scientific basis that remain accessible and easy to use, together with a simple and highly intuitive platform for data interpretation. An exhaustive literature review was carried out to select the best existing algorithms for building automatic pipelines for the discovery of germline short variants (SNPs and indels) and germline structural variants (SVs), including both CNVs and chromosomal rearrangements, from modern human DNA. In addition to the variant discovery pipelines, a pipeline was implemented for in silico optimization of CNV detection from WES and TS data (isoCNV). This optimization pipeline has been shown to increase the sensitivity of CNV discovery using only NGS data, which is especially important for diagnosis in clinical settings. Furthermore, a variant discovery workflow integrating WES and RNA-seq data (varRED) was developed and shown to increase the number of variants identified over those found using WES data alone. Variant discovery is not only important for modern populations: the study of variation in ancient genomes is also essential to understand past human evolution. Thus, a germline short variant discovery pipeline for ancient WGS samples was implemented and applied to a human mandible dated between 16980 and 16510 calibrated years before present. The ancient short variants discovered were reported without further interpretation due to the low sample coverage. Finally, GINO, an easy-to-use platform distributed under user license for the visualization and interpretation of germline variants, was implemented to facilitate the interpretation of the variants identified by the workflows developed in this thesis. This thesis thus delivers the tools needed for high-performance identification of all types of germline variants, together with a powerful platform for interpreting the identified variants simply and quickly, allowing non-bioinformaticians to focus on interpreting results without worrying about data processing, with the guarantee of scientifically sound results. It has also laid the foundations for a future cloud platform for comprehensive analysis and visualization of genomic data.
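As an illustration of the chained-workflow idea this abstract describes, the sketch below shows a minimal linear pipeline runner in Python; "aligner", "sort-tool" and "variant-caller" are placeholder command names, not the tools actually used in the thesis.

```python
# Sketch of a linear workflow in which each stage's output file feeds the
# next stage. "aligner", "sort-tool" and "variant-caller" are placeholder
# command names, not real tools.
import subprocess
from pathlib import Path

def run_step(cmd: list[str], expected_output: Path) -> Path:
    """Run one pipeline stage and fail fast if its output is missing."""
    subprocess.run(cmd, check=True)
    if not expected_output.exists():
        raise RuntimeError(f"step produced no output: {expected_output}")
    return expected_output

def germline_short_variant_pipeline(ref: str, fq1: str, fq2: str) -> Path:
    bam = run_step(["aligner", ref, fq1, fq2, "-o", "sample.bam"],
                   Path("sample.bam"))
    srt = run_step(["sort-tool", str(bam), "-o", "sorted.bam"],
                   Path("sorted.bam"))
    return run_step(["variant-caller", "-R", ref, "-I", str(srt),
                     "-O", "calls.vcf"], Path("calls.vcf"))
```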
Bioinformatics
APA, Harvard, Vancouver, ISO, and other styles
33

Matocha, Petr. "Efektivní hledání překryvů u NGS dat." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2017. http://www.nusl.cz/ntk/nusl-363811.

Full text
Abstract:
The main topic of this work is the detection of overlaps in NGS data. The work surveys the NGS sequencing technologies that produce such data and gives a general definition of the overlap detection problem. It then reviews the available algorithms and approaches for detecting overlaps in NGS data and describes their principles. The second part of the work designs a tool for detecting approximate overlaps in NGS data and describes its implementation. Finally, the experiments performed with this tool are reported and the conclusions drawn from them are summarized.
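To make the overlap detection problem concrete, here is a minimal Python reference implementation of exact suffix-prefix overlap; this is a sketch only, since the thesis's tool targets approximate overlaps and uses far more efficient techniques.

```python
# Exact suffix-prefix overlap: the length of the longest suffix of `a`
# that equals a prefix of `b`, requiring at least `min_len` characters.
def longest_overlap(a: str, b: str, min_len: int = 3) -> int:
    start = 0
    while True:
        start = a.find(b[:min_len], start)  # candidate anchor in `a`
        if start == -1:
            return 0
        if b.startswith(a[start:]):         # full suffix of `a` matches
            return len(a) - start
        start += 1

reads = ["ACGTTGCA", "TGCAGGTA", "GGTACCGT"]
for r1 in reads:
    for r2 in reads:
        if r1 != r2 and (olen := longest_overlap(r1, r2)):
            print(f"{r1} -> {r2}: overlap of length {olen}")
```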
APA, Harvard, Vancouver, ISO, and other styles
34

Padmanabhan, Babu roshan. "Taxano-genomics, a strategy incorporating genomic data into the taxonomic description of human bacteria." Thesis, Aix-Marseille, 2014. http://www.theses.fr/2014AIXM5056.

Full text
Abstract:
My PhD project was to create a taxono-genomics pipeline for the comparison of multiple bacterial genomes. Secondly, I automated the process of assembly (NGS) and annotation using various open-source software tools, as well as creating in-house scripts for the lab. Finally, we incorporated the pipeline into the description of several bacterial species from our lab. This thesis is subdivided mainly into taxono-genomics and microbiogenomics. The reviews in the taxono-genomics section describe the technological advances in genomics and metagenomics relevant to the field of medical microbiology, present the taxono-genomics strategy in detail, and show how the polyphasic strategy together with genomic approaches is reshaping the definition of bacterial taxonomy. The articles describe clinically important bacteria, their whole-genome sequencing, and the genomic, comparative genomic and taxono-genomic studies of these bacteria, including Megasphaera massiliensis, Corynebacterium ihumii, Collinsella massiliensis, Clostridium dakarense, Bacillus dielmoensis, Oceanobacillus jeddahense, Occidentia massiliensis, Necropsobacter rosorum and Pantoea septica.
APA, Harvard, Vancouver, ISO, and other styles
35

Demidov, German 1990. "Methods for detection of germline and somatic copy-number variants in next generation sequencing data." Doctoral thesis, Universitat Pompeu Fabra, 2019. http://hdl.handle.net/10803/668208.

Full text
Abstract:
Germline copy-number variants (CNVs), as well as somatic copy-number alterations (CNAs), play an important role in many phenotypic traits, including genetic diseases and cancer. Next Generation Sequencing (NGS) allows accurate detection of short variants, but reliable detection of large-scale CNVs in NGS data remains challenging. In this work, I address this issue and describe a novel statistical method for the detection of CNVs and CNAs implemented in the tool ClinCNV. I present analytical performance measures of ClinCNV on different datasets, compare it with other existing methods, and show its advantages. ClinCNV is already part of the diagnostics pipeline at the Institute of Medical Genetics and Applied Genomics (IMGAG), Tuebingen, Germany, and has the potential to facilitate molecular diagnostics of genetic diseases as well as cancer through accurate detection of copy-number variants.
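The read-depth principle behind coverage-based CNV callers can be illustrated with a toy sketch; this is not ClinCNV's actual statistical method, only the basic idea of normalizing per-region coverage against a reference panel and flagging deviations from the diploid expectation.

```python
# Toy read-depth CNV caller: normalize each region's depth by the median
# depth of a reference panel and flag large deviations from the diploid
# expectation. Thresholds here are arbitrary illustration values.
import statistics

def call_cnv_states(sample_depth, panel_depths, del_thr=0.7, dup_thr=1.3):
    calls = []
    for i, depth in enumerate(sample_depth):
        panel_median = statistics.median(p[i] for p in panel_depths)
        ratio = depth / panel_median
        if ratio < del_thr:
            calls.append((i, round(ratio, 2), "deletion"))
        elif ratio > dup_thr:
            calls.append((i, round(ratio, 2), "duplication"))
    return calls

panel = [[100, 98, 102, 95], [97, 101, 99, 100], [103, 99, 100, 98]]
sample = [101, 52, 99, 151]  # region 1 looks deleted, region 3 duplicated
print(call_cnv_states(sample, panel))
```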
APA, Harvard, Vancouver, ISO, and other styles
36

Pesare, Stefano. "Sistemi di Backup e tecniche di conservazione dei dati digitali." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2018.

Find full text
Abstract:
This thesis addresses the often underestimated problem of digital data preservation. Today's preservation and archiving techniques and strategies cannot, on their own, guarantee the security of data over time, but only when used together. Along the way we will see what digital data are, their characteristics and the problems involved in managing them, as well as preservation and storage techniques. We will see how mass storage devices have evolved, from punched cards to the birth of solid-state drives. The thesis also introduces Cloud Computing and the range of services it offers, including Cloud Storage. Finally, it presents the main compression algorithms, which are useful in data management.
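As a small, hedged illustration of the lossless compression the thesis surveys, the following Python snippet uses the built-in zlib module (DEFLATE) and shows that repetitive data compresses far better than high-entropy data.

```python
# Lossless compression demo with the built-in zlib module (DEFLATE):
# repetitive data shrinks dramatically, high-entropy data barely at all.
import os
import zlib

repetitive = b"backup " * 1000
random_like = os.urandom(7000)

for label, payload in [("repetitive", repetitive), ("random", random_like)]:
    packed = zlib.compress(payload, 9)
    assert zlib.decompress(packed) == payload  # round trip is lossless
    print(f"{label}: {len(payload)} -> {len(packed)} bytes")
```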
APA, Harvard, Vancouver, ISO, and other styles
37

Finotello, Francesca. "Computational methods for the analysis of gene expression from RNA sequencing data." Doctoral thesis, Università degli studi di Padova, 2014. http://hdl.handle.net/11577/3423789.

Full text
Abstract:
In every living organism, the entirety of its hereditary information is encoded, in the form of DNA, in the so-called genome. The genome consists of both genes and non-coding sequences and contains all the information needed to determine the properties and functions of each single cell. Cells access and translate specific instructions of this code through gene expression, namely by selectively switching a particular set of genes on and off. Through gene expression, the information encoded in the active genes is transcribed into RNAs. This set of RNAs reflects the current state of a cell and can reveal pathological mechanisms underlying diseases. In recent years, a novel methodology for RNA sequencing, called RNA-seq, has been replacing microarrays for the study of gene expression. The sequencing framework of RNA-seq makes it possible to investigate at high resolution all the RNA species present in a sample, characterizing their sequences and quantifying their abundances at the same time. In practice, millions of short sequences, called reads, are sequenced from random positions of the input RNAs. These reads are then computationally mapped to a reference genome to obtain a transcriptional map, where the number of reads aligned to each gene, called counts, gives a measure of its level of expression. At first glance this scheme may seem very simple, but the implementation of the whole analysis workflow is in fact complex and not well defined. Many computational methods have been proposed for the different steps of RNA-seq data analysis, but a unified processing pipeline is still lacking. The aim of my Ph.D. research project was the implementation of a robust computational pipeline for RNA-seq data analysis, from data pre-processing to differential expression detection. The different analysis modules were defined in several steps. First, we drafted a basic analysis framework through the study of RNA-seq data features and the dissection of data models and state-of-the-art algorithmic strategies. Then, we focused on count bias, one of the most challenging aspects of RNA-seq data analysis. We demonstrated that some biases affecting counts can be effectively corrected with current normalization methods, while others, like length bias, cannot be completely removed without introducing additional systematic errors. Thus, we defined a novel approach to compute RNA-seq counts which strongly reduces length bias prior to normalization and is robust to the upstream processing steps. Finally, we defined the complete analysis pipeline using the best performing methods and optimized some specific processing steps to enable correct expression estimates even in the presence of highly similar genomic sequences. The implemented pipeline was applied to a real case study to identify the genes involved in the pathogenesis of spinal muscular atrophy (SMA) from RNA-seq data of patients and healthy controls. SMA is a degenerative neuromuscular disease that has no cure and represents one of the major genetic causes of infant mortality. We identified a set of genes related to skeletal muscle and connective tissue disorders whose patterns of differential expression correlate with phenotype and may underlie protective mechanisms against SMA progression. Some putative positive targets identified by this analysis are currently under biological validation, since they might improve diagnostic screening and therapy.
To lay the basis for future research, which will focus on optimizing the processing pipeline and extending it to the analysis of dynamic expression data, we designed two time-series RNA-seq data sets: a real one and a simulated one. The experimental and sequencing design of the real data set, as well as the modelling of the synthetic data, have been an integral part of the Ph.D. activity. Overall, this thesis considers each step of RNA-seq data processing and provides valuable guidelines in a fast-evolving research field that, up to now, has prevented the establishment of a stable and standardized analysis scheme.
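For readers unfamiliar with length bias, the short Python sketch below shows why raw counts mislead and how a standard length-aware rescaling such as TPM behaves; it illustrates common practice, not the specific correction developed in this thesis.

```python
# Toy example of length bias: geneB yields twice the reads of geneA only
# because it is twice as long. TPM divides counts by length before
# normalizing, so both genes end up with the same expression value.
def tpm(counts, lengths_kb):
    rates = [c / l for c, l in zip(counts, lengths_kb)]  # reads per kilobase
    scale = sum(rates) / 1_000_000
    return [r / scale for r in rates]

counts = [500, 1000]      # raw counts for geneA, geneB
lengths_kb = [1.0, 2.0]   # geneA is 1 kb, geneB is 2 kb
print(tpm(counts, lengths_kb))  # -> [500000.0, 500000.0]
```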
APA, Harvard, Vancouver, ISO, and other styles
38

Sutharzan, Sreeskandarajan. "CLUSTERING AND VISUALIZATION OF GENOMIC DATA." Miami University / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=miami1563973517163859.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Evenstone, Lauren. "Employing Limited Next Generation Sequence Data for the Development of Genetic Loci of Phylogenetic and Population Genetic Utility." FIU Digital Commons, 2015. http://digitalcommons.fiu.edu/etd/2191.

Full text
Abstract:
Massively parallel high-throughput sequencers are transforming scientific research by reducing the cost and time necessary to sequence entire genomes. The goal of this project is to produce preliminary genome assemblies of calliphorid flies using Life Technologies' Ion Torrent sequencing and Illumina's MiSeq sequencing. I located, assembled, and annotated a novel mitochondrial genome for one such fly, the little-studied Chrysomya pacifica, which is central to one hypothesis about blow fly evolution. Combined with sequencing data from Chrysomya megacephala, its forensically relevant sister species, much insight can be gained through alignments, sequence and protein analysis, and other tools within the CLC Genomics Workbench software. I present here these analyses of the two recently diverged species.
APA, Harvard, Vancouver, ISO, and other styles
40

Camerlengo, Terry Luke. "Techniques for Storing and Processing Next-Generation DNA Sequencing Data." The Ohio State University, 2014. http://rave.ohiolink.edu/etdc/view?acc_num=osu1388502159.

Full text
APA, Harvard, Vancouver, ISO, and other styles
41

蘇金照 and Kam-chiu Ivan So. "Social workers' and NGOs' attitudes towards using computers in social welfare services." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1993. http://hub.hku.hk/bib/B31977467.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Defibaugh, June, and Norman Anderson. "National Guard Data Relay and the LAV Sensor System." International Foundation for Telemetering, 1996. http://hdl.handle.net/10150/611416.

Full text
Abstract:
International Telemetering Conference Proceedings / October 28-31, 1996 / Town and Country Hotel and Convention Center, San Diego, California
The Defense Evaluation Support Activity (DESA) is an independent Office of the Secretary of Defense (OSD) activity that provides tailored evaluation support to government organizations. DESA provides quick-response support and performs work ranging from studies to large-scale field activities that include deployment, instrumentation, site setup, event execution, analysis and report writing. The National Guard Bureau requested DESA's assistance in the development and field testing of the Light Armored Vehicle (LAV) Sensor Suite (LSS). The LSS was integrated by DESA to provide a multi-sensor suite that detects and identifies ground targets on foot or in vehicles with minimal operator workload. It was designed primarily for deployment in high-density drug trafficking areas along the northern and southern borders, built mainly from commercial-off-the-shelf and government-off-the-shelf equipment. Field testing of the system prototype in the summer of 1995 indicates that the LSS will give the National Guard a significant new data collection and transfer capability for controlling illegal drug transfer across U.S. borders.
APA, Harvard, Vancouver, ISO, and other styles
43

Alshatti, Danah Ahmed. "Examining Driver Risk Factors in Road Departure Conflicts Using SHRP2 Data." University of Dayton / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=dayton152534759506242.

Full text
APA, Harvard, Vancouver, ISO, and other styles
44

Gatton, Tim. "Using Telemetry Front-end Equipment and Network Attached Storage Connected to Form a Real-time Data Recording and Playback System." International Foundation for Telemetering, 2004. http://hdl.handle.net/10150/605316.

Full text
Abstract:
International Telemetering Conference Proceedings / October 18-21, 2004 / Town & Country Resort, San Diego, California
The use of traditional telemetry decommutation equipment can be easily expanded to create a real-time pulse code modulation (PCM) telemetry data recorder. However, there are two areas that create unique demands where architectural investment is required: the PCM output stage and the storage stage. This paper details the efforts to define the requirements and limits of a traditional telemetry system when used as a real-time, multistream PCM data recorder with time tagging.
APA, Harvard, Vancouver, ISO, and other styles
45

Ishi, Soares de Lima Leandro. "De novo algorithms to identify patterns associated with biological events in de Bruijn graphs built from NGS data." Thesis, Lyon, 2019. http://www.theses.fr/2019LYSE1055/document.

Full text
Abstract:
The main goal of this thesis is the development, improvement and evaluation of methods to process massively sequenced data, mainly short and long RNA-sequencing reads, to help the community answer biological questions, especially in the transcriptomic and alternative splicing contexts. Our initial objective was to develop methods to process second-generation RNA-seq data through de Bruijn graphs to contribute to the literature on alternative splicing, which was explored in the first three works. The first paper (Chapter 3, paper [77]) explored the issue that repeats bring to transcriptome assemblers if not addressed properly. We showed that the sensitivity and the precision of our local alternative splicing assembler increased significantly when repeats were formally modeled. The second (Chapter 4, paper [11]) shows that annotating alternative splicing events with a single approach leads to missing a large number of candidates, many of which are significant. Thus, to comprehensively explore the alternative splicing events in a sample, we advocate the combined use of both mapping-first and assembly-first approaches. Given the huge number of bubbles in de Bruijn graphs built from real RNA-seq data, which cannot feasibly be analysed in practice, in the third work (Chapter 5, papers [1, 2]) we explored theoretically how to represent the bubble space efficiently and compactly through a bubble generator. Exploring and analysing the bubbles in the generator is feasible in practice and can complement state-of-the-art algorithms that analyse a subset of the bubble space. Collaborations and advances in sequencing technology encouraged us to work in other subareas of bioinformatics: genome-wide association studies, error correction, and hybrid assembly. Our fourth work (Chapter 6, paper [48]) describes an efficient method to find and interpret unitigs highly associated with a phenotype, especially antibiotic resistance, making genome-wide association studies more amenable to bacterial panels, especially plastic ones. In our fifth work (Chapter 7, paper [76]), we evaluate the extent to which existing long-read DNA error correction methods are capable of correcting high-error-rate RNA-seq long reads. We conclude that no tool outperforms all the others across all metrics or is the most suited in all situations, and that the choice should be guided by the downstream analysis. RNA-seq long reads provide a new perspective on how to analyse transcriptomic data, since they are able to describe the full-length sequences of mRNAs, which was often not possible with short reads, even using state-of-the-art transcriptome assemblers. As such, in our last work (Chapter 8, paper [75]) we explore a hybrid alternative splicing assembly method that makes use of both short and long reads in order to list alternative splicing events comprehensively, thanks to the short reads, guided by the full-length context provided by the long reads.
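A minimal Python sketch of the de Bruijn graph construction underlying these works may help: nodes are (k-1)-mers and every k-mer of a read adds an edge, so two sequences differing at one base create the branch-and-reconverge pattern (a "bubble") that the thesis studies. This is an illustration, not the thesis's implementation.

```python
# Minimal de Bruijn graph construction: nodes are (k-1)-mers and each
# k-mer of a read adds a directed edge from its prefix to its suffix.
from collections import defaultdict

def de_bruijn(reads, k):
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

# A single substitution between two otherwise identical sequences opens a
# bubble: the paths diverge at node TGG and reconverge at node GTG.
reads = ["ATGGCGTGCA", "ATGGAGTGCA"]
for node, succs in sorted(de_bruijn(reads, 4).items()):
    print(node, "->", sorted(succs))
```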
APA, Harvard, Vancouver, ISO, and other styles
46

Ekström, Ted, and Eriksson Simon Kristensson. "Datalagring : nätverkslösning." Thesis, Högskolan Kristianstad, Sektionen för hälsa och samhälle, 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:hkr:diva-10404.

Full text
Abstract:
In today's society, many client companies struggle with data storage: the need to store data began to grow sharply during the 2000s and has kept growing since. When client companies started using virtualization, data volumes expanded further. As data volumes grew, companies no longer wanted data stored locally but in a central, easily surveyed location such as a server hall. The study was carried out in three ways: first, a literature review of the available data storage solutions; then, contacts with companies using different storage solutions; finally, contacts with the vendors that supplied those solutions to the client companies. In the course of the work, various solutions to the problem of growing storage needs were identified. The results showed that client companies used central data storage solutions such as NAS (Network Attached Storage), SAN (Storage Area Network) and NUS (Network Unified Storage), systems that were optimized and secured to meet their storage needs. After the survey, both the literature study and the interview material were analysed, with the aim of recommending the alternatives that new as well as existing client companies should choose to meet their needs.
APA, Harvard, Vancouver, ISO, and other styles
47

Schimd, Michele. "Quality value based models and methods for sequencing data." Doctoral thesis, Università degli studi di Padova, 2015. http://hdl.handle.net/11577/3424144.

Full text
Abstract:
First isolated by Friedrich Miescher in 1869 and then identified by James Watson and Francis Crick in 1953, the double-stranded DeoxyriboNucleic Acid (DNA) molecule of Homo sapiens took fifty years to be completely reconstructed and finally placed at the disposal of researchers for deep studies and analyses. The first technologies for DNA sequencing appeared around the mid-1970s; among them the most successful has been the chain termination method, usually referred to as the Sanger method. It remained the de facto standard for sequencing until, at the beginning of the 2000s, Next Generation Sequencing (NGS) technologies started to be developed. These technologies are able to produce huge amounts of data at competitive costs in terms of dollars per base, and now further advances are appearing in the form of Single Molecule Real Time (SMRT) sequencers, like those of Pacific Biosciences, which promise to produce fragments of lengths never available before. However, none of the above technologies is able to read an entire DNA molecule; they can only produce short fragments (called reads) of the sample in a process referred to as sequencing. Although these technologies have different characteristics, one recurrent trend in their evolution has been the constant growth of the fraction of errors injected into the final reads. While Sanger machines produce as few as 1 erroneous base in 1000, recent PacBio sequencers have an average error rate of 15%; NGS machines place themselves roughly in the middle, with an expected error rate around 1%. With such heterogeneity of error profiles, and as more and more data is produced every day, algorithms able to cope with different sequencing technologies are becoming fundamental; at the same time, models describing sequencing that include error profiling are gaining importance. A key feature that can make these approaches really effective is the ability of sequencers to produce quality scores, which measure the probability of observing a sequencing error. In this thesis we present a stochastic model of the sequencing process and show its application to the problems of clustering and filtering reads. The novel idea is to use quality scores to build a probabilistic framework that models the entire process of sequencing. Although relatively straightforward, the development of such a model requires the proper definition of probability spaces and of events on those spaces. To keep the model simple and tractable, several simplifying hypotheses need to be introduced; each of them, however, must be explicitly stated and extensively discussed. The final result is a model of the sequencing process that can be used to give a probabilistic interpretation of problems defined on sequencing data and to characterize the corresponding probabilistic answers (i.e., solutions). To experimentally validate the model, we apply it to two different problems: read clustering and read filtering. The first set of experiments introduces a set of novel alignment-free measures, Dq2, which extend the well-known D2-type measures to incorporate quality values. More precisely, instead of adding a unit contribution to the k-mer count statistic (as for D2 statistics), each k-mer contributes an additive term corresponding to its probability of being correct, as defined by our stochastic model.
We show that these new measures are effective when applied to the clustering of reads, by employing clusters produced with Dq2 as input to the problems of metagenomic binning and de-novo assembly. In the second set of experiments conducted to validate our stochastic model, we applied the same definition of a correct read to the problem of read filtering. We first define rank filtering, a lossless filtering technique that sorts reads based on a given criterion; we then used the sorted list of reads as input to algorithms for read mapping and de-novo assembly. The idea is that, in the reordered set, reads ranking higher should have better quality than those at lower ranks. To test this conjecture, we use such filtering as a pre-processing step for read mapping and de-novo assembly; in both cases we observe improvements when our rank filtering approach is used.
The thesis is structured as follows. Chapter 1 provides an introduction to sequencing, an overview of the main problems defined on the data it produces, and some notes on the representation of sequences, reads and quality values, closing with an outline of the main contributions of the thesis and the related literature. Chapter 2 contains the formal derivation of the probabilistic model of sequencing: it first presents schematically the production of a symbol-quality pair, then defines probability spaces for sequences and sequencing, and finally derives the probability that a read is correct, which is used in the following chapters. Chapter 3 presents the Dq2 measures and the clustering experiments, the result of work carried out in collaboration with Matteo Comin and Andrea Leoni and published in [CLS14] and [CLS15]. Chapter 4 presents the preliminary results obtained so far for quality-value-based read filtering. Finally, Chapter 5 draws conclusions and outlines the future directions in which this work will be continued.
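The quality-value arithmetic the model builds on can be sketched in a few lines of Python: a Phred score Q maps to an error probability 10^(-Q/10) and, under an independence assumption across bases, a read is correct with probability equal to the product of (1 - p_i); the common FASTQ ASCII offset of 33 is assumed here.

```python
# Phred quality arithmetic: Q maps to an error probability 10**(-Q / 10);
# under the independence assumption, a read is fully correct with
# probability prod(1 - p_i). FASTQ ASCII offset 33 is assumed here.
import math

def base_error_prob(q: int) -> float:
    return 10 ** (-q / 10)

def read_correct_prob(quality_string: str, offset: int = 33) -> float:
    log_p = 0.0
    for ch in quality_string:
        p_err = base_error_prob(ord(ch) - offset)
        log_p += math.log1p(-p_err)   # sum of logs avoids numerical underflow
    return math.exp(log_p)

print(read_correct_prob("IIIIIIIIII"))  # ten Q40 bases: ~0.999
print(read_correct_prob("##########"))  # ten Q2 bases: ~5e-5, almost surely wrong
```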
APA, Harvard, Vancouver, ISO, and other styles
48

Britto, Fernando Perez de. "Perspectivas organizacional e tecnológica da aplicação de analytics nas organizações." Pontifícia Universidade Católica de São Paulo, 2016. https://tede2.pucsp.br/handle/handle/19282.

Full text
Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
The use of Analytics technologies is gaining prominence in organizations exposed to pressures for greater profitability and efficiency and to a highly globalized and competitive environment in which cycles of economic growth and recession and cycles of liberalism and interventionism, short or long, are more frequent. However, the use of these technologies is complex and influenced by conceptual, human, organizational and technological aspects, the latter especially in relation to the manipulation and analysis of large volumes of data (Big Data). Based on a bibliographic review of the organizational and technological perspectives, this work initially deals with the concepts and technologies relevant to the use of Analytics in organizations, and then explores issues related to the alignment between business processes and data and information, the assessment of the potential of using Analytics, the use of Analytics in performance management, in process optimization and as decision support, and the establishment of a continuous improvement process. It closes with a reflection on the directions, approaches, next steps, opportunities and challenges related to the use of Analytics in organizations.
APA, Harvard, Vancouver, ISO, and other styles
49

Kawalia, Amit [Verfasser], Peter [Gutachter] Nürnberg, and Michael [Gutachter] Nothnagel. "Addressing NGS Data Challenges: Efficient High Throughput Processing and Sequencing Error Detection / Amit Kawalia ; Gutachter: Peter Nürnberg, Michael Nothnagel." Köln : Universitäts- und Stadtbibliothek Köln, 2016. http://d-nb.info/112370368X/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
50

Chebbo, Manal. "Simulation fine d'optique adaptative à très grand champ pour des grands et futurs très grands télescopes." Thesis, Aix-Marseille, 2012. http://www.theses.fr/2012AIXM4733/document.

Full text
Abstract:
Refined simulation tools for wide-field AO systems on ELTs face two challenges. The first is the increase in the number of degrees of freedom, which makes standard simulation codes unusable because of the huge number of operations to be performed at each step of the AO loop: the classical matrix inversion and vector-matrix multiplies (VMM) have to be replaced by a smarter iterative resolution of the Least Square or Minimum Mean Square Error criterion. The second is that, for this new generation of AO systems, the concepts themselves become more complex: data fusion coming from multiple LGS and NGS will have to be optimized, mirrors covering the full field of view will have to be coupled with dedicated mirrors inside the scientific instrument itself using split or integrated tomography schemes, and differential pupil and/or field rotations will have to be considered. All these new features must be carefully simulated, analysed and quantified in terms of performance before any implementation in AO systems. For these reasons I developed, in collaboration with ONERA, a full simulation code based on the iterative solution of linear systems with many parameters (sparse matrices). On this basis, I introduced new concepts of filtering and data fusion to effectively manage modes such as tip, tilt and defocus in the entire process of tomographic reconstruction. The code will also eventually help to develop and test the complex control laws that have to manage the combination of an adaptive telescope and post-focal instruments that themselves include dedicated deformable mirrors. The first application of this tool naturally falls within the EAGLE project, one of the flagship instruments of the future E-ELT, which from the AO point of view combines all of these challenges.
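As an illustration of replacing explicit inversion with an iterative sparse solve, the sketch below uses SciPy's conjugate gradient on a synthetic sparse symmetric positive-definite system standing in for the tomographic reconstruction matrix; the matrix contents are made up for the example.

```python
# Illustrative only: iterative solution of a large sparse system A x = b,
# the kind of computation that replaces explicit matrix inversion in
# wide-field AO reconstruction. The matrix here is synthetic.
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

n = 10_000
# Tridiagonal SPD stand-in for the reconstruction matrix.
A = diags([-1.0, 4.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

x, info = cg(A, b)  # conjugate gradient: no explicit inverse is ever formed
residual = np.linalg.norm(A @ x - b)
print("converged" if info == 0 else f"cg stopped with code {info}", residual)
```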
APA, Harvard, Vancouver, ISO, and other styles
