Dissertations / Theses on the topic 'Data / knowledge partitioning and distribution'
Consult the top 15 dissertations / theses for your research on the topic 'Data / knowledge partitioning and distribution.'
De, Oliveira Joffrey. "Gestion de graphes de connaissances dans l'informatique en périphérie : gestion de flux, autonomie et adaptabilité." Electronic Thesis or Diss., Université Gustave Eiffel, 2023. http://www.theses.fr/2023UEFL2069.
The research work carried out as part of this PhD thesis lies at the interface between the Semantic Web, databases and edge computing. Indeed, our objective is to design, develop and evaluate a database management system (DBMS) based on the W3C Resource Description Framework (RDF) data model, which must be adapted to the terminals found in Edge computing. The possible applications of such a system are numerous and cover a wide range of sectors such as industry, finance and medicine, to name but a few. As proof of this, the subject of this thesis was defined with the team from the Computer Science and Artificial Intelligence Laboratory (CSAI) at ENGIE Lab CRIGEN. The latter is ENGIE's research and development centre dedicated to green gases (hydrogen, biogas and liquefied gases), new uses of energy in cities and buildings, industry and emerging technologies (digital and artificial intelligence, drones and robots, nanotechnologies and sensors). CSAI financed this thesis as part of a CIFRE-type collaboration.

The functionalities of a system satisfying these characteristics must enable anomalies and exceptional situations to be detected in a relevant and effective way from measurements taken by sensors and/or actuators. In an industrial context, this could mean detecting excessively high measurements, for example of pressure or flow rate in a gas distribution network, which could potentially compromise infrastructure or even the safety of individuals. This detection must be carried out using a user-friendly approach to enable as many users as possible, including non-programmers, to describe risk situations. The approach must therefore be declarative, not procedural, and must be based on a query language, such as SPARQL. We believe that Semantic Web technologies can make a major contribution in this context. Indeed, the ability to infer implicit consequences from explicit data and knowledge is a means of creating new services that are distinguished by their ability to adjust to the circumstances encountered and to make autonomous decisions. This can be achieved by generating new queries in certain alarming situations, or by defining a minimal sub-graph of knowledge that an instance of our DBMS needs in order to respond to all of its queries.

The design of such a DBMS must also take into account the inherent constraints of Edge computing, i.e. the limits in terms of computing capacity, storage, bandwidth and sometimes energy (when the terminal is powered by a solar panel or a battery). Architectural and technological choices must therefore be made to meet these limitations. With regard to the representation of data and knowledge, our design choice fell on succinct data structures (SDS), which offer, among other advantages, the fact that they are very compact and do not require decompression during querying. Similarly, it was necessary to integrate data flow management within our DBMS, for example with support for windowing in continuous SPARQL queries, and for the various services supported by our system. Finally, as anomaly detection is an area where knowledge can evolve, we have integrated support for modifications to the knowledge graphs stored on the client instances of our DBMS. This support translates into an extension of certain SDS structures used in our prototype.
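Purely as an illustration of the declarative style this abstract argues for, the following sketch uses rdflib (a generic Python RDF library, not the thesis's DBMS) and an invented http://example.org/ vocabulary to express a pressure-threshold alert as a SPARQL query rather than as procedural code.

```python
# Hedged sketch with hypothetical data and vocabulary; not the DBMS described
# in the thesis. A threshold on pressure plays the role of the "risk situation"
# that a non-programmer user would describe declaratively.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.sensor1, EX.pressure, Literal(4.2)))   # normal reading (bar)
g.add((EX.sensor2, EX.pressure, Literal(9.7)))   # abnormally high reading

ALERT_QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?sensor ?value WHERE {
    ?sensor ex:pressure ?value .
    FILTER (?value > 8.0)   # declarative threshold, no procedural code
}
"""

for row in g.query(ALERT_QUERY):
    print(f"ALERT: {row.sensor} reports pressure {row.value}")
```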
HE, AIJING. "UNSUPERVISED DATA MINING BY RECURSIVE PARTITIONING." University of Cincinnati / OhioLINK, 2002. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1026406153.
Eberhagen, Niclas. "An investigation of emerging knowledge distribution means and their characterization." Licentiate thesis, Department of Computer and Systems Sciences, Stockholm University, 1999. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-8262.
Licentiate thesis in partial fulfillment of the Licentiate of Philosophy degree in Computer and Systems Sciences, Stockholm University
George, Chadrick Hendrik. "Knowledge management infrastructure and knowledge sharing: The case of a large fast moving consumer goods distribution centre in the Western Cape." Thesis, University of the Western Cape, 2014. http://hdl.handle.net/11394/3943.
The aim of this study is to understand how knowledge is created, shared and used within a fast moving consumer goods (FMCG) distribution centre (DC) in the Western Cape (WC). It also aims to understand knowledge sharing (KS) between individuals in the organisation. A literature review was conducted in order to answer the research questions; it covered the background of knowledge management (KM) and KS and their current status, with particular reference to South Africa's private sector. The study found that technological, cultural and organisational KM infrastructure are important enablers of KS. A conceptual model was developed around these concepts. In order to answer the research questions, the study identified an FMCG DC in the WC where KS is practiced.
Arres, Billel. "Optimisation des performances dans les entrepôts distribués avec Mapreduce : traitement des problèmes de partionnement et de distribution des données." Thesis, Lyon, 2016. http://www.theses.fr/2016LYSE2012.
In this manuscript, we address the problems of data partitioning and distribution for large-scale data warehouses distributed with MapReduce. First, we address the problem of data distribution. In this case, we propose a strategy to optimize data placement on distributed systems, based on the collocation principle. The objective is to optimize query performance through the definition of an intentional data distribution schema that reduces the amount of data transferred between nodes during processing, specifically during MapReduce's shuffle phase. Secondly, we propose a new approach to improve data partitioning and placement in distributed file systems, especially Hadoop-based systems, Hadoop being the standard implementation of the MapReduce paradigm. The aim is to overcome the default data partitioning and placement policies, which do not take any relational data characteristics into account. Our proposal proceeds in two steps. Based on the query workload, it defines an efficient partitioning schema. After that, the system defines a data distribution schema that best meets users' needs by collocating data blocks on the same or closest nodes. The objective in this case is to optimize query execution and parallel processing performance by improving data access. Our third proposal addresses the problem of workload dynamicity, since users' analytical needs evolve over time. In this case, we propose the use of multi-agent systems (MAS) as an extension of our data partitioning and placement approach. Through the autonomy and self-control that characterize MAS, we developed a platform that automatically defines new distribution schemas as new queries are added to the system, and applies data rebalancing according to the new schema. This relieves the system administrator of the burden of managing load balance, besides improving query performance by adopting careful data partitioning and placement policies. Finally, to validate our contributions we conducted a set of experiments to evaluate the different approaches proposed in this manuscript. We studied the impact of intentional data partitioning and distribution on the data warehouse loading phase, the execution of analytical queries, OLAP cube construction, and load balancing. We also defined a cost model that allowed us to evaluate and validate the partitioning strategy proposed in this work.
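As an informal illustration of the collocation principle mentioned above (not the thesis's actual Hadoop mechanism), the following Python sketch buckets fact and dimension rows by a shared join key so that joinable rows land on the same node and need no shuffle; all names and the node-assignment function are hypothetical.

```python
# Minimal collocation sketch under assumed names; not the thesis's implementation.
import zlib
from typing import Dict, List, Tuple

def assign_node(join_key: str, num_nodes: int) -> int:
    """Map a join-key value to a node with a stable hash (crc32)."""
    return zlib.crc32(join_key.encode("utf-8")) % num_nodes

def collocate(fact_rows: List[Tuple[str, dict]],
              dim_rows: List[Tuple[str, dict]],
              num_nodes: int) -> Dict[int, Dict[str, list]]:
    """Bucket fact and dimension rows per node by their shared join key,
    so rows that will be joined never cross the network during the join."""
    placement: Dict[int, Dict[str, list]] = {
        n: {"fact": [], "dim": []} for n in range(num_nodes)
    }
    for key, row in fact_rows:
        placement[assign_node(key, num_nodes)]["fact"].append(row)
    for key, row in dim_rows:
        placement[assign_node(key, num_nodes)]["dim"].append(row)
    return placement
```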
Antoine, Emilien. "Distributed data management with a declarative rule-based language webdamlog." PhD thesis, Université Paris Sud - Paris XI, 2013. http://tel.archives-ouvertes.fr/tel-00933808.
Galicia, Auyón Jorge Armando. "Revisiting Data Partitioning for Scalable RDF Graph Processing Combining Graph Exploration and Fragmentation for RDF Processing Query Optimization for Large Scale Clustered RDF Data RDFPart-Suite: Bridging Physical and Logical RDF Partitioning. Reverse Partitioning for SPARQL Queries: Principles and Performance Analysis. Should We Be Afraid of Querying Billions of Triples in a Graph-Based Centralized System? EXGRAF: Exploration et Fragmentation de Graphes au Service du Traitement Scalable de Requêtes RDF." Thesis, Chasseneuil-du-Poitou, Ecole nationale supérieure de mécanique et d'aérotechnique, 2021. http://www.theses.fr/2021ESMA0001.
The Resource Description Framework (RDF) and SPARQL are very popular graph-based standards initially designed to represent and query information on the Web. The flexibility offered by RDF motivated its use in other domains, and today RDF datasets are great information sources. They gather billions of triples in Knowledge Graphs that must be stored and efficiently exploited. The first generation of RDF systems was built on top of traditional relational databases. Unfortunately, performance in these systems degrades rapidly as the relational model is not suitable for handling RDF data inherently represented as a graph. Native and distributed RDF systems seek to overcome this limitation. The former mainly use indexing as an optimization strategy to speed up queries. Distributed and parallel RDF systems resort to data partitioning. The logical representation of the database is crucial to designing data partitions in the relational model. The logical layer defining the explicit schema of the database provides a degree of comfort to database designers. It lets them choose, manually or automatically (through advisors), the tables and attributes to be partitioned. Besides, it allows the core partitioning concepts to remain constant regardless of the database management system. This design scheme is no longer valid for RDF databases, essentially because the RDF model does not explicitly enforce a schema, since RDF data is mostly implicitly structured. Thus, the logical layer is inexistent and data partitioning depends strongly on the physical implementation of the triples on disk. This situation leads to different partitioning logics depending on the target system, which is quite different from the relational model's perspective. In this thesis, we promote the novel idea of performing data partitioning at the logical level in RDF databases. Thereby, we first process the RDF data graph to support logical entity-based partitioning. After this preparation, we present a partitioning framework built upon these logical structures. This framework is accompanied by data fragmentation, allocation, and distribution procedures. The framework was incorporated into a centralized (RDF_QDAG) and a distributed (gStoreD) triple store. We conducted several experiments that confirmed the feasibility of integrating our framework into existing systems, improving their performance for certain queries. Finally, we designed a set of RDF data partitioning management tools including a data definition language (DDL) and an automatic partitioning wizard.
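To make the idea of logical, entity-based partitioning of schema-less RDF data concrete, here is a small, hypothetical Python sketch (not the RDF_QDAG or gStoreD code) that groups triples by subject and fragments the resulting entities by their characteristic set, i.e. the set of predicates that describe them.

```python
# Illustrative sketch under assumed structures; not the thesis's framework.
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def fragment_by_characteristic_set(
        triples: List[Triple]) -> Dict[FrozenSet[str], List[Triple]]:
    """Group triples into entities by subject, then place each entity in the
    fragment identified by the set of predicates describing it."""
    by_subject: Dict[str, List[Triple]] = defaultdict(list)
    for t in triples:
        by_subject[t[0]].append(t)

    fragments: Dict[FrozenSet[str], List[Triple]] = defaultdict(list)
    for subject, entity_triples in by_subject.items():
        signature = frozenset(p for _, p, _ in entity_triples)
        fragments[signature].extend(entity_triples)
    return fragments
```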
Meiring, Linda. "A distribution model for the assessment of database systems knowledge and skills among second-year university students." Thesis, [Bloemfontein?] : Central University of Technology, Free State, 2009. http://hdl.handle.net/11462/44.
Dasgupta, Arghya. "How can the 'Zeigarnik effect' be combined with analogical reasoning in order to enhance understanding of complex knowledge related to computer science?" Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-143636.
Coullon, Hélène. "Modélisation et implémentation de parallélisme implicite pour les simulations scientifiques basées sur des maillages." Thesis, Orléans, 2014. http://www.theses.fr/2014ORLE2029/document.
Parallel scientific computing is an expanding domain of computer science which increases the speed of calculations and offers a way to deal with heavier or more accurate computations. Thus, the interest in scientific computing increases, with more precise results and bigger physical domains to study. In the particular case of scientific numerical simulations, solving partial differential equations (PDEs) is an especially heavy calculation and a perfect candidate for parallel computing. On one hand, it is easier and easier to get access to very powerful parallel machines and clusters, but on the other hand parallel programming is hard to democratize, and most scientists are not able to use these machines. As a result, high-level programming models, frameworks, libraries, languages, etc. have been proposed to hide the technical details of parallel programming. However, in this "implicit parallelism" field, it is difficult to find the right abstraction level while keeping a low programming effort. This thesis proposes a model to write implicit parallelism solutions for numerical simulations such as mesh-based PDE computations. This model is called "Structured Implicit Parallelism for scientific SIMulations" (SIPSim), and proposes an approach at the crossroads of existing solutions, taking advantage of each one. A first implementation of this model is proposed as a C++ library called SkelGIS, for two-dimensional Cartesian meshes. A second implementation of the model, and an extension of SkelGIS, proposes an implicit parallelism solution for network simulations (which deal with simulations involving multiple physical phenomena), and is studied in detail. A performance analysis of both implementations is given on real-case simulations, and it demonstrates that the SIPSim model can be implemented efficiently.
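For readers unfamiliar with mesh-based PDE kernels, the sketch below shows the kind of stencil computation such implicit-parallelism libraries target; it is a plain NumPy example under assumed parameters, not the SkelGIS C++ API.

```python
# Assumed example of a mesh-based PDE kernel; unrelated to the SkelGIS API.
import numpy as np

def heat_step(u: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """One Jacobi update of the 5-point heat-equation stencil on a 2D
    Cartesian mesh; interior cells are updated, boundaries stay fixed."""
    new = u.copy()
    new[1:-1, 1:-1] = u[1:-1, 1:-1] + alpha * (
        u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
        - 4 * u[1:-1, 1:-1]
    )
    return new

# A data-parallel runtime would split `u` into blocks, run heat_step on each
# block concurrently, and exchange one-cell-wide halos between neighbours,
# which is exactly the bookkeeping implicit-parallelism models try to hide.
```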
Hejblum, Boris. "Analyse intégrative de données de grande dimension appliquée à la recherche vaccinale." Thesis, Bordeaux, 2015. http://www.theses.fr/2015BORD0049/document.
Gene expression data is recognized as high-dimensional data that needs specific statistical tools for its analysis. But in the context of vaccine trials, other measures, such as flow-cytometry measurements, are also high-dimensional. In addition, such measurements are often repeated over time. This work is built on the idea that using the maximum of available information, by modeling prior knowledge and integrating all data at hand, will improve the inference and the interpretation of biological results from high-dimensional data. First, we present an original methodological development, Time-course Gene Set Analysis (TcGSA), for the analysis of longitudinal gene expression data, taking into account prior biological knowledge in the form of predefined gene sets. Second, we describe two integrative analyses of two different vaccine studies. The first study reveals lower expression of inflammatory pathways consistently associated with lower viral rebound following a HIV therapeutic vaccine. The second study highlights the role of a testosterone-mediated group of genes linked to lipid metabolism in sex differences in immunological response to a flu vaccine. Finally, we introduce a new model-based clustering approach for the automated treatment of cell populations from flow-cytometry data, namely a Dirichlet process mixture of skew t-distributions, with a sequential posterior approximation strategy for dealing with repeated measurements. Hence, the automatic recognition of the cell populations could allow a practical improvement of the daily work of immunologists as well as a better interpretation of gene expression data after taking into account the frequency of all cell populations.
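As a rough, simplified stand-in for the Dirichlet process mixture described above (the thesis uses skew t-distributions and a sequential posterior approximation, neither of which is shown here), the sketch below uses scikit-learn's truncated Dirichlet-process Gaussian mixture on synthetic two-marker data to illustrate how the number of cell populations can be inferred rather than fixed in advance.

```python
# Hedged, simplified stand-in: Gaussian DP mixture instead of skew t; data is fake.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2-marker flow-cytometry-like data: two artificial cell populations.
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(500, 2)),
    rng.normal(loc=(2.5, 1.5), scale=0.4, size=(300, 2)),
])

dpm = BayesianGaussianMixture(
    n_components=10,  # truncation level, not the final number of populations
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

labels = dpm.predict(X)  # cluster assignment per event; unused components get ~0 weight
```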
PASINI, TOMMASO. "Knowledge-based approaches to producing large-scale training data from scratch for Word Sense Disambiguation and Sense Distribution Learning." Doctoral thesis, 2019. http://hdl.handle.net/11573/1448979.
Boychenko, Serhiy. "A Distributed Analysis Framework for Heterogeneous Data Processing in HEP Environments." Doctoral thesis, 2018. http://hdl.handle.net/10316/90651.
During the last extended maintenance period, CERN's Large Hadron Collider (LHC) and most of its equipment systems were upgraded to collide particles at an energy level almost twice as high as the previous operational limits, significantly increasing the damage potential to accelerator components in case of equipment malfunctioning. System upgrades and the increased machine energy pose new challenges for the analysis of transient data recordings, which has to be both dependable and fast to maintain the required safety level of the deployed machine protection systems while at the same time maximizing the accelerator performance. With the LHC having operated for many years already, statistical and trend analysis across the collected data sets is an additional, growing requirement. The currently deployed accelerator transient data recording and analysis systems will equally require significant upgrades, as the developed architectures, state of the art at the time of their initial development, are already working well beyond the initially provisioned capacities. Despite the fact that modern data storage and processing systems are capable of solving multiple shortcomings of the present solution, the operation of the world's biggest scientific experiment creates a set of unique challenges which require additional effort to be overcome. Among others, the dynamicity and heterogeneity of the data sources and executed workloads pose a significant challenge for modern distributed data analysis solutions to achieve their optimal efficiency. In this thesis, a novel workload-aware approach for distributed file system storage and processing solutions, Mixed Partitioning Scheme Replication, is proposed. Taking into consideration the experience of other researchers in the field and the most popular large-dataset analysis architectures, the developed solution takes advantage of both replication and partitioning in order to improve the efficiency of the underlying engine. The fundamental concept of the proposed approach is multi-criteria partitioning, optimized for the different workload categories observed on the target system. Unlike traditional solutions, the repository replicates data copies with a different structure instead of distributing the exact same representation of the data across the cluster nodes. This approach is expected to be more efficient and flexible in comparison to generically optimized partitioning schemes. Additionally, the partitioning and replication criteria can be dynamically altered in case significant workload changes with respect to the initial assumptions develop over time. The performance of the presented technique was initially assessed through simulations. A specific model which recreated the behavior of the proposed approach and of the original Hadoop system was developed. The main assumption, which allowed the system's behavior to be described for different configurations, is that an application's execution time is linearly related to its input size, as observed during the initial assessment of the distributed data storage and processing solutions. The results of the simulations made it possible to identify the profile of use cases for which Mixed Partitioning Scheme Replication was more efficient in comparison to traditional approaches, and to quantify the expected gains. Additionally, a prototype incorporating the core features of the proposed technique was developed and integrated into the Hadoop source code.

The implementation was deployed on clusters with different characteristics and in-depth performance evaluation experiments were conducted. The workload was generated by a specifically developed and highly configurable application, which in addition monitors the application execution and collects a large set of execution- and infrastructure-related metrics. The obtained results allowed the efficiency of the proposed solution to be studied on an actual physical cluster, using genuine accelerator device data and user requests. In comparison to the traditional approach, Mixed Partitioning Scheme Replication considerably decreased the application execution time and the queue size, while being slightly less efficient with respect to failure tolerance and system scalability. The analysis of the collected measurements proved the superiority of Mixed Partitioning Scheme Replication when compared to the performance of generically optimized partitioning schemes. Despite the fact that only a limited subset of configurations was assessed during the performance evaluation phase, the results validated the simulation observations, allowing the model to be used for further estimations and extrapolations towards the requirements of a full-scale infrastructure.
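The routing idea behind Mixed Partitioning Scheme Replication can be sketched in a few lines; the following Python fragment is a hypothetical illustration (replica names, criteria and query fields are invented), not the Hadoop-integrated prototype described in the abstract.

```python
# Hypothetical sketch: several replicas of the same data set, each partitioned
# by a different criterion; a query is routed to the replica whose partitioning
# lets it scan the fewest partitions.
from typing import Dict, List

# Invented replicas: partitioning criterion -> partition keys held by that copy.
REPLICAS: Dict[str, List[str]] = {
    "by_time":   ["2017-01", "2017-02", "2017-03"],
    "by_device": ["magnet_A", "magnet_B", "collimator_C"],
    "by_signal": ["current", "voltage", "temperature"],
}

def choose_replica(query_predicates: Dict[str, str]) -> str:
    """Pick the replica whose partitioning criterion matches a query predicate,
    so ideally only one partition must be read; fall back to any replica."""
    for criterion in REPLICAS:
        field = criterion.removeprefix("by_")
        if field in query_predicates:
            return criterion
    return next(iter(REPLICAS))

# A query filtering on a device name is served by the device-partitioned copy.
print(choose_replica({"device": "magnet_A", "start": "2017-02-01"}))  # -> "by_device"
```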
CERN
Dlamini, Wisdom Mdumiseni Dabulizwe. "Spatial analysis of invasive alien plant distribution patterns and processes using Bayesian network-based data mining techniques." Thesis, 2016. http://hdl.handle.net/10500/20692.
Environmental Sciences
D. Phil. (Environmental Science)
Stewart-Knox, Barbara, S. Kuznesof, J. Robinson, A. Rankin, K. Orr, M. Duffy, R. Poinhos, et al. "Factors influencing European consumer uptake of personalised nutrition. Results of a qualitative analysis." 2013. http://hdl.handle.net/10454/6205.