Dissertations / Theses on the topic 'Data / knowledge partitioning and distribution'
Consult the top 15 dissertations / theses for your research on the topic 'Data / knowledge partitioning and distribution.'
De, Oliveira Joffrey. "Gestion de graphes de connaissances dans l'informatique en périphérie : gestion de flux, autonomie et adaptabilité." Electronic Thesis or Diss., Université Gustave Eiffel, 2023. http://www.theses.fr/2023UEFL2069.
The research work carried out as part of this PhD thesis lies at the interface between the Semantic Web, databases and edge computing. Indeed, our objective is to design, develop and evaluate a database management system (DBMS) based on the W3C Resource Description Framework (RDF) data model, which must be adapted to the terminals found in Edge computing. The possible applications of such a system are numerous and cover a wide range of sectors such as industry, finance and medicine, to name but a few. As proof of this, the subject of this thesis was defined with the team from the Computer Science and Artificial Intelligence Laboratory (CSAI) at ENGIE Lab CRIGEN. The latter is ENGIE's research and development centre dedicated to green gases (hydrogen, biogas and liquefied gases), new uses of energy in cities and buildings, industry and emerging technologies (digital and artificial intelligence, drones and robots, nanotechnologies and sensors). CSAI financed this thesis as part of a CIFRE-type collaboration.

The functionalities of a system satisfying these characteristics must enable anomalies and exceptional situations to be detected in a relevant and effective way from measurements taken by sensors and/or actuators. In an industrial context, this could mean detecting excessively high measurements, for example of pressure or flow rate in a gas distribution network, which could potentially compromise infrastructure or even the safety of individuals. This detection must be carried out using a user-friendly approach to enable as many users as possible, including non-programmers, to describe risk situations. The approach must therefore be declarative, not procedural, and must be based on a query language, such as SPARQL. We believe that Semantic Web technologies can make a major contribution in this context. Indeed, the ability to infer implicit consequences from explicit data and knowledge is a means of creating new services that are distinguished by their ability to adjust to the circumstances encountered and to make autonomous decisions. This can be achieved by generating new queries in certain alarming situations, or by defining a minimal sub-graph of knowledge that an instance of our DBMS needs in order to respond to all of its queries.

The design of such a DBMS must also take into account the inherent constraints of Edge computing, i.e. the limits in terms of computing capacity, storage, bandwidth and sometimes energy (when the terminal is powered by a solar panel or a battery). Architectural and technological choices must therefore be made to meet these limitations. With regard to the representation of data and knowledge, our design choice fell on succinct data structures (SDS), which offer, among other advantages, the fact that they are very compact and do not require decompression during querying. Similarly, it was necessary to integrate data flow management within our DBMS, for example with support for windowing in continuous SPARQL queries, and for the various services supported by our system. Finally, as anomaly detection is an area where knowledge can evolve, we have integrated support for modifications to the knowledge graphs stored on the client instances of our DBMS. This support translates into an extension of certain SDS structures used in our prototype.
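Purely as an illustration of the declarative style this abstract argues for, the following sketch uses rdflib (a generic Python RDF library, not the thesis's DBMS) and an invented http://example.org/ vocabulary to express a pressure-threshold alert as a SPARQL query rather than as procedural code.

```python
# Hedged sketch with hypothetical data and vocabulary; not the DBMS described
# in the thesis. A threshold on pressure plays the role of the "risk situation"
# that a non-programmer user would describe declaratively.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.sensor1, EX.pressure, Literal(4.2)))   # normal reading (bar)
g.add((EX.sensor2, EX.pressure, Literal(9.7)))   # abnormally high reading

ALERT_QUERY = """
PREFIX ex: <http://example.org/>
SELECT ?sensor ?value WHERE {
    ?sensor ex:pressure ?value .
    FILTER (?value > 8.0)   # declarative threshold, no procedural code
}
"""

for row in g.query(ALERT_QUERY):
    print(f"ALERT: {row.sensor} reports pressure {row.value}")
```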
HE, AIJING. "UNSUPERVISED DATA MINING BY RECURSIVE PARTITIONING." University of Cincinnati / OhioLINK, 2002. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1026406153.
Eberhagen, Niclas. "An investigation of emerging knowledge distribution means and their characterization." Licentiate thesis, Department of Computer and Systems Sciences, Stockholm University, 1999. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-8262.
Licentiate thesis in partial fulfillment of the Licentiate of Philosophy degree in Computer and Systems Sciences, Stockholm University
George, Chadrick Hendrik. "Knowledge management infrastructure and knowledge sharing: The case of a large fast moving consumer goods distribution centre in the Western Cape." Thesis, University of the Western Cape, 2014. http://hdl.handle.net/11394/3943.
The aim of this study is to understand how knowledge is created, shared and used within a fast moving consumer goods (FMCG) distribution centre (DC) in the Western Cape (WC). It also aims to understand knowledge sharing (KS) between individuals in the organisation. A literature review was conducted in order to answer the research questions; it covered the background of knowledge management (KM) and KS and their current status, with particular reference to South Africa's private sector. The study found that technological, cultural and organisational KM infrastructure are important enablers of KS. A conceptual model was developed around these concepts. In order to answer the research questions, the study identified an FMCG DC in the WC where KS is practiced.
Arres, Billel. "Optimisation des performances dans les entrepôts distribués avec Mapreduce : traitement des problèmes de partionnement et de distribution des données." Thesis, Lyon, 2016. http://www.theses.fr/2016LYSE2012.
In this manuscript, we address the problems of data partitioning and distribution for large-scale data warehouses distributed with MapReduce. First, we address the problem of data distribution. In this case, we propose a strategy to optimize data placement on distributed systems, based on the collocation principle. The objective is to optimize query performance through the definition of an intentional data distribution schema that reduces the amount of data transferred between nodes during processing, specifically during MapReduce's shuffle phase. Secondly, we propose a new approach to improve data partitioning and placement in distributed file systems, especially Hadoop-based systems, Hadoop being the standard implementation of the MapReduce paradigm. The aim is to overcome the default data partitioning and placement policies, which do not take any relational data characteristics into account. Our proposal proceeds in two steps. Based on the query workload, it defines an efficient partitioning schema. After that, the system defines a data distribution schema that best meets users' needs by collocating data blocks on the same or closest nodes. The objective in this case is to optimize query execution and parallel processing performance by improving data access. Our third proposal addresses the problem of workload dynamicity, since users' analytical needs evolve over time. In this case, we propose the use of multi-agent systems (MAS) as an extension of our data partitioning and placement approach. Through the autonomy and self-control that characterize MAS, we developed a platform that automatically defines new distribution schemas as new queries are added to the system, and applies data rebalancing according to the new schema. This relieves the system administrator of the burden of managing load balance, besides improving query performance by adopting careful data partitioning and placement policies. Finally, to validate our contributions we conducted a set of experiments to evaluate the different approaches proposed in this manuscript. We studied the impact of intentional data partitioning and distribution on the data warehouse loading phase, the execution of analytical queries, OLAP cube construction, and load balancing. We also defined a cost model that allowed us to evaluate and validate the partitioning strategy proposed in this work.
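As an informal illustration of the collocation principle mentioned above (not the thesis's actual Hadoop mechanism), the following Python sketch buckets fact and dimension rows by a shared join key so that joinable rows land on the same node and need no shuffle; all names and the node-assignment function are hypothetical.

```python
# Minimal collocation sketch under assumed names; not the thesis's implementation.
import zlib
from typing import Dict, List, Tuple

def assign_node(join_key: str, num_nodes: int) -> int:
    """Map a join-key value to a node with a stable hash (crc32)."""
    return zlib.crc32(join_key.encode("utf-8")) % num_nodes

def collocate(fact_rows: List[Tuple[str, dict]],
              dim_rows: List[Tuple[str, dict]],
              num_nodes: int) -> Dict[int, Dict[str, list]]:
    """Bucket fact and dimension rows per node by their shared join key,
    so rows that will be joined never cross the network during the join."""
    placement: Dict[int, Dict[str, list]] = {
        n: {"fact": [], "dim": []} for n in range(num_nodes)
    }
    for key, row in fact_rows:
        placement[assign_node(key, num_nodes)]["fact"].append(row)
    for key, row in dim_rows:
        placement[assign_node(key, num_nodes)]["dim"].append(row)
    return placement
```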
Antoine, Emilien. "Distributed data management with a declarative rule-based language webdamlog." PhD thesis, Université Paris Sud - Paris XI, 2013. http://tel.archives-ouvertes.fr/tel-00933808.
Galicia, Auyón Jorge Armando. "Revisiting Data Partitioning for Scalable RDF Graph Processing Combining Graph Exploration and Fragmentation for RDF Processing Query Optimization for Large Scale Clustered RDF Data RDFPart-Suite: Bridging Physical and Logical RDF Partitioning. Reverse Partitioning for SPARQL Queries: Principles and Performance Analysis. Should We Be Afraid of Querying Billions of Triples in a Graph-Based Centralized System? EXGRAF: Exploration et Fragmentation de Graphes au Service du Traitement Scalable de Requêtes RDF." Thesis, Chasseneuil-du-Poitou, Ecole nationale supérieure de mécanique et d'aérotechnique, 2021. http://www.theses.fr/2021ESMA0001.
The Resource Description Framework (RDF) and SPARQL are very popular graph-based standards initially designed to represent and query information on the Web. The flexibility offered by RDF motivated its use in other domains, and today RDF datasets are great information sources. They gather billions of triples in Knowledge Graphs that must be stored and efficiently exploited. The first generation of RDF systems was built on top of traditional relational databases. Unfortunately, performance in these systems degrades rapidly as the relational model is not suitable for handling RDF data inherently represented as a graph. Native and distributed RDF systems seek to overcome this limitation. The former mainly use indexing as an optimization strategy to speed up queries. Distributed and parallel RDF systems resort to data partitioning. The logical representation of the database is crucial to designing data partitions in the relational model. The logical layer defining the explicit schema of the database provides a degree of comfort to database designers. It lets them choose, manually or automatically (through advisors), the tables and attributes to be partitioned. Besides, it allows the core partitioning concepts to remain constant regardless of the database management system. This design scheme is no longer valid for RDF databases, essentially because the RDF model does not explicitly enforce a schema, since RDF data is mostly implicitly structured. Thus, the logical layer is inexistent and data partitioning depends strongly on the physical implementation of the triples on disk. This situation leads to different partitioning logics depending on the target system, which is quite different from the relational model's perspective. In this thesis, we promote the novel idea of performing data partitioning at the logical level in RDF databases. Thereby, we first process the RDF data graph to support logical entity-based partitioning. After this preparation, we present a partitioning framework built upon these logical structures. This framework is accompanied by data fragmentation, allocation, and distribution procedures. The framework was incorporated into a centralized (RDF_QDAG) and a distributed (gStoreD) triple store. We conducted several experiments that confirmed the feasibility of integrating our framework into existing systems, improving their performance for certain queries. Finally, we designed a set of RDF data partitioning management tools including a data definition language (DDL) and an automatic partitioning wizard.
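To make the idea of logical, entity-based partitioning of schema-less RDF data concrete, here is a small, hypothetical Python sketch (not the RDF_QDAG or gStoreD code) that groups triples by subject and fragments the resulting entities by their characteristic set, i.e. the set of predicates that describe them.

```python
# Illustrative sketch under assumed structures; not the thesis's framework.
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def fragment_by_characteristic_set(
        triples: List[Triple]) -> Dict[FrozenSet[str], List[Triple]]:
    """Group triples into entities by subject, then place each entity in the
    fragment identified by the set of predicates describing it."""
    by_subject: Dict[str, List[Triple]] = defaultdict(list)
    for t in triples:
        by_subject[t[0]].append(t)

    fragments: Dict[FrozenSet[str], List[Triple]] = defaultdict(list)
    for subject, entity_triples in by_subject.items():
        signature = frozenset(p for _, p, _ in entity_triples)
        fragments[signature].extend(entity_triples)
    return fragments
```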
Meiring, Linda. "A distribution model for the assessment of database systems knowledge and skills among second-year university students." Thesis, [Bloemfontein?] : Central University of Technology, Free State, 2009. http://hdl.handle.net/11462/44.
Dasgupta, Arghya. "How can the 'Zeigarnik effect' be combined with analogical reasoning in order to enhance understanding of complex knowledge related to computer science?" Thesis, KTH, Skolan för informations- och kommunikationsteknik (ICT), 2013. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-143636.
Coullon, Hélène. "Modélisation et implémentation de parallélisme implicite pour les simulations scientifiques basées sur des maillages." Thesis, Orléans, 2014. http://www.theses.fr/2014ORLE2029/document.
Parallel scientific computing is an expanding domain of computer science which increases the speed of calculations and offers a way to deal with heavier or more accurate computations. Thus, the interest in scientific computing increases, with more precise results and bigger physical domains to study. In the particular case of scientific numerical simulations, solving partial differential equations (PDEs) is an especially heavy calculation and a perfect candidate for parallel computing. On one hand, it is easier and easier to get access to very powerful parallel machines and clusters, but on the other hand parallel programming is hard to democratize, and most scientists are not able to use these machines. As a result, high-level programming models, frameworks, libraries, languages, etc. have been proposed to hide the technical details of parallel programming. However, in this "implicit parallelism" field, it is difficult to find the right abstraction level while keeping a low programming effort. This thesis proposes a model to write implicit parallelism solutions for numerical simulations such as mesh-based PDE computations. This model is called "Structured Implicit Parallelism for scientific SIMulations" (SIPSim), and proposes an approach at the crossroads of existing solutions, taking advantage of each one. A first implementation of this model is proposed as a C++ library called SkelGIS, for two-dimensional Cartesian meshes. A second implementation of the model, and an extension of SkelGIS, proposes an implicit parallelism solution for network simulations (which deal with simulations involving multiple physical phenomena), and is studied in detail. A performance analysis of both implementations is given on real-case simulations, and it demonstrates that the SIPSim model can be implemented efficiently.
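For readers unfamiliar with mesh-based PDE kernels, the sketch below shows the kind of stencil computation such implicit-parallelism libraries target; it is a plain NumPy example under assumed parameters, not the SkelGIS C++ API.

```python
# Assumed example of a mesh-based PDE kernel; unrelated to the SkelGIS API.
import numpy as np

def heat_step(u: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """One Jacobi update of the 5-point heat-equation stencil on a 2D
    Cartesian mesh; interior cells are updated, boundaries stay fixed."""
    new = u.copy()
    new[1:-1, 1:-1] = u[1:-1, 1:-1] + alpha * (
        u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
        - 4 * u[1:-1, 1:-1]
    )
    return new

# A data-parallel runtime would split `u` into blocks, run heat_step on each
# block concurrently, and exchange one-cell-wide halos between neighbours,
# which is exactly the bookkeeping implicit-parallelism models try to hide.
```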
Hejblum, Boris. "Analyse intégrative de données de grande dimension appliquée à la recherche vaccinale." Thesis, Bordeaux, 2015. http://www.theses.fr/2015BORD0049/document.
Gene expression data is recognized as high-dimensional data that needs specific statistical tools for its analysis. But in the context of vaccine trials, other measures, such as flow-cytometry measurements, are also high-dimensional. In addition, such measurements are often repeated over time. This work is built on the idea that using the maximum of available information, by modeling prior knowledge and integrating all data at hand, will improve the inference and the interpretation of biological results from high-dimensional data. First, we present an original methodological development, Time-course Gene Set Analysis (TcGSA), for the analysis of longitudinal gene expression data, taking into account prior biological knowledge in the form of predefined gene sets. Second, we describe two integrative analyses of two different vaccine studies. The first study reveals lower expression of inflammatory pathways consistently associated with lower viral rebound following a HIV therapeutic vaccine. The second study highlights the role of a testosterone-mediated group of genes linked to lipid metabolism in sex differences in immunological response to a flu vaccine. Finally, we introduce a new model-based clustering approach for the automated treatment of cell populations from flow-cytometry data, namely a Dirichlet process mixture of skew t-distributions, with a sequential posterior approximation strategy for dealing with repeated measurements. Hence, the automatic recognition of the cell populations could allow a practical improvement of the daily work of immunologists as well as a better interpretation of gene expression data after taking into account the frequency of all cell populations.
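As a rough, simplified stand-in for the Dirichlet process mixture described above (the thesis uses skew t-distributions and a sequential posterior approximation, neither of which is shown here), the sketch below uses scikit-learn's truncated Dirichlet-process Gaussian mixture on synthetic two-marker data to illustrate how the number of cell populations can be inferred rather than fixed in advance.

```python
# Hedged, simplified stand-in: Gaussian DP mixture instead of skew t; data is fake.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2-marker flow-cytometry-like data: two artificial cell populations.
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(500, 2)),
    rng.normal(loc=(2.5, 1.5), scale=0.4, size=(300, 2)),
])

dpm = BayesianGaussianMixture(
    n_components=10,  # truncation level, not the final number of populations
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

labels = dpm.predict(X)  # cluster assignment per event; unused components get ~0 weight
```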
PASINI, TOMMASO. "Knowledge-based approaches to producing large-scale training data from scratch for Word Sense Disambiguation and Sense Distribution Learning." Doctoral thesis, 2019. http://hdl.handle.net/11573/1448979.
Boychenko, Serhiy. "A Distributed Analysis Framework for Heterogeneous Data Processing in HEP Environments." Doctoral thesis, 2018. http://hdl.handle.net/10316/90651.
During the last extended maintenance period, CERN's Large Hadron Collider (LHC) and most of its equipment systems were upgraded to collide particles at an energy level almost twice as high as the previous operational limits, significantly increasing the damage potential to accelerator components in case of equipment malfunctioning. System upgrades and the increased machine energy pose new challenges for the analysis of transient data recordings, which has to be both dependable and fast to maintain the required safety level of the deployed machine protection systems while at the same time maximizing the accelerator performance. With the LHC having operated for many years already, statistical and trend analysis across the collected data sets is an additional, growing requirement. The currently deployed accelerator transient data recording and analysis systems will equally require significant upgrades, as the developed architectures, state of the art at the time of their initial development, are already working well beyond the initially provisioned capacities. Despite the fact that modern data storage and processing systems are capable of solving multiple shortcomings of the present solution, the operation of the world's biggest scientific experiment creates a set of unique challenges which require additional effort to be overcome. Among others, the dynamicity and heterogeneity of the data sources and executed workloads pose a significant challenge for modern distributed data analysis solutions to achieve their optimal efficiency. In this thesis, a novel workload-aware approach for distributed file system storage and processing solutions, Mixed Partitioning Scheme Replication, is proposed. Taking into consideration the experience of other researchers in the field and the most popular large-dataset analysis architectures, the developed solution takes advantage of both replication and partitioning in order to improve the efficiency of the underlying engine. The fundamental concept of the proposed approach is multi-criteria partitioning, optimized for the different workload categories observed on the target system. Unlike traditional solutions, the repository replicates data copies with a different structure instead of distributing the exact same representation of the data across the cluster nodes. This approach is expected to be more efficient and flexible in comparison to generically optimized partitioning schemes. Additionally, the partitioning and replication criteria can be dynamically altered in case significant workload changes with respect to the initial assumptions develop over time. The performance of the presented technique was initially assessed through simulations. A specific model which recreated the behavior of the proposed approach and of the original Hadoop system was developed. The main assumption, which allowed the system's behavior to be described for different configurations, is that an application's execution time is linearly related to its input size, as observed during the initial assessment of the distributed data storage and processing solutions. The results of the simulations made it possible to identify the profile of use cases for which Mixed Partitioning Scheme Replication was more efficient in comparison to traditional approaches, and to quantify the expected gains. Additionally, a prototype incorporating the core features of the proposed technique was developed and integrated into the Hadoop source code.

The implementation was deployed on clusters with different characteristics and in-depth performance evaluation experiments were conducted. The workload was generated by a specifically developed and highly configurable application, which in addition monitors the application execution and collects a large set of execution- and infrastructure-related metrics. The obtained results allowed the efficiency of the proposed solution to be studied on an actual physical cluster, using genuine accelerator device data and user requests. In comparison to the traditional approach, Mixed Partitioning Scheme Replication considerably decreased the application execution time and the queue size, while being slightly less efficient with respect to failure tolerance and system scalability. The analysis of the collected measurements proved the superiority of Mixed Partitioning Scheme Replication when compared to the performance of generically optimized partitioning schemes. Despite the fact that only a limited subset of configurations was assessed during the performance evaluation phase, the results validated the simulation observations, allowing the model to be used for further estimations and extrapolations towards the requirements of a full-scale infrastructure.
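The routing idea behind Mixed Partitioning Scheme Replication can be sketched in a few lines; the following Python fragment is a hypothetical illustration (replica names, criteria and query fields are invented), not the Hadoop-integrated prototype described in the abstract.

```python
# Hypothetical sketch: several replicas of the same data set, each partitioned
# by a different criterion; a query is routed to the replica whose partitioning
# lets it scan the fewest partitions.
from typing import Dict, List

# Invented replicas: partitioning criterion -> partition keys held by that copy.
REPLICAS: Dict[str, List[str]] = {
    "by_time":   ["2017-01", "2017-02", "2017-03"],
    "by_device": ["magnet_A", "magnet_B", "collimator_C"],
    "by_signal": ["current", "voltage", "temperature"],
}

def choose_replica(query_predicates: Dict[str, str]) -> str:
    """Pick the replica whose partitioning criterion matches a query predicate,
    so ideally only one partition must be read; fall back to any replica."""
    for criterion in REPLICAS:
        field = criterion.removeprefix("by_")
        if field in query_predicates:
            return criterion
    return next(iter(REPLICAS))

# A query filtering on a device name is served by the device-partitioned copy.
print(choose_replica({"device": "magnet_A", "start": "2017-02-01"}))  # -> "by_device"
```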
CERN
Dlamini, Wisdom Mdumiseni Dabulizwe. "Spatial analysis of invasive alien plant distribution patterns and processes using Bayesian network-based data mining techniques." Thesis, 2016. http://hdl.handle.net/10500/20692.
Environmental Sciences
D. Phil. (Environmental Science)
Stewart-Knox, Barbara, S. Kuznesof, J. Robinson, A. Rankin, K. Orr, M. Duffy, R. Poinhos, et al. "Factors influencing European consumer uptake of personalised nutrition. Results of a qualitative analysis." 2013. http://hdl.handle.net/10454/6205.