Dissertations / Theses on the topic 'Query processing and optimisation'




Consult the top 50 dissertations / theses for your research on the topic 'Query processing and optimisation.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Manolescu, Ioana. "Efficient XML query processing." Habilitation à diriger des recherches, Université Paris Sud - Paris XI, 2009. http://tel.archives-ouvertes.fr/tel-00542801.

Full text
Abstract:
We present work on the theme of efficient XML query evaluation. A first part concerns the optimisation of access to XML data in centralised databases. The second part considers large-scale distributed architectures for sharing XML data.
2

Al-Hoqani, Noura Y. S. "In-network database query processing for wireless sensor networks." Thesis, Loughborough University, 2018. https://dspace.lboro.ac.uk/2134/36226.

Full text
Abstract:
In recent years, smart sensor devices have become mature enough for large, distributed networks of such sensors to be deployed. Such networks can include tens or hundreds of independent nodes that perform their functions without human interaction, such as battery recharging or the configuration of network routes. Each sensor in a wireless sensor network is a microsystem consisting of memory, a processor, transducers and a low-bandwidth, low-range radio transceiver. This study investigates an adaptive sampling strategy for wireless sensors aimed at reducing the number of data samples by sensing data only when a significant change in the monitored process is detected. The detection strategy is based on an extension of Holt's method and a statistical model. To investigate this strategy, the water consumption of a household is used as a case study. A query distribution approach is proposed and presented in detail in chapter 5. Our wireless sensor query engine is programmed on the Sensinode cc2430 testbed. The model implemented on the wireless sensor platform and its architecture are presented in chapters six, seven, and eight. This thesis contributes the design of the experimental simulation setup and the development of the required database-interface GUI sensing system, which enables the end user to send queries to the sensor network whenever needed. The On-Demand Query Sensing system (ODQS) is enhanced with a probabilistic model for sensing only when the system cannot otherwise answer the user's queries. Moreover, a dynamic aggregation methodology is integrated so as to make the system more adaptive to query message costs. A dynamic on-demand approach for aggregated queries is implemented in a wireless sensor network by integrating dynamic programming to find the optimal query decision; the optimality factor in our experiments is the query cost. In-network query processing in wireless sensor networks is discussed in detail in order to develop a more energy-efficient approach to query processing. Initially, a survey of research on existing WSN query processing approaches is presented. Building on this background, the primary novel achievements include an adaptive sampling mechanism and a dynamic query optimiser. These new approaches are extremely helpful when existing statistics are not sufficient to generate an optimal plan. There are two distinct aspects to query processing optimisation: dynamic adaptive query plans, which focus on improving the initial execution of a query, and dynamic adaptive statistics, which provide the best query execution plan to improve subsequent executions of the aggregated on-demand queries requested by multiple end users. In-network query processing is attractive to researchers developing user-friendly sensing systems. Since sensors are resource-limited, battery-powered devices, more robust features are recommended to limit communication access to the sensor nodes in order to maximise sensor lifetime. For this reason, a new architecture that combines a probability modelling technique with dynamic programming (DP) query processing to optimise the communication cost of queries is proposed. In this thesis, a dynamic technique to enhance the query engine for the interactive sensing system interface is developed.

The probability technique is responsible for reducing communication costs for each query executed outside the wireless sensor network. As remote sensors have limited resources and rely on battery power, control strategies should limit communication access to sensor nodes to maximise battery life. We propose an energy-efficient data acquisition system to extend the battery life of nodes in wireless sensor networks. The system considers a graph-based network structure, evaluates multiple query execution plans, and selects the plan with the lowest cost according to an energy consumption model. A genetic algorithm is also used to analyse the performance of the approach. Experimental tests demonstrate the proposed on-demand sensing system's ability to successfully predict the answer to a query injected by the end user, based on the sensor network architecture and the attributes of the input query statement, as well as the query engine's ability to determine a close-to-optimal execution plan given specific constraints on these query attributes. As a result of the above, the thesis contributes to the state of the art in distributed wireless sensor network query design, implementation, analysis, evaluation, performance and optimisation.
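The adaptive sampling strategy summarised above can be illustrated with Holt's double exponential smoothing, the method the thesis extends: a reading is transmitted only when it deviates from the one-step-ahead forecast by more than a threshold. Below is a minimal sketch; the smoothing constants, the threshold, and the example readings are illustrative assumptions, not values from the thesis.

```python
# Sketch of adaptive sampling via Holt's double exponential smoothing.
# alpha, beta and threshold are illustrative assumptions, not the
# values used in the thesis.
def adaptive_sampling(readings, alpha=0.5, beta=0.3, threshold=2.0):
    """Report only readings that deviate from Holt's forecast."""
    level, trend = readings[0], 0.0
    reported = [readings[0]]
    for x in readings[1:]:
        forecast = level + trend                  # one-step-ahead forecast
        if abs(x - forecast) > threshold:
            reported.append(x)                    # significant change: transmit
        prev_level = level                        # standard Holt update
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return reported

# e.g. hourly water-consumption readings from the household case study
print(adaptive_sampling([10.0, 10.2, 10.1, 15.0, 15.2, 15.1, 15.3]))
```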
3

Belghoul, Abdeslem. "Optimizing Communication Cost in Distributed Query Processing." Thesis, Université Clermont Auvergne (2017-2020), 2017. http://www.theses.fr/2017CLFAC025/document.

Full text
Abstract:
In this thesis, we take a complementary look at the problem of optimizing the time for communicating query results in distributed query processing, by investigating the relationship between the communication time and the middleware configuration. Indeed, the middleware determines, among other things, how data is divided into batches and messages before being communicated over the network. Concretely, we focus on the research question: given a query Q and a network environment, what is the best middleware configuration that minimizes the time for transferring the query result over the network? To the best of our knowledge, the database research community does not have well-established strategies for middleware tuning. We first present an intensive experimental study that emphasizes the crucial impact of middleware configuration on the time for communicating query results. We focus on two middleware parameters that we empirically identified as having an important influence on the communication time: (i) the fetch size F (i.e., the number of tuples in a batch that is communicated at once to an application consuming the data) and (ii) the message size M (i.e., the size in bytes of the middleware buffer, which corresponds to the amount of data that can be communicated at once from the middleware to the network layer; a batch of F tuples can be communicated via one or several messages of M bytes). Then, we describe a cost model for estimating the communication time, which is based on how data is communicated between computation nodes. Precisely, our cost model is based on two crucial observations: (i) batches and messages are communicated differently over the network: batches are communicated synchronously, whereas messages in a batch are communicated in a pipeline (asynchronously), and (ii) due to network latency, it is more expensive to communicate the first message in a batch than any other message in the same batch. We propose an effective strategy for calibrating the network-dependent parameters of the communication time estimation function, i.e., the costs of the first message and of a non-first message in its batch. Finally, we develop an optimization algorithm to effectively compute the values of the middleware parameters F and M that minimize the communication time. The proposed algorithm quickly finds (in a small fraction of a second) values of the middleware parameters F and M that represent a good trade-off between low resource consumption and low communication time. The proposed approach has been evaluated using a dataset from an astronomy application.
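The two observations underlying this cost model suggest a simple analytic form: a result of R tuples is fetched as ceil(R/F) synchronous batches, each batch is shipped as a pipeline of ceil(F*t/M) messages, and the first message of each batch pays the network latency. Here is a hedged sketch of such an estimation function; the weights c_first and c_next are placeholders for the network-dependent parameters the thesis calibrates, and the exact formula in the thesis may differ.

```python
import math

def communication_time(R, tuple_size, F, M, c_first=0.010, c_next=0.001):
    """Estimated time (s) to ship R tuples of tuple_size bytes.

    F: fetch size (tuples per batch); M: message size (bytes).
    c_first / c_next: calibrated costs of the first / subsequent
    messages of a batch -- placeholder, network-dependent values.
    """
    batches = math.ceil(R / F)
    msgs_per_batch = math.ceil(F * tuple_size / M)
    # Batches are synchronous; within a batch, messages are pipelined
    # and only the first one pays the full network latency.
    return batches * (c_first + (msgs_per_batch - 1) * c_next)

# Larger F means fewer synchronous round trips, but more buffering:
for F in (100, 1_000, 10_000):
    print(F, round(communication_time(1_000_000, 100, F, M=64 * 1024), 3))
```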
4

Mesmoudi, Amin. "Declarative parallel query processing on large scale astronomical databases." Thesis, Lyon 1, 2015. http://www.theses.fr/2015LYO10326.

Full text
Abstract:
This work is carried out in the framework of the PetaSky project. The objective of this project is to provide a set of tools for managing petabytes of data from astronomical observations. Our work is concerned with the design of a scalable approach. We first started by analyzing the ability of MapReduce-based systems supporting SQL to manage the LSST data and to provide optimization capabilities for certain types of queries. We analyzed the impact of data partitioning, indexing and compression on query performance. From our experiments, it follows that there is no "magic" technique to partition, store and index data; the efficiency of dedicated techniques depends mainly on the type of queries and the typology of data that are considered. Based on our benchmarking work, we identified some techniques to be integrated into large-scale data management systems. We designed a new system that supports multiple partitioning mechanisms and several evaluation operators. We used the BSP (Bulk Synchronous Parallel) model as a parallel computation paradigm. Unlike the MapReduce model, we send intermediate results to workers that can continue their processing. Data is logically represented as a graph. The evaluation of queries is performed by exploring the data graph using forward and backward edges. We also offer a semi-automatic partitioning approach, i.e., we provide the system administrator with a set of tools allowing her/him to choose the manner of partitioning data using the schema of the database and domain knowledge. The first experiments show that our approach provides a significant performance improvement with respect to MapReduce systems.
5

Oğuz, Damla. "Méthodes d'optimisation pour le traitement de requêtes réparties à grande échelle sur des données liées." Thesis, Toulouse 3, 2017. http://www.theses.fr/2017TOU30067/document.

Full text
Abstract:
Linked Data is a term that defines a set of best practices for publishing and interlinking structured data on the Web. As the number of Linked Data providers increases, the Web becomes a huge global data space. Query federation is one of the approaches for efficiently querying this distributed data space. It is employed via a federated query engine which aims to minimize the response time and the completion time. Response time is the time to generate the first result tuple, whereas completion time refers to the time to provide all result tuples. There are three basic steps in a federated query engine: data source selection, query optimization, and query execution. This thesis contributes to the subject of query optimization for query federation. Most studies focus on static query optimization, which generates the query plans before execution and needs statistics. However, the Linked Data environment has several difficulties, such as unpredictable data arrival rates and unreliable statistics. As a consequence, static query optimization can produce inefficient execution plans. These constraints show that adaptive query optimization should be used for federated query processing on Linked Data. In this thesis, we first propose an adaptive join operator which aims to minimize the response time and the completion time for federated queries over SPARQL endpoints. Second, we extend the first proposal to further reduce the completion time. Both proposals can change the join method and the join order during execution by using adaptive query optimization. The proposed operators can handle different data arrival rates of relations and the lack of statistics about them. The performance evaluation in this thesis shows the efficiency of the proposed adaptive operators. They provide faster completion times and almost the same response times, compared to symmetric hash join. Compared to bind join, the proposed operators perform substantially better with respect to response time and can also provide faster completion times. In addition, the second proposed operator provides considerably faster response times than bind-bloom join and can improve the completion time as well. The second proposal also provides faster completion times than the first in all conditions. In conclusion, the proposed adaptive join operators provide the best trade-off between response time and completion time. Even though our main objective is to manage different data arrival rates of relations, the performance evaluation reveals that they are successful with both fixed and varying data arrival rates.
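Symmetric hash join, the pipelined baseline this evaluation compares against, maintains a hash table per input and probes the opposite table for every arriving tuple, so results are produced as soon as data arrives from either endpoint. A minimal sketch, with an interleaved input stream as an assumed representation of the two endpoints' arrivals:

```python
from collections import defaultdict

def symmetric_hash_join(stream, key):
    """stream yields ('L', tuple) or ('R', tuple) in arrival order."""
    tables = {'L': defaultdict(list), 'R': defaultdict(list)}
    for side, t in stream:
        k = key(t)
        tables[side][k].append(t)            # insert into own table
        other = 'R' if side == 'L' else 'L'
        for m in tables[other].get(k, []):   # probe the other table
            yield (t, m) if side == 'L' else (m, t)

# Tuples arriving at different rates from two SPARQL endpoints:
arrivals = [('L', (1, 'a')), ('R', (1, 'x')), ('R', (2, 'y')), ('L', (2, 'b'))]
print(list(symmetric_hash_join(iter(arrivals), key=lambda t: t[0])))
# [((1, 'a'), (1, 'x')), ((2, 'b'), (2, 'y'))]
```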
6

Gillani, Syed. "Semantically-enabled stream processing and complex event processing over RDF graph streams." Thesis, Lyon, 2016. http://www.theses.fr/2016LYSES055/document.

Full text
Abstract:
There is a paradigm shift in the nature and processing of today's data: data used to be mostly static, stored in large databases to be queried. Today, with the advent of new applications and means of collecting data, most applications on the Web and in enterprises produce data in a continuous manner in the form of streams. Thus, the users of these applications expect to process large volumes of data with fresh, low-latency results. This has resulted in the introduction of Data Stream Management Systems (DSMSs) and the Complex Event Processing (CEP) paradigm, both with distinctive aims: DSMSs are mostly employed to process traditional query operators (mostly stateless), while CEP systems focus on temporal pattern matching (stateful operators) to detect changes in the data that can be thought of as events. In the past decade or so, a number of scalable and performance-intensive DSMSs and CEP systems have been proposed. Most of them, however, are based on relational data models, which raises the question of support for heterogeneous data sources, i.e., the variety of the data. Work on RDF stream processing (RSP) systems partly addresses the challenge of variety by promoting the RDF data model. Nonetheless, challenges like volume and velocity are overlooked by existing approaches. These challenges require customised optimisations which consider RDF as a first-class citizen and scale the process of continuous graph pattern matching. To gain insights into these problems, this thesis focuses on developing scalable RDF graph stream processing and semantically-enabled CEP systems (i.e., Semantic Complex Event Processing, SCEP). In addition to our optimised algorithmic and data structure methodologies, we also contribute to the design of a new query language for SCEP. Our contributions in these two fields are as follows: • RDF Graph Stream Processing. We first propose an RDF graph stream model, where each data item/event within streams comprises an RDF graph (a set of RDF triples). Second, we implement customised indexing techniques and data structures to continuously process RDF graph streams in an incremental manner. • Semantic Complex Event Processing. We extend the idea of RDF graph stream processing to enable SCEP over such RDF graph streams, i.e., temporal pattern matching. Our first contribution in this context is a new query language that encompasses the RDF graph stream model and employs a set of expressive temporal operators such as sequencing, Kleene-+, negation, optional, conjunction, disjunction and event selection strategies. Based on this, we implement a scalable system that employs a non-deterministic finite automaton model to evaluate these operators in an optimised manner. We leverage techniques from diverse fields, such as relational query optimisation, incremental query processing, and sensor and social networks, in order to solve real-world problems. We have applied our proposed techniques to a wide range of real-world and synthetic datasets to extract knowledge from RDF structured data in motion. Our experimental evaluations confirm our theoretical insights and demonstrate the viability of our proposed methods.
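The automaton-based evaluation of temporal operators mentioned above can be pictured on the simplest case, SEQ(A, B): each 'A' event opens a partial match (an NFA branch) and each later 'B' event completes every open branch. The sketch below reduces each RDF graph event to a label and omits predicates, windows and graph pattern matching; it illustrates the principle, not the thesis's system.

```python
def seq_matches(events, first, second):
    """Detect SEQ(first, second) under a skip-till-any-match policy."""
    open_runs, matches = [], []
    for i, label in enumerate(events):
        if label == second:
            matches += [(j, i) for j in open_runs]  # complete open branches
        if label == first:
            open_runs.append(i)                     # open a new NFA branch
    return matches

# e.g. event types extracted from a stream of RDF graph events
print(seq_matches(['A', 'C', 'B', 'A', 'B'], 'A', 'B'))
# [(0, 2), (0, 4), (3, 4)]
```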
7

Alrammal, Muath. "Algorithms for XML stream processing : massive data, external memory and scalable performance." PhD thesis, Université Paris-Est, 2011. http://tel.archives-ouvertes.fr/tel-00779309.

Full text
Abstract:
Many modern applications require processing massive streams of XML data, which creates technical challenges. Among them are the design and implementation of tools to optimise the processing of XPath queries and to provide an accurate estimate of the cost of these queries when processed over a massive XML data stream. In this thesis, we propose a new performance prediction model which estimates a priori the cost (in terms of space used and time elapsed) of structural Forward XPath queries. In doing so, we carry out an experimental study to confirm the linear relationship between stream processing and data access resources. Consequently, we present a mathematical model (linear regression functions) to predict the cost of a given XPath query. In addition, we present a novel selectivity estimation technique. It consists of two elements. The first is the path tree synopsis: a concise and accurate representation of the structure of an XML document. The second is the selectivity estimation algorithm: an efficient streaming algorithm which traverses the path tree synopsis to estimate the values of the cost parameters. These parameters are used by the mathematical model to determine the cost of a given XPath query. We compare the performance of our model with existing approaches. Furthermore, we present a use case of an online stream-querying system. The system uses our performance prediction model to estimate the cost (in terms of time/memory) of a given XPath query. Moreover, it provides an accurate answer to the query issuer. This use case illustrates the practical benefits of performance management with our techniques.
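The mathematical model described above is linear: measured cost parameters are mapped to time (or memory) by linear regression functions. A sketch of the fitting and prediction step, with invented training data standing in for the parameter values that the thesis's selectivity estimation algorithm would extract from the path tree synopsis:

```python
import numpy as np

# Invented training data: per past query, cost parameters estimated
# from the path tree synopsis and the measured processing time.
X = np.array([[1_000, 50], [5_000, 120], [20_000, 800], [60_000, 2_500]])
y = np.array([0.8, 3.9, 16.2, 49.5])                 # seconds

A = np.column_stack([np.ones(len(X)), X])            # time ~ w0 + w1*x1 + w2*x2
w, *_ = np.linalg.lstsq(A, y, rcond=None)            # ordinary least squares

def predict_cost(p):
    """A priori time estimate for a query with cost parameters p."""
    return w[0] + w[1] * p[0] + w[2] * p[1]

print(round(predict_cost((30_000, 1_000)), 2))
```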
8

Phan, Duy-Hung. "Algorithmes d'aggrégation pour applications Big Data." Electronic Thesis or Diss., Paris, ENST, 2016. http://www.theses.fr/2016ENST0043.

Full text
Abstract:
Traditional databases face problems of scalability and efficiency when dealing with vast amounts of big data. Thus, modern data management systems that scale to thousands of nodes, like Apache Hadoop and Spark, have emerged and become the de facto platforms for processing data at massive scale. In such systems, many data processing optimizations that were well studied in the database domain have become ineffective because of the novel architectures and programming models. In this context, this dissertation sets out to optimize one of the most predominant operations in data processing: data aggregation for such systems. Our main contributions are logical and physical optimizations for large-scale data aggregation, including several algorithms and techniques. These optimizations are so intimately related that without one or the other, the data aggregation optimization problem would not be solved entirely. Moreover, we integrated these optimizations in our multi-query optimization engine, which is totally transparent to users. The engine and the logical and physical optimizations proposed in this dissertation form a complete package that is runnable and ready to answer data aggregation queries at massive scale. We evaluated our optimizations both theoretically and experimentally. The theoretical analyses show that our algorithms and techniques are much more scalable and efficient than prior work. The experimental results, using a real cluster with synthetic and real datasets, confirm our analyses, show a significant performance boost and reveal various aspects of our work. Last but not least, our work has been published as open source for public use and study.
9

Camacho, Rodriguez Jesus. "Efficient techniques for large-scale Web data management." Thesis, Paris 11, 2014. http://www.theses.fr/2014PA112229/document.

Full text
Abstract:
The recent development of commercial cloud computing environments has strongly impacted research and development in distributed software platforms. Cloud providers offer a distributed, shared-nothing infrastructure that may be used for data storage and processing. In parallel with the development of cloud platforms, programming models that seamlessly parallelize the execution of data-intensive tasks over large clusters of commodity machines have received significant attention, starting with the now well-known MapReduce model and continuing with other novel and more expressive frameworks. As these models are increasingly used to express analytical-style data processing tasks, the need arises for higher-level languages that ease the burden of writing complex queries for these systems. This thesis investigates the efficient management of Web data on large-scale infrastructures. In particular, we study the performance and cost of exploiting cloud services to build Web data warehouses, and the parallelization and optimization of query languages that are tailored towards querying Web data declaratively. First, we present AMADA, an architecture for warehousing large-scale Web data in commercial cloud platforms. AMADA operates in a Software as a Service (SaaS) approach, allowing users to upload, store, and query large volumes of Web data. Since cloud users bear monetary costs directly connected to their consumption of resources, our focus is not only on query performance from an execution time perspective, but also on the monetary costs associated with this processing. In particular, we study the applicability of several content indexing strategies, and show that they lead not only to reduced query evaluation time, but also, importantly, to reduced monetary costs associated with the exploitation of the cloud-based warehouse. Second, we consider the efficient parallelization of the execution of complex queries over XML documents, implemented within our system PAXQuery. We provide novel algorithms showing how to translate such queries into plans expressed in the PArallelization ConTracts (PACT) programming model. These plans are then optimized and executed in parallel by the Stratosphere system. We demonstrate the efficiency and scalability of our approach through experiments on hundreds of GB of XML data. Finally, we present a novel approach for identifying and reusing common subexpressions occurring in Pig Latin scripts. In particular, we lay the foundation of our reuse-based algorithms by formalizing the semantics of the Pig Latin query language with an extended nested relational algebra for bags. Our algorithm, named PigReuse, operates on the algebraic representations of Pig Latin scripts, identifies subexpression merging opportunities, selects the best ones to execute based on a cost function, and merges other equivalent expressions to share their results. We bring several extensions to the algorithm to improve its performance. Our experimental results demonstrate the efficiency and effectiveness of our reuse-based algorithms and optimization strategies.
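The heart of a reuse-based optimiser such as PigReuse is recognising that two algebraic subplans are equivalent so that one execution can be shared. A toy sketch of that detection step: operator trees are put into a canonical form (sorting the children of commutative operators) and hashed; the cost-based selection of which merges to apply is omitted, and the operator names are invented for illustration.

```python
def canon(plan):
    """Canonical form of an operator tree (op, *children)."""
    if not isinstance(plan, tuple):
        return plan
    op, *rest = plan
    children = tuple(canon(c) for c in rest)
    if op in ('join', 'union'):                  # commutative operators
        children = tuple(sorted(children, key=repr))
    return (op,) + children

def common_subexpressions(plans):
    seen, shared = set(), set()
    def walk(p):
        if not isinstance(p, tuple):
            return
        key = canon(p)
        (shared if key in seen else seen).add(key)
        for child in p[1:]:
            walk(child)
    for p in plans:
        walk(p)
    return shared

# Two scripts filtering the same input: both the shared filter and the
# load beneath it are detected as reusable subexpressions.
p1 = ('filter', ('load', 'clicks'), 'country == "FR"')
p2 = ('join', ('filter', ('load', 'clicks'), 'country == "FR"'),
              ('load', 'users'))
print(common_subexpressions([p1, p2]))
```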
10

Geng, Ke. "XML semantic query optimisation." Thesis, University of Auckland, 2011. http://hdl.handle.net/2292/6815.

Full text
Abstract:
XML Semantic Query Optimisation (XSQO) is a method that optimises the execution of queries based on semantic constraints extracted from XML documents. Currently, most research into XSQO concentrates on optimisation based on structural constraints in the XML documents. Research that optimises XML query execution based on semantic constraints has been limited because of the flexibility of XML. In this thesis, we introduce a method that optimises XML query execution based on constraints on the content of XML documents. In our method, elements are analysed and classified based on the distribution of the values of sub-elements. Information about the classification is extracted and represented in OWL, which is stored in the database together with the XML document. The user's input XML query is evaluated and, based on the element classification information, transformed into a new query that executes faster and returns exactly the same results. There are three kinds of transformation that may be carried out in our method: Elimination, which blocks queries that cannot return results; Reduction, which simplifies the query conditions by removing redundant conditions; and Introduction, which reduces the search area by introducing a new query condition. Two engines were designed and built for this research. The data analysis engine analyses the XML documents and classifies the specified elements. The query transformation engine evaluates the input XML queries and carries out the query transformation automatically based on the classification information. A case study was carried out with the data analysis engine and a series of experiments was conducted with the query transformation engine. The results show that: a. XML documents can be analysed and elements classified using our method, and the classification results satisfy the requirements of XML query transformation; b. content-based XML query transformation can improve XML query execution performance by about 20% to 30%. In this thesis, we also introduce a data generator, which was designed and built to support the research. With this generator, users can build semantic information into an XML dataset with a specified structure, size and selectivity. A case study with the generator shows that it satisfies the requirements of content-based XSQO research.
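The three transformations can be pictured as rewrite rules over a query's set of conditions, driven by the classification information (here caricatured as learned value ranges of sub-elements). The sketch below covers Elimination and Reduction; Introduction, which would add a narrowing condition, is omitted. The element paths and ranges are invented for illustration.

```python
# Invented classification info: learned value ranges per element path.
RANGES = {'book/price': (5, 200), 'book/year': (1990, 2020)}

def transform(conditions):
    """conditions: list of (path, op, value) with op in {'<', '>'}."""
    kept = []
    for path, op, val in conditions:
        lo, hi = RANGES.get(path, (float('-inf'), float('inf')))
        if (op == '<' and val <= lo) or (op == '>' and val >= hi):
            return None          # Elimination: no element can qualify
        if (op == '<' and val > hi) or (op == '>' and val < lo):
            continue             # Reduction: condition always holds
        kept.append((path, op, val))
    return kept

print(transform([('book/price', '<', 3)]))       # None -> query blocked
print(transform([('book/price', '<', 500),       # redundant, removed
                 ('book/year', '>', 2000)]))     # [('book/year', '>', 2000)]
```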
11

Papadopoulos, Stavros. "Authenticated query processing /." View abstract or full-text, 2010. http://library.ust.hk/cgi/db/thesis.pl?CSED%202010%20PAPADO.

Full text
12

Khelil, Abdallah. "Gestion et optimisation des données massives issues du Web Combining graph exploration and fragmentation for scalable rdf query processing Should We Be Afraid of Querying Billions of Triples in a Graph-Based Centralized System? EXGRAF : Exploration et Fragmentation de Graphes au Service du Traitement Scalable de Requêtes RDF." Thesis, Chasseneuil-du-Poitou, Ecole nationale supérieure de mécanique et d'aérotechnique, 2020. http://www.theses.fr/2020ESMA0009.

Full text
Abstract:
Big Data represents a challenge not only for the socio-economic world but also for scientific research. Indeed, as has been pointed out in several scientific articles and strategic reports, modern computer applications are facing new problems and issues that are mainly related to the storage and the exploitation of data generated by modern observation and simulation instruments. The management of such data represents a real bottleneck which has the effect of slowing down the exploitation of the various data collected not only in the framework of international scientific programs but also by companies, the latter relying increasingly on the analysis of large-scale data. Much of this data is published today on the Web. Indeed, we are witnessing an evolution of the traditional web, designed basically to manage documents, towards a web of data that offers mechanisms for querying semantic information. Several data models have been proposed to represent this information on the Web. The most important is the Resource Description Framework (RDF), which provides a simple and abstract knowledge representation for resources on the Web. Each semantic Web fact can be encoded as an RDF triple. In order to explore and query structured information expressed in RDF, several query languages have been proposed over the years. In 2008, SPARQL became the official W3C Recommendation language for querying RDF data. The need to efficiently manage and query RDF data has led to the development of new systems specifically designed to process this data format. These approaches can be categorized as centralized, relying on a single machine to manage RDF data, or distributed, combining multiple machines connected by a computer network. Some of these approaches are based on an existing data management system such as Virtuoso and Jena; others rely on a system specifically designed for the management of RDF triples, such as GRIN, RDF3X and gStore. With the evolution of RDF datasets (e.g. DBPedia) and SPARQL, most systems have become obsolete and/or inefficient. For example, no existing centralized system is able to manage the 1 billion triples provided under the WatDiv benchmark. Distributed systems would, under certain conditions, improve on this point, but at the price of a considerable performance degradation. In this PhD thesis, we propose the centralized system "RDF_QDAG", which finds a good compromise between scalability and performance. We propose to combine physical data fragmentation and data graph exploration. "RDF_QDAG" supports multiple types of queries, based not only on basic graph patterns but also incorporating filters based on regular expressions as well as aggregation and sorting functions. "RDF_QDAG" relies on the Volcano execution model, which allows controlling the main memory and avoiding any overflow, even if the hardware configuration is limited. To the best of our knowledge, "RDF_QDAG" is the only centralized system that maintains good performance when managing several billion triples. We compared this system with other systems that represent the state of the art in RDF data management: a relational approach (Virtuoso), a graph-based approach (g-Store), an intensive indexing approach (RDF-3X) and two parallel approaches (CliqueSquare and g-Store-D). "RDF_QDAG" surpasses existing systems when it comes to ensuring both scalability and performance.
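The Volcano execution model that RDF_QDAG relies on pulls tuples one at a time through a tree of operators, which is what keeps main-memory use bounded regardless of data size. A generic illustration using Python generators; RDF_QDAG's actual operators (graph exploration over fragments) are of course far richer.

```python
# Volcano-style pipeline: each operator pulls from its child on demand,
# so no intermediate result is ever fully materialised.
def scan(triples):                        # leaf operator
    yield from triples

def select(child, predicate):             # filter operator
    for t in child:
        if predicate(t):
            yield t

def project(child, positions):            # projection operator
    for t in child:
        yield tuple(t[i] for i in positions)

triples = [('s1', 'type', 'Star'), ('s1', 'mag', 4.2),
           ('s2', 'type', 'Galaxy')]
plan = project(select(scan(triples), lambda t: t[1] == 'type'), (0, 2))
for row in plan:                          # tuples flow one at a time
    print(row)                            # ('s1', 'Star') ('s2', 'Galaxy')
```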
13

Lu, Yu-En. "Distributed proximity query processing." Thesis, University of Cambridge, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.612165.

Full text
14

He, Bingsheng. "Cache-oblivious query processing /." View abstract or full-text, 2008. http://library.ust.hk/cgi/db/thesis.pl?CSED%202008%20HE.

Full text
15

Jiang, Zhewei. "On XML query processing /." Available to subscribers only, 2008. http://proquest.umi.com/pqdweb?did=1757062851&sid=2&Fmt=2&clientId=1509&RQT=309&VName=PQD.

Full text
Abstract:
Thesis (Ph. D.)--Southern Illinois University Carbondale, 2008.
"Department of Electrical and Computer Engineering." Keywords: XML, Queries, Processing, Query processing. Includes bibliographical references (p. 58-65). Also available online.
16

Marathe, Arunprasad Prabhakar. "Query processing techniques for arrays." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2001. http://www.collectionscanada.ca/obj/s4/f2/dsk3/ftp04/NQ60556.pdf.

Full text
17

Stokes, Alan Barry. "Resilient sensor network query processing." Thesis, University of Manchester, 2014. https://www.research.manchester.ac.uk/portal/en/theses/resilient-sensor-network-query-processing(208a729a-5d48-47a9-b1f5-3d156932e197).html.

Full text
Abstract:
Sensor networks comprise a collection of resource-constrained, low-cost, sometimes fragile wireless motes which can gather information about their surroundings through sensors, and can be conceived of as a distributed computing platform for applications ranging from event detection to environmental monitoring. A Sensor Network Query Processor (SNQP) is a means of collecting data from sensor networks where the requirements are defined using a declarative query language with a set of Quality of Service (QoS) expectations. As sensor networks are often deployed in hostile environments, there is a high possibility that motes break or that the communication links between them become unreliable. SNQP Query Execution Plans (QEPs) are often optimised for a specific network deployment and are designed to be as energy-efficient as possible while ensuring they meet the QoS expectations; yet little has been done to handle the situation where the deployment itself has changed since optimisation in such a way as to make the original QEP inefficient, or unable to operate. In this respect, previous work on SNQPs has not aimed at resilience to failures of the assumptions made at compilation/optimisation time, failures which result in a QEP terminating earlier than expected. This dissertation presents a collection of approaches that embed resilience into SNQP-generated QEPs so that a QEP operates for longer while still meeting the QoS expectations demanded of it, thereby resulting in a more reliable platform applicable to a broader range of applications. The research contributions reported here include (a) a strategy designed to adapt to predictable node failures due to energy depletion; (b) a collection of strategies designed to adapt to unpredictable node failures; (c) a strategy designed to handle unreliable communication channels; and (d) an empirical evaluation showing the benefits of a resilient SNQP in relation to a representative non-resilient SNQP.
18

Bondiombouy, Carlyna. "Query Processing in Multistore Systems." Thesis, Montpellier, 2017. http://www.theses.fr/2017MONTS056/document.

Full text
Abstract:
Cloud computing is having a major impact on data management, with a proliferation of new, scalable data management solutions such as distributed file and object storage, NoSQL databases and big data processing frameworks. This also leads to a wide diversification of DBMS interfaces and the loss of a common programming paradigm, making it very hard for a user to integrate data sitting in specialized data stores, e.g., relational, document and graph data stores. In this thesis, we address the problem of query processing with multiple cloud data stores, where the data stores have different models, languages and APIs. This thesis has been prepared in the context of the CoherentPaaS European project and, in particular, the CloudMdsQL multistore system. CloudMdsQL is a functional query language able to exploit the full power of local data stores, by simply allowing some local data store native queries to be called as functions while at the same time being optimized, e.g., by pushing down select predicates, using bind join, performing join ordering, or planning intermediate data shipping. In this thesis, we propose an extension of CloudMdsQL that takes full advantage of the functionality of underlying data processing frameworks such as Spark by allowing the ad hoc usage of user-defined map/filter/reduce (MFR) operators in combination with traditional SQL statements. This allows performing joins between relational data and HDFS big data. Our solution allows for optimization by enabling subquery rewriting so that bind join can be used and filter conditions can be pushed down and applied by the data processing framework as early as possible. We validated our solution by implementing the MFR extension as part of the CloudMdsQL query engine. Based on this prototype, we provide an experimental validation of multistore query processing in a cluster to evaluate the impact of optimization on performance. More specifically, we explore the performance benefit of using bind join and select pushdown under different conditions. Overall, our performance evaluation illustrates the CloudMdsQL query engine's ability to optimize a query and choose the most efficient execution strategy.
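Bind join, one of the optimisations named above, evaluates the outer side first and ships the distinct join-key values into the inner store's native query, so the inner store returns only matching rows. A language-neutral sketch; query_inner is a hypothetical stand-in for a bound native query (e.g. an SQL IN-list, or a filter pushed into an MFR job), not the CloudMdsQL API.

```python
def bind_join(outer_rows, key, query_inner):
    """query_inner(bindings) fetches only inner rows whose join key
    is among `bindings` (hypothetical helper, not CloudMdsQL API)."""
    bindings = {row[key] for row in outer_rows}
    index = {}
    for r in query_inner(bindings):              # single bound inner query
        index.setdefault(r[key], []).append(r)
    for row in outer_rows:
        for match in index.get(row[key], []):
            yield {**row, **match}

outer = [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
inner_store = [{'id': 1, 'score': 9}, {'id': 3, 'score': 7}]
query_inner = lambda keys: (r for r in inner_store if r['id'] in keys)
print(list(bind_join(outer, 'id', query_inner)))
# [{'id': 1, 'name': 'a', 'score': 9}]
```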
19

Prasher, Sham. "Query processing in multiresolution spatial databases /." [St. Lucia, Qld.], 2005. http://www.library.uq.edu.au/pdfserve.php?image=thesisabs/absthe18682.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
20

Yiu, Man-lung. "Advanced query processing on spatial networks." Click to view the E-thesis via HKUTO, 2006. http://sunzi.lib.hku.hk/hkuto/record/B36279365.

Full text
APA, Harvard, Vancouver, ISO, and other styles
21

Chen, Yingwen. "XQuery Query Processing in Relational Systems." Thesis, University of Waterloo, 2004. http://hdl.handle.net/10012/1201.

Full text
Abstract:
With the rapid growth of XML documents as a popular and major medium for storage and interchange of data on the Web, there is increasing interest in using existing traditional relational database techniques to store and/or query XML data. Since XQuery is becoming a standard XML query language, significant effort has been made in developing an efficient and comprehensive XQuery-to-SQL query processor. In this thesis, we design and implement an XQuery-to-SQL query processor based on the Dynamic Intervals approach. We also provide a comprehensive translation for XQuery basic operations and FLWR expressions. The query processor is able to translate a complex XQuery query, which might include arbitrarily composed and nested FLWR expressions, basic functions, and element constructors, into a single SQL query for an RDBMS and a physical plan for the XQuery-enhanced Relational Engine. In order to produce efficient and concise SQL queries, succinct XQuery-to-SQL translation templates and optimization algorithms for the SQL query generation are proposed and implemented. A preferable merge-join approach is also proposed to avoid the inefficient nested-loop evaluation of FLWR expressions. Merge-join patterns and query rewriting rules are designed to identify XQuery fragments that can utilize the efficient merge-join evaluation. Proofs of correctness of the approach are provided in the thesis. Experimental results confirm the correctness of our work.
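The Dynamic Intervals approach builds on interval (start/end) encodings of XML trees, which turn structural axes into arithmetic comparisons that a relational engine can evaluate with ordinary joins. A toy version of the underlying encoding (our simplification, assuming nodes expose tag and children; not the thesis's exact scheme):

    def encode(root):
        # Assign (tag, start, end, level) by a depth-first walk, so that
        # a is an ancestor of d  iff  a.start < d.start and d.end <= a.end.
        counter = 0
        table = []
        def walk(node, level):
            nonlocal counter
            start = counter
            counter += 1
            for child in node.children:
                walk(child, level + 1)
            counter += 1
            table.append((node.tag, start, counter, level))
        walk(root, 0)
        return table

    def is_ancestor(a, d):
        _, a_start, a_end, _ = a
        _, d_start, d_end, _ = d
        return a_start < d_start and d_end <= a_end

Once the document is shredded into such a table, an XPath step like a//b becomes a self-join on the interval columns, which is exactly the kind of query a relational optimizer knows how to execute.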
APA, Harvard, Vancouver, ISO, and other styles
22

Nagel, Fabian Oliver. "Efficient query processing in managed runtimes." Thesis, University of Edinburgh, 2015. http://hdl.handle.net/1842/15869.

Full text
Abstract:
This thesis presents strategies to improve the query evaluation performance over huge volumes of relational-like data that is stored in the memory space of managed applications. Storing and processing application data in the memory space of managed applications is motivated by the convergence of two recent trends in data management. First, dropping DRAM prices have led to memory capacities that allow the entire working set of an application to fit into main memory and to the emergence of in-memory database systems (IMDBs). Second, language-integrated query transparently integrates query processing syntax into programming languages and, therefore, allows complex queries to be composed in the application. IMDBs typically serve as data stores to applications written in an object-oriented language running on a managed runtime. In this thesis, we propose a deeper integration of the two by storing all application data in the memory space of the application and using language-integrated query, combined with query compilation techniques, to provide fast query processing. As a starting point, we look into storing data as runtime-managed objects in collection types provided by the programming language. Queries are formulated using language-integrated query and dynamically compiled to specialized functions that produce the result of the query in a more efficient way by leveraging query compilation techniques similar to those used in modern database systems. We show that the generated query functions significantly improve query processing performance compared to the default execution model for language-integrated query. However, we also identify additional inefficiencies that can only be addressed by processing queries using low-level techniques which cannot be applied to runtime-managed objects. To address this, we introduce a staging phase in the generated code that makes query-relevant managed data accessible to low-level query code. Our experiments in .NET show an improvement in query evaluation performance of up to an order of magnitude over the default language-integrated query implementation. Motivated by additional inefficiencies caused by automatic garbage collection, we introduce a new collection type, the black-box collection. Black-box collections integrate the in-memory storage layer of a relational database system to store data and hide the internal storage layout from the application by employing existing object-relational mapping techniques (hence, the name black-box). Our experiments show that black-box collections provide better query performance than runtime-managed collections by allowing the generated query code to directly access the underlying relational in-memory data store using low-level techniques. Black-box collections also outperform a modern commercial database system. By removing huge volumes of collection data from the managed heap, black-box collections further improve the overall performance and response time of the application and improve the application’s scalability when facing huge volumes of collection data. To enable a deeper integration of the data store with the application, we introduce self-managed collections. Self-managed collections are a new type of collection for managed applications that, in contrast to black-box collections, store objects. As the data elements stored in the collection are objects, they are directly accessible from the application using references which allows for better integration of the data store with the application. 
Self-managed collections manually manage the memory of objects stored within them in a private heap that is excluded from garbage collection. We introduce a special collection syntax and a novel type-safe manual memory management system for this purpose. As was the case for black-box collections, self-managed collections improve query performance by utilizing a database-inspired data layout and allowing the use of low-level techniques. By also supporting references between collection objects, they outperform black-box collections.
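The query compilation step described here is easy to picture in miniature: instead of interpreting a query plan tuple by tuple, the engine emits the source of a specialized loop once and reuses the compiled function. A Python analogue of the idea (Python stands in for the .NET runtime used in the thesis; all names are illustrative):

    def compile_query(predicate_src, projection):
        # Emit a specialized scan-filter-project loop for this query...
        cols = ", ".join("r[%r]" % c for c in projection)
        src = (
            "def query(rows):\n"
            "    out = []\n"
            "    for r in rows:\n"
            "        if " + predicate_src + ":\n"
            "            out.append((" + cols + ",))\n"
            "    return out\n"
        )
        namespace = {}
        exec(src, namespace)       # ...pay the compilation cost once...
        return namespace["query"]  # ...and reuse the fast function.

    # q = compile_query("r['age'] > 30 and r['city'] == 'Oslo'", ["name"])
    # q(list_of_dicts) then runs without per-tuple interpretation overhead.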
APA, Harvard, Vancouver, ISO, and other styles
23

Lei, Ma. "Distributed query processing using composite semijoins." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2001. http://www.collectionscanada.ca/obj/s4/f2/dsk3/ftp04/MQ62238.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
24

Katchaounov, Timour. "Query Processing for Peer Mediator Databases." Doctoral thesis, Uppsala : Acta Universitatis Upsaliensis : Univ.-bibl. [distributör], 2003. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-3687.

Full text
APA, Harvard, Vancouver, ISO, and other styles
25

Liu, Ying. "Query optimization for distributed stream processing." [Bloomington, Ind.] : Indiana University, 2007. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3274258.

Full text
Abstract:
Thesis (Ph.D.)--Indiana University, Dept. of Computer Science, 2007.
Source: Dissertation Abstracts International, Volume: 68-07, Section: B, page: 4597. Adviser: Beth Plale. Title from dissertation home page (viewed Apr. 21, 2008).
APA, Harvard, Vancouver, ISO, and other styles
26

Kissinger, Thomas, Benjamin Schlegel, Dirk Habich, and Wolfgang Lehner. "QPPT: Query Processing on Prefix Trees." Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2013. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-113269.

Full text
Abstract:
Modern database systems have to process huge amounts of data and, at the same time, should provide results with low latency. To achieve this, data is nowadays typically held entirely in main memory, to benefit from its high bandwidth and low access latency, which could never be reached with disks. Current in-memory databases are usually column stores that exchange columns or vectors between operators and suffer from a high tuple-reconstruction overhead. In this paper, we present the indexed table-at-a-time processing model that makes indexes the first-class citizen of the database system. The processing model comprises the concepts of intermediate indexed tables and cooperative operators, which make indexes the common data exchange format between plan operators. To keep the intermediate index materialization costs low, we employ optimized prefix trees that offer a balanced read/write performance. The indexed table-at-a-time processing model allows the efficient construction of composed operators like the multi-way-select-join-group. Such operators speed up the processing of complex OLAP queries, so that our approach outperforms state-of-the-art in-memory databases.
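The prefix trees at the heart of QPPT can be pictured as tries over the bytes of a key: inserts touch at most key-width nodes and lookups need no rebalancing, which is what gives the balanced read/write behaviour the abstract alludes to. A toy rendition (not the authors' optimized structure):

    class PrefixTreeIndex:
        # Index 32-bit keys by their four 8-bit chunks; inner nodes are
        # 256-way dispatch tables, leaves collect row ids per key.
        def __init__(self):
            self.root = {}

        def insert(self, key, row_id):
            node = self.root
            for shift in (24, 16, 8):
                node = node.setdefault((key >> shift) & 0xFF, {})
            node.setdefault(key & 0xFF, []).append(row_id)

        def lookup(self, key):
            node = self.root
            for shift in (24, 16, 8):
                node = node.get((key >> shift) & 0xFF)
                if node is None:
                    return []
            return node.get(key & 0xFF, [])

An operator that receives its input in this indexed form can probe it directly, so intermediate results need not be re-hashed or re-sorted between plan operators, which is the point of making indexes the common exchange format.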
APA, Harvard, Vancouver, ISO, and other styles
27

Eurviriyanukul, Kwanchai. "Adaptive query processing in pipelined plans." Thesis, University of Manchester, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.492914.

Full text
Abstract:
This thesis presents an approach that enables changes to partially evaluated pipelined Query-Execution Plans (QEPs) at run-time, in order to recover from optimisation mistakes in cardinality estimation that arise from the absence of accurate statistics. Such mistakes may lead an optimiser to select inefficient QEPs.
APA, Harvard, Vancouver, ISO, and other styles
28

Yiu, Man-lung, and 姚文龍. "Advanced query processing on spatial networks." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2006. http://hub.hku.hk/bib/B36279365.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Liu, Fuyu. "Query processing in location-based services." Doctoral diss., University of Central Florida, 2010. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/4634.

Full text
Abstract:
With the advances in wireless communication technology and advanced positioning systems, a variety of Location-Based Services (LBS) have become available to the public. Mobile users can issue location-based queries to probe their surrounding environments. One important type of query in LBS is moving monitoring queries over mobile objects. Due to the high frequency of location updates and the expensive cost of continuous query processing, server computation capacity and wireless communication bandwidth are the two limiting factors for large-scale deployment of moving object database systems. To address both of the scalability factors, distributed computing has been considered. These schemes enable moving objects to participate as peers in query processing to substantially reduce the demand on server computation and the wireless communications associated with location updates. In the first part of this dissertation, we propose a distributed framework to process moving monitoring queries over moving objects in a spatial network environment. In the second part of this dissertation, in order to reduce the communication cost, we leverage both on-demand data access and periodic broadcast to design a new hybrid distributed solution for moving monitoring queries in an open space environment. Location-based services make our daily life more convenient. However, to receive the services, one has to reveal his/her location and query information when issuing location-based queries. This could lead to a privacy breach if this personal information is possessed by some untrusted parties. In the third part of this dissertation, we introduce a new privacy protection measure called query l-diversity, and provide two cloaking algorithms to achieve both location k-anonymity and query l-diversity to better protect user privacy. In the fourth part of this dissertation, we design a hybrid three-tier architecture to help reduce privacy exposure. In the fifth part of this dissertation, we propose to use the Road Network Embedding technique to process privacy-protected queries.
ID: 029050964; System requirements: World Wide Web browser and PDF reader; Mode of access: World Wide Web; Thesis (Ph.D.)--University of Central Florida, 2010; Includes bibliographical references (p. 138-145).
Ph.D.
Doctorate
Department of Electrical Engineering and Computer Science
Engineering and Computer Science
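The cloaking algorithms mentioned in the abstract above generalize a user's exact position into a region that satisfies the two privacy measures before anything is sent to the server. A simplified grid-expansion sketch (our illustration with hypothetical user objects, not the dissertation's algorithms):

    def cloak(user, all_users, k, l):
        # Grow a square around the issuing user until it covers at least
        # k users (location k-anonymity) whose pending queries span at
        # least l distinct types (query l-diversity). Assumes the
        # population is large enough for the loop to terminate.
        half = 1.0
        while True:
            inside = [u for u in all_users
                      if abs(u.x - user.x) <= half and abs(u.y - user.y) <= half]
            if len(inside) >= k and len({u.query_type for u in inside}) >= l:
                return (user.x - half, user.y - half,
                        user.x + half, user.y + half)
            half *= 2  # double the region and retry

The server answers the query for the whole region and the client refines the candidates locally, so neither the exact location nor the query type can be pinned to a single user.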
APA, Harvard, Vancouver, ISO, and other styles
30

Zhang, Yang S. M. Massachusetts Institute of Technology. "ICEDB : intermittently-connected continuous query processing." Thesis, Massachusetts Institute of Technology, 2008. http://hdl.handle.net/1721.1/43064.

Full text
Abstract:
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008.
Includes bibliographical references (leaves 61-64).
Several emerging wireless sensor network applications must cope with a combination of node mobility (e.g., sensors on moving cars) and high data rates (media-rich sensors capturing videos, images, sounds, etc.). Due to their mobility, these sensor networks display intermittent and variable network connectivity, and often have to deliver large quantities of data relative to the bandwidth available during periods of connectivity. Unfortunately, existing distributed data management and stream processing systems are not appropriate for such applications because they assume that the network connecting nodes in the data processor is "always on," and that the absence of a network connection is a fault that needs to be masked to avoid failure. This thesis describes ICEDB (Intermittently Connected Embedded Database), a continuous query processing system for intermittently connected mobile sensor networks. ICEDB incorporates two key ideas: (1) a delay-tolerant continuous query processor, coordinated by a central server and distributed across the mobile nodes, and (2) algorithms for prioritizing certain query results to improve application-defined "utility" metrics. We describe the results of several experiments that use data collected from a deployed fleet of cabs driving in Boston.
by Yang Zhang.
S.M.
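The utility-driven prioritization in ICEDB can be pictured as a buffer that, whenever a connectivity window opens, drains results best-first rather than in arrival order. A toy sketch (the utility callback is application-defined; this is not the ICEDB implementation):

    import heapq

    class ResultBuffer:
        # Buffer results while disconnected; deliver highest-utility first.
        def __init__(self, utility):
            self.utility = utility   # maps a result to a numeric score
            self.heap = []
            self.seq = 0             # insertion counter breaks score ties

        def add(self, result):
            # heapq is a min-heap, so negate the score for max-first order.
            heapq.heappush(self.heap, (-self.utility(result), self.seq, result))
            self.seq += 1

        def drain(self, budget):
            # Send as many results as the current connectivity window allows.
            sent = []
            while self.heap and len(sent) < budget:
                sent.append(heapq.heappop(self.heap)[2])
            return sent

A media-rich deployment might, for example, score thumbnails above full-resolution frames so that even a short connection yields a usable overview.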
APA, Harvard, Vancouver, ISO, and other styles
31

Shenoy, Sreekumar Thrivikrama. "Semantic query processing in database systems." Case Western Reserve University School of Graduate Studies / OhioLINK, 1990. http://rave.ohiolink.edu/etdc/view?acc_num=case1054585075.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Xu, Cheng. "Authenticated query processing in the cloud." HKBU Institutional Repository, 2019. https://repository.hkbu.edu.hk/etd_oa/620.

Full text
Abstract:
With recent advances in data-as-a-service (DaaS) and cloud computing, outsourcing data to the cloud has become a common practice. In a typical scenario, the data owner (DO) outsources the data and delegates the query processing service to a service provider (SP). However, as the SP is often an untrusted third party, the integrity of the query results cannot be guaranteed, and it is thus imperative to authenticate them. To tackle this issue, a typical approach is to let the SP provide a cryptographic proof, which the clients can use to verify the soundness and completeness of the query results. Despite extensive research on authenticated query processing for outsourced databases, existing techniques have only considered limited query types. They fail to address a variety of needs demanded by enterprise customers, such as supporting aggregate queries over set-valued data, enforcing fine-grained access control, and using distributed computing paradigms. In this dissertation, we take the first step to comprehensively investigate authenticated query processing in the cloud that fulfills the aforementioned requirements. Security analysis and performance evaluation show that the proposed solutions and techniques are robust and efficient under a wide range of system settings.
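A standard building block behind such cryptographic proofs is the Merkle hash tree: the data owner signs the root hash, and the SP returns each answer together with the sibling hashes on its root path so the client can recompute and compare. A bare-bones sketch of that mechanism (a common construction in this literature, not necessarily the exact schemes of the dissertation):

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def merkle_root(leaves):
        # Root hash of a binary Merkle tree (odd levels pad by duplication).
        level = [h(x) for x in leaves]
        while len(level) > 1:
            if len(level) % 2:
                level.append(level[-1])
            level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        return level[0]

    def verify(leaf, index, siblings, signed_root):
        # Recompute the root from one returned record and its sibling path.
        node = h(leaf)
        for sib in siblings:
            node = h(sib + node) if index % 2 else h(node + sib)
            index //= 2
        return node == signed_root

Soundness follows because forging a record means finding a hash collision; completeness proofs additionally return the boundary records of a range so the client can check that nothing in between was dropped.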
APA, Harvard, Vancouver, ISO, and other styles
33

Tran, Van-Hoang. "Range query processing over untrustworthy clouds." Thesis, Rennes 1, 2020. http://www.theses.fr/2020REN1S072.

Full text
Abstract:
Cloud computing has increasingly become a standard for saving costs and enabling elasticity. While cloud providers expand their services, concerns about the security of outsourced data hinder cloud technologies from widespread adoption. To address this, encryption is usually used to protect confidential data stored and processed on untrustworthy clouds. Encrypting outsourced data, however, limits the functionality of applications, since support for some fundamental functions on encrypted data is still limited. This thesis focuses on the problem of supporting range queries over encrypted data stored on clouds. Many studies have been introduced in this line of work. Nevertheless, none of the prior schemes exhibits satisfactory performance for modern systems, which require not only low-latency responses but also high scalability. In particular, most existing solutions suffer from either inefficient range query processing or privacy leaks. Even if some can achieve both strong privacy protection and fast processing, they do not satisfy scalability requirements, namely high ingestion throughput, practical storage overhead, and lightweight updates. To overcome this limitation, we propose scalable solutions for secure range query processing that preserve both efficiency and strong security. Our contributions are: (1) We adapt one of the state-of-the-art solutions to the context of a high rate of incoming data, which often creates bottlenecks. In other words, we introduce and integrate the notion of index template into one of the state-of-the-art solutions so that it can cope with the target context. (2) We develop an intensive-ingestion framework dedicated to secure range query processing on encrypted data. In particular, we re-design the architecture of the first contribution to make it fully distributed. A data presentation and an asynchronous method are then introduced. Together, they significantly increase the intake ability of the system. Besides, we adapt the framework to a stronger type of adversary (e.g., online attackers) and enhance its practicality. (3) We propose a scalable scheme for private range query processing on outsourced datasets. This scheme addresses the need for a scalable solution in terms of efficiency, high security, practical storage overhead, and numerous updates, which cannot be supported by existing protocols. To this purpose, we develop our solution relying on equal-size chunks (buckets) of data and secure indexes. The former helps to protect the privacy of the underlying data from the adversary, while the latter enables efficiency. To support lightweight updates, we propose to decouple secure indexes from their buckets by using equal-size bitmaps.
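The equal-size buckets in the third contribution follow a familiar pattern: the client maps each plaintext value to a bucket id, stores (bucket id, ciphertext) pairs, and later requests every bucket overlapping a range, filtering false positives locally. A much simplified sketch (illustrative only; the actual scheme adds secure indexes and bitmap-based updates):

    BUCKET_WIDTH = 10  # illustrative fixed bucket width

    def bucket_of(value):
        return value // BUCKET_WIDTH

    def upload(values, encrypt):
        # The server stores only bucket ids and ciphertexts, never plaintext.
        return [(bucket_of(v), encrypt(v)) for v in values]

    def range_query(lo, hi, server_rows, decrypt):
        wanted = set(range(bucket_of(lo), bucket_of(hi) + 1))
        candidates = [ct for (b, ct) in server_rows if b in wanted]
        # Buckets are coarser than the range: drop false positives locally.
        return [v for v in map(decrypt, candidates) if lo <= v <= hi]

The bucket width trades privacy against efficiency: wider buckets leak less about the value distribution but force the client to decrypt and discard more false positives.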
APA, Harvard, Vancouver, ISO, and other styles
34

Cheng, James Sheung-Chak. "Efficient query processing on graph databases /." View abstract or full-text, 2008. http://library.ust.hk/cgi/db/thesis.pl?CSED%202008%20CHENG.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Unnava, Vasundhara. "Query processing in distributed database systems." Connect to resource, 1992. http://rave.ohiolink.edu/etdc/view.cgi?acc%5Fnum=osu1261314105.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Sankaranarayanan, Jagan. "Scalable query processing on spatial networks." College Park, Md.: University of Maryland, 2008. http://hdl.handle.net/1903/8130.

Full text
Abstract:
Thesis (Ph. D.) -- University of Maryland, College Park, 2008.
Thesis research directed by: Dept. of Computer Science. Title from t.p. of PDF. Includes bibliographical references. Published by UMI Dissertation Services, Ann Arbor, Mich. Also available in paper.
APA, Harvard, Vancouver, ISO, and other styles
37

Ives, Zachary G. "Efficient query processing for data integration /." Thesis, Connect to this title online; UW restricted, 2002. http://hdl.handle.net/1773/6864.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Lian, Xiang. "Efficient query processing over uncertain data /." View abstract or full-text, 2009. http://library.ust.hk/cgi/db/thesis.pl?CSED%202009%20LIAN.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Frosini, Riccardo. "Flexible query processing of SPARQL queries." Thesis, Birkbeck (University of London), 2018. http://bbktheses.da.ulcc.ac.uk/319/.

Full text
Abstract:
SPARQL is the predominant language for querying RDF data, which is the standard model for representing web data and more specifically Linked Open Data (a collection of heterogeneous connected data). Datasets in RDF form can be hard to query by a user if she does not have full knowledge of the structure of the dataset. Moreover, many datasets in Linked Data are often extracted from actual web page content, which might lead to incomplete or inaccurate data. We extend SPARQL 1.1 with two operators, APPROX and RELAX, previously introduced in the context of regular path queries. Using these operators we are able to support flexible querying over the property path queries of SPARQL 1.1. We call this new language SPARQLAR. Using SPARQLAR, users are able to query RDF data without fully knowing the structure of a dataset. APPROX and RELAX encapsulate different aspects of query flexibility: finding different answers and finding more answers, respectively. This means that users can access complex and heterogeneous datasets without the need to know precisely how the data is structured. One of the open problems we address is how to combine the APPROX and RELAX operators with a pragmatic language such as SPARQL. We also devise an implementation of a system that evaluates SPARQLAR queries in order to study the performance of the new language. We begin by defining the semantics of SPARQLAR and the complexity of query evaluation. We then present a query processing technique for evaluating SPARQLAR queries based on a rewriting algorithm and prove its soundness and completeness. During the evaluation of a SPARQLAR query we generate multiple SPARQL 1.1 queries that are evaluated against the dataset. Each such query will generate answers with a cost that indicates their distance with respect to the exact form of the original SPARQLAR query. Our prototype implementation incorporates three optimisation techniques that aim to enhance query execution performance: the first optimisation is a pre-computation technique that caches the answers of parts of the queries generated by the rewriting algorithm. These answers will then be reused to avoid the re-execution of those sub-queries. The second optimisation utilises a summary of the dataset to discard queries that are known not to return any answers. The third optimisation technique uses the query containment concept to discard queries whose answers would be returned by another query at the same or lower cost. We conclude by conducting a performance study of the system on three different RDF datasets: LUBM (Lehigh University Benchmark), YAGO and DBpedia.
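The rewriting algorithm referred to above turns one SPARQLAR query into a cost-ordered family of plain SPARQL 1.1 queries. The flavour of the process can be conveyed by relaxing a single property path, where RELAX climbs to a superproperty and APPROX applies an edit such as dropping a step (our own simplification, with a made-up ontology table):

    RELAX_COST, APPROX_COST = 1, 2
    SUPERPROPERTY = {"hasMother": "hasParent", "hasFather": "hasParent"}  # assumed

    def rewrites(path, max_cost):
        # Map each generated property path to the cheapest cost producing it;
        # cost 0 is the exact query, higher costs mean more distant answers.
        seen = {tuple(path): 0}
        frontier = [(list(path), 0)]
        while frontier:
            p, c = frontier.pop()
            for i, prop in enumerate(p):
                edits = [(p[:i] + p[i + 1:], c + APPROX_COST)]   # drop a step
                if prop in SUPERPROPERTY:                        # climb ontology
                    edits.append((p[:i] + [SUPERPROPERTY[prop]] + p[i + 1:],
                                  c + RELAX_COST))
                for q, qc in edits:
                    if qc <= max_cost and qc < seen.get(tuple(q), float("inf")):
                        seen[tuple(q)] = qc
                        frontier.append((q, qc))
        return sorted(seen.items(), key=lambda kv: kv[1])

    # rewrites(["hasMother", "worksFor"], 2) yields the exact path at cost 0,
    # the RELAX variant ["hasParent", "worksFor"] at cost 1, and so on.

Each generated path is then evaluated as an ordinary SPARQL query, and its cost is attached to the answers it returns.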
APA, Harvard, Vancouver, ISO, and other styles
40

Kotto, Kombi Roland. "Distributed query processing over fluctuating streams." Thesis, Lyon, 2018. http://www.theses.fr/2018LYSEI050/document.

Full text
Abstract:
In a Big Data context, stream processing has become a very active research domain. In order to manage ephemeral data (Velocity) arriving at important rates (Volume), specific solutions, denoted data stream management systems (DSMSs), have been developed. DSMSs take as inputs some queries, called continuous queries, defined on a set of data streams. A continuous query generates new results as long as new data arrive in input. In many application domains, data streams have input rates and distributions of values which change over time. These variations may significantly impact the processing requirements of each continuous query. This thesis takes place in the ANR project Socioplug (ANR-13-INFR-0003). In this context, we consider a collaborative platform for stream processing. Each user can submit multiple continuous queries and contributes to the execution support of the platform. However, as each processing unit supporting treatments has limited resources in terms of CPU and memory, a significant increase in input rate may cause the congestion of the system. The problem is then how to adjust resource usage dynamically to the processing requirements of each continuous query. It raises several challenges: i) how to detect a need of reconfiguration? ii) when to reconfigure the system to avoid its congestion at runtime? In this work, we are interested in the different processing steps involved in the treatment of a continuous query over a distributed infrastructure. From this global analysis, we extract mechanisms enabling dynamic adaptation of resource usage for each continuous query. We focus on automatic parallelization, or auto-parallelization, of operators composing the execution plan of a continuous query. We suggest an original approach based on the monitoring of operators and an estimation of processing requirements in the near future. Thus, we can increase (scale-out) or decrease (scale-in) the parallelism degree of operators in a proactive manner such that resource usage fits processing requirements dynamically. Compared to a static configuration defined by an expert, we show that it is possible to avoid the congestion of the system in many cases, or to delay it in the most critical cases. Moreover, we show that resource usage can be reduced significantly while delivering equivalent throughput and result quality. We also suggest combining this approach with complementary mechanisms for dynamic adaptation of continuous queries at runtime. These different approaches have been implemented within a widely used DSMS and have been tested over multiple and reproducible micro-benchmarks.
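The proactive scale-out/scale-in decision reduces to comparing a short-horizon forecast of the input rate against the per-instance throughput of an operator. A condensed sketch of that control loop (the linear forecast, the headroom factor, and the operator API are all simplifying assumptions, not the thesis's estimator):

    import math

    def forecast(rates, horizon=3):
        # Linear trend over recent observations (at least two) stands in
        # for the estimation of processing requirements in the near future.
        slope = (rates[-1] - rates[0]) / (len(rates) - 1)
        return rates[-1] + slope * horizon

    def target_parallelism(recent_rates, per_instance_throughput, headroom=0.8):
        # Keep each instance below `headroom` utilisation to absorb bursts.
        needed = forecast(recent_rates) / (per_instance_throughput * headroom)
        return max(1, math.ceil(needed))

    def adjust(operator, recent_rates):
        goal = target_parallelism(recent_rates, operator.throughput)
        if goal > operator.parallelism:
            operator.scale_out(goal)   # add instances before congestion hits
        elif goal < operator.parallelism:
            operator.scale_in(goal)    # release resources no longer needed

Acting on the forecast rather than on the current load is what makes the adaptation proactive: instances are added before queues build up and released once the predicted rate no longer justifies them.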
APA, Harvard, Vancouver, ISO, and other styles
41

Lei, Chuan. "Recurring Query Processing on Big Data." Digital WPI, 2015. https://digitalcommons.wpi.edu/etd-dissertations/550.

Full text
Abstract:
The advances in hardware, software, and networks have enabled applications from business enterprises, scientific and engineering disciplines, to social networks, to generate data at unprecedented volume, variety, velocity, and veracity. Innovation in these domains is thus now limited by the ability to analyze and discover knowledge from the collected data in a timely and scalable fashion. To facilitate such large-scale big data analytics, the MapReduce computing paradigm and its open-source implementation Hadoop is one of the most popular and widely used technologies. Hadoop's success as a competitor to traditional parallel database systems lies in its simplicity, ease-of-use, flexibility, automatic fault tolerance, superior scalability, and cost effectiveness due to its use of inexpensive commodity hardware that can scale to petabytes of data over thousands of machines. Recurring queries, repeatedly being executed for long periods of time on rapidly evolving high-volume data, have become a bedrock component in most of these analytic applications. Efficient execution and optimization techniques must be designed to assure the responsiveness and scalability of these recurring queries. In this dissertation, we thoroughly investigate topics in the area of recurring query processing on big data. We first propose a novel scalable infrastructure called Redoop that treats recurring queries over big evolving data as first-class citizens during query processing. This is in contrast to the state-of-the-art MapReduce/Hadoop system, which experiences significant challenges when faced with recurring queries, including redundant computations, significant latencies, and huge application development efforts. Redoop offers innovative window-aware optimization techniques for recurring query execution, including adaptive window-aware data partitioning, window-aware task scheduling, and inter-window caching mechanisms. Redoop retains the fault-tolerance of MapReduce via automatic cache recovery and task re-execution support as well. Second, we address the crucial need to accommodate hundreds or even thousands of recurring analytics queries that periodically execute over frequently updated data sets, e.g., latest stock transactions, new log files, or recent news feeds. For many applications, such recurring queries come with user-specified service-level agreements (SLAs), commonly expressed as the maximum allowed latency for producing results before their merits decay. On top of Redoop, we built a scalable multi-query sharing engine tailored for recurring workloads in the MapReduce infrastructure, called Helix. Helix deploys new sliced window-alignment techniques to create sharing opportunities among recurring queries without introducing additional I/O overheads or unnecessary data scans. Furthermore, Helix introduces a cost/benefit model for creating a sharing plan among the recurring queries, and a scheduling strategy for executing them to maximize SLA satisfaction. Third, recurring analytics queries tend to be expensive, especially when query processing consumes data sets in the hundreds of terabytes or more. Time-sensitive recurring queries, such as fraud detection, often come with tight response time constraints as query deadlines. Data sampling is a popular technique for computing approximate results with an acceptable error bound while reducing high-demand resource consumption and thus improving query turnaround times.
In this dissertation, we propose the first fast approximate query engine for recurring workloads in the MapReduce infrastructure, called Faro. Faro introduces two key innovations: (1) a deadline-aware sampling strategy that builds samples from the original data with reduced sample sizes compared to uniform sampling, and (2) adaptive resource allocation strategies that maximally improve the approximate results while still meeting the response time requirements specified in recurring queries. In our comprehensive experimental study of each part of this dissertation, we demonstrate the superiority of the proposed strategies over state-of-the-art techniques in scalability, effectiveness, as well as robustness.
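The deadline-aware sampling idea can be reduced to a back-of-the-envelope computation: estimate how many tuples the job can afford to touch before the deadline and draw a sample of exactly that size. A sketch under a crude linear cost model (the constants and the uniform draw are our assumptions, not Faro's actual policy):

    import random

    def affordable_sample_size(deadline_s, startup_s, per_tuple_s, n_total):
        # Tuples we can process in the time left after fixed startup costs.
        budget = max(0.0, deadline_s - startup_s)
        return min(n_total, int(budget / per_tuple_s))

    def deadline_sample(data, deadline_s, startup_s=5.0, per_tuple_s=0.001):
        k = affordable_sample_size(deadline_s, startup_s, per_tuple_s, len(data))
        return random.sample(data, k)

Since the standard error of most aggregates shrinks roughly as 1/sqrt(k), the engine can report a confidence interval alongside the approximate answer, which is what makes trading sample size for deadline compliance acceptable.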
APA, Harvard, Vancouver, ISO, and other styles
42

Yi, Peipei. "Graph query autocompletion." HKBU Institutional Repository, 2018. https://repository.hkbu.edu.hk/etd_oa/557.

Full text
Abstract:
The prevalence of graph-structured data in modern real-world applications has led to a rejuvenation of research on graph data management and analytics. Several database query languages have been proposed for textually querying graph databases. Unfortunately, formulating a graph query using any of these query languages often demands considerable cognitive effort and requires "programming" skill at least similar to programming in SQL. Yet, in a wide spectrum of graph applications consumers need to query graph data but are not proficient query writers. Hence, it is important to devise intuitive techniques that can alleviate the burden of query formulation and thus increase the usability of graph databases. In this dissertation, we take the first step to study the graph query autocompletion problem. We provide techniques that take a user's graph query as input and generate top-k query suggestions as output, to help alleviate the verbose and error-prone graph query formulation process in a visual environment. Firstly, we study visual query autocompletion for graph databases. Techniques for query autocompletion have been proposed for web search and XML search. However, a corresponding capability for graph query engines is in its infancy. We propose a novel framework for graph query autocompletion (called AutoG). The novelties of AutoG are as follows: First, we formalize query composition that specifies how query suggestions are formed. Second, we propose to increment a query with logical units called c-prime features, which are (i) frequent subgraphs and (ii) constructed from smaller c-prime features in no more than c ways. Third, we propose algorithms to rank candidate suggestions. Fourth, we propose a novel index called feature DAG (FDAG) to further optimize the ranking. Secondly, we propose user focus-based graph query autocompletion. AutoG provides suggestions that are formed by adding subgraph increments to arbitrary places of an existing user query. However, humans can only interact with a small number of recent software artifacts in hand. Hence, many such suggestions could be irrelevant. We present the GFocus framework that exploits a novel notion of user focus of graph query formulation. Intuitively, the focus is the subgraph that a user is working on. We formulate locality principles to automatically identify and maintain the focus. We propose novel monotone submodular ranking functions for generating popular and comprehensive query suggestions only at the focus. We propose efficient algorithms and an index for ranking the suggestions. Thirdly, we propose graph query autocompletion for large graphs. Graph features that have been exploited in AutoG are either absent or rare in large graphs. To address this, we present Flexible graph query autocompletion for LArge Graphs, called FLAG. We propose wildcard labels for query graphs and query suggestions. In particular, FLAG allows augmenting users' queries using subgraph increments with wildcard labels, which summarize query suggestions that have similar increment structures but different labels. We propose an efficient ranking algorithm and a novel index, called Suggestion Summarization DAG (SSDAG), to optimize the online suggestion ranking. Detailed problem analysis and extensive experimental studies consistently demonstrate the effectiveness and robustness of our proposed techniques in a broad range of settings.
APA, Harvard, Vancouver, ISO, and other styles
43

Alyoubi, Khaled Hamed. "Database query optimisation based on measures of regret." Thesis, Birkbeck (University of London), 2016. http://bbktheses.da.ulcc.ac.uk/224/.

Full text
Abstract:
The query optimiser in a database management system (DBMS) is responsible for finding a good order in which to execute the operators in a given query. However, in practice the query optimiser does not usually guarantee to find the best plan. This is often due to the non-availability of precise statistical data or inaccurate assumptions made by the optimiser. In this thesis we propose a robust approach to logical query optimisation that takes into account the unreliability in database statistics during the optimisation process. In particular, we study the ordering problem for selection operators and for join operators, where selectivities are modelled as intervals rather than exact values. As a measure of optimality, we use a concept from decision theory called minmax regret optimisation (MRO). When using interval selectivities, the decision problem for selection operator ordering turns out to be NP-hard. After investigating properties of the problem and identifying special cases which can be solved in polynomial time, we develop a novel heuristic for solving the general selection ordering problem in polynomial time. Experimental evaluation of the heuristic using synthetic data, the Star Schema Benchmark and real-world data sets shows that it outperforms other heuristics (which take an optimistic, pessimistic or midpoint approach) and also produces plans whose regret is on average very close to optimal. The general join ordering problem is known to be NP-hard, even for exact selectivities. So, for interval selectivities, we restrict our investigation to sets of join operators which form a chain and to plans that correspond to left-deep join trees. We investigate properties of the problem and use these, along with ideas from the selection ordering heuristic and other algorithms in the literature, to develop a polynomial-time heuristic tailored for the join ordering problem. Experimental evaluation of the heuristic shows that, once again, it performs better than the optimistic, pessimistic and midpoint heuristics. In addition, the results show that the heuristic produces plans whose regret is on average even closer to the optimal than for selection ordering.
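For a handful of selection operators, the minmax-regret objective can be checked by brute force: enumerate the operator orders, evaluate each against extreme points of the selectivity intervals, and keep the order whose worst-case regret is smallest. A small worked sketch under a per-tuple cost model (made-up intervals; only interval endpoints are checked, and the double enumeration is exponential, which is why the thesis develops heuristics):

    from itertools import permutations, product

    def plan_cost(order, sel):
        # Each selection processes the tuples surviving the previous ones:
        # cost = sum over positions i of prod_{j < i} sel[order_j].
        cost, surviving = 0.0, 1.0
        for op in order:
            cost += surviving
            surviving *= sel[op]
        return cost

    def minmax_regret_order(intervals):
        ops = range(len(intervals))
        scenarios = list(product(*intervals))   # endpoint combinations
        best = None
        for order in permutations(ops):
            regret = max(plan_cost(order, s) -
                         min(plan_cost(o, s) for o in permutations(ops))
                         for s in scenarios)
            if best is None or regret < best[0]:
                best = (regret, order)
        return best

    # minmax_regret_order([(0.1, 0.9), (0.4, 0.5), (0.2, 0.8)])
    # -> (smallest worst-case regret, the order achieving it)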
APA, Harvard, Vancouver, ISO, and other styles
44

Luong, Vu Ngoc Duy. "Optimisation for image processing." Thesis, Imperial College London, 2014. http://hdl.handle.net/10044/1/24904.

Full text
Abstract:
The main purpose of optimisation in image processing is to compensate for missing or corrupted image data, or to find good correspondences between input images. We note that image data essentially has infinite dimensionality and needs to be discretised at certain levels of resolution. Most image processing methods find a suboptimal solution, given the characteristics of the problem. While the general optimisation literature is vast, there does not seem to be an accepted universal method for all image problems. In this thesis, we consider three interrelated optimisation approaches that exploit the problem structures of various relaxations of three common image processing problems:

1. The first approach to the image registration problem is based on the nonlinear programming model. Image registration is an ill-posed problem and suffers from many undesired local optima. In order to remove these unwanted solutions, certain regularisers or constraints are needed. In this thesis, prior knowledge of rigid structures of the images is included in the problem using linear and bilinear constraints. The aim is to match two images while maintaining the rigid structure of certain parts of the images. A sequential quadratic programming algorithm is used, employing dimensional reduction, to solve the resulting discretised constrained optimisation problem. We show that pre-processing of the constraints can reduce problem dimensionality. Experimental results demonstrate better performance of our proposed algorithm compared to the current methods.

2. The second approach is based on discrete Markov Random Fields (MRF). MRF has been successfully used in machine learning, artificial intelligence, and image processing, including the image registration problem. In the discrete MRF model, the domain of the image problem is fixed (relaxed) to a certain range. Therefore, the optimal solution to the relaxed problem can be found in the predefined domain. The original discrete MRF is NP-hard and relaxations are needed to obtain a suboptimal solution in polynomial time. One popular approach is the linear programming (LP) relaxation. However, the LP relaxation of MRF (LP-MRF) is excessively high dimensional and contains sophisticated constraints. Therefore, even one iteration of a standard LP solver (e.g. an interior-point algorithm) may take too long to terminate. The dual decomposition technique has been used to formulate a convex, non-differentiable dual LP-MRF that has geometrical advantages. This has led to the development of first-order methods that take into account the MRF structure. The methods considered in this thesis for solving the dual LP-MRF are the projected subgradient and mirror descent using nonlinear weighted distance functions. An analysis of the convergence properties of the methods is provided, along with improved convergence rate estimates. The experiments on synthetic data and an image segmentation problem show promising results.

3. The third approach employs a hierarchy of the problem's models for computing the search directions. The first two approaches are specialised methods for image problems at a certain level of discretisation. As input images are infinite-dimensional, all computational methods require their discretisation at some level. Clearly, high resolution images carry more information, but they lead to very large scale and ill-posed optimisation problems. By contrast, although low level discretisation suffers from the loss of information, it benefits from low computational cost. In addition, a coarser representation of a fine image problem can be treated as a relaxation of the problem, i.e. the coarse problem is less ill-conditioned. Therefore, propagating a solution of a good coarse approximation to the fine problem could potentially improve the fine level. With the aim of utilising low level information within the high level process, we propose a multilevel optimisation method to solve the convex composite optimisation problem. This problem consists of the minimisation of the sum of a smooth convex function and a simple non-smooth convex function. The method iterates between fine and coarse levels of discretisation in the sense that the search direction is computed using information from either the gradient or a solution of the coarse model. We show that the proposed algorithm is a contraction on the optimal solution and demonstrate excellent performance in experiments with image restoration problems.
APA, Harvard, Vancouver, ISO, and other styles
45

Golab, Lukasz. "Sliding Window Query Processing over Data Streams." Thesis, University of Waterloo, 2006. http://hdl.handle.net/10012/2930.

Full text
Abstract:
Database management systems (DBMSs) have been used successfully in traditional business applications that require persistent data storage and an efficient querying mechanism. Typically, it is assumed that the data are static, unless explicitly modified or deleted by a user or application. Database queries are executed when issued and their answers reflect the current state of the data. However, emerging applications, such as sensor networks, real-time Internet traffic analysis, and on-line financial trading, require support for processing of unbounded data streams. The fundamental assumption of a data stream management system (DSMS) is that new data are generated continually, making it infeasible to store a stream in its entirety. At best, a sliding window of recently arrived data may be maintained, meaning that old data must be removed as time goes on. Furthermore, as the contents of the sliding windows evolve over time, it makes sense for users to ask a query once and receive updated answers over time.

This dissertation begins with the observation that the two fundamental requirements of a DSMS are dealing with transient (time-evolving) rather than static data and answering persistent rather than transient queries. One implication of the first requirement is that data maintenance costs have a significant effect on the performance of a DSMS. Additionally, traditional query processing algorithms must be re-engineered for the sliding window model because queries may need to re-process expired data and "undo" previously generated results. The second requirement suggests that a DSMS may execute a large number of persistent queries at the same time, therefore there exist opportunities for resource sharing among similar queries.

The purpose of this dissertation is to develop solutions for efficient query processing over sliding windows by focusing on these two fundamental properties. In terms of the transient nature of streaming data, this dissertation is based upon the following insight. Although the data keep changing over time as the windows slide forward, the changes are not random; on the contrary, the inputs and outputs of a DSMS exhibit patterns in the way the data are inserted and deleted. It will be shown that the knowledge of these patterns leads to an understanding of the semantics of persistent queries, lower window maintenance costs, as well as novel query processing, query optimization, and concurrency control strategies. In the context of the persistent nature of DSMS queries, the insight behind the proposed solution is that various queries may need to be refreshed at different times, therefore synchronizing the refresh schedules of similar queries creates more opportunities for resource sharing.
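The expiry bookkeeping described above is easiest to see in a time-based sliding-window aggregate: every insertion may both add a new tuple to the result and retract the contribution of tuples that have slid out of the window. A minimal sketch:

    from collections import deque

    class SlidingWindowSum:
        # Maintain the sum of the values seen in the last `window` seconds.
        def __init__(self, window):
            self.window = window
            self.items = deque()   # (timestamp, value), oldest first
            self.total = 0.0

        def insert(self, ts, value):
            self.items.append((ts, value))
            self.total += value
            # "Undo" previously generated results for expired tuples.
            while self.items and self.items[0][0] <= ts - self.window:
                self.total -= self.items.popleft()[1]

        def result(self):
            return self.total

This insertion and expiry pattern is exactly the kind of non-random change the dissertation exploits: tuples leave the window in the order they entered, so a queue suffices and no search for expired data is ever needed.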
APA, Harvard, Vancouver, ISO, and other styles
46

Thomo, Alex-Imir. "Query processing using views in semistructured databases." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2001. http://www.collectionscanada.ca/obj/s4/f2/dsk3/ftp04/MQ59343.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
47

Ding, Luping. "Metadata-aware query processing over data streams." Worcester, Mass. : Worcester Polytechnic Institute, 2008. http://www.wpi.edu/Pubs/ETD/Available/etd-042208-194826/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Wu, Hejun. "Scheduling for in-network sensor query processing /." View abstract or full-text, 2008. http://library.ust.hk/cgi/db/thesis.pl?CSED%202008%20WU.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Jonassen, Simon. "Efficient Query Processing in Distributed Search Engines." Doctoral thesis, Norges teknisk-naturvitenskapelige universitet, Institutt for datateknikk og informasjonsvitenskap, 2013. http://urn.kb.se/resolve?urn=urn:nbn:no:ntnu:diva-20206.

Full text
Abstract:
Web search engines have to deal with a rapidly increasing amount of information, high query loads and tight performance constraints. The success of a search engine depends on the speed with which it answers queries (efficiency) and the quality of its answers (effectiveness). These two metrics have a large impact on the operational costs of the search engine and the overall user satisfaction, which determine the revenue of the search engine. In this context, any improvement in query processing efficiency can reduce the operational costs and improve user satisfaction, hence improve the overall benefit. In this thesis, we elaborate on query processing efficiency, address several problems within partitioned query processing, pruning and caching, and propose several novel techniques: First, we look at term-wise partitioned indexes and address the main limitations of the state-of-the-art query processing methods. Our first approach combines the advantages of pipelined and traditional (non-pipelined) query processing. This approach assumes one disk access per posting list and traditional term-at-a-time processing. For the second approach, we follow an alternative direction and look at document-at-a-time processing of sub-queries and skipping. Subsequently, we present several skipping extensions to pipelined query processing, which, as we show, can improve the query processing performance and/or the quality of results. Then, we extend one of these methods with intra-query parallelism, which, as we show, can improve the performance at low query loads. Second, we look at skipping and pruning optimizations designed for a monolithic index. We present an efficient self-skipping inverted index designed for modern index compression methods and several query processing optimizations. We show that these optimizations can provide a significant speed-up compared to a full (non-pruned) evaluation and reduce the performance gap between disjunctive (OR) and conjunctive (AND) queries. We also propose a linear programming optimization that can further improve the I/O, decompression and computation efficiency of Max-Score. Third, we elaborate on caching in Web search engines in two independent contributions. First, we present an analytical model that finds the optimal split in a static memory-based two-level cache. Second, we present several strategies for selecting, ordering and scheduling prefetch queries and demonstrate that these can improve the efficiency and effectiveness of Web search engines. We carefully evaluate our ideas either using a real implementation or by simulation using real-world text collections and query logs. Most of the proposed techniques are found to improve the state-of-the-art in the conducted empirical studies. However, the implications and applicability of these techniques in practice need further evaluation in real-life settings.
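The Max-Score pruning mentioned here keeps, for every query term, an upper bound on the score that term can contribute; documents whose optimistic total cannot enter the current top-k are skipped without full evaluation. A compact document-at-a-time sketch (a simplified variant: production implementations also restrict candidate generation to the essential posting lists):

    import heapq

    def max_score_topk(postings, k):
        # postings: one dict {doc_id: score contribution} per query term.
        bounds = [max(p.values()) for p in postings]
        order = sorted(range(len(postings)), key=lambda t: bounds[t])
        heap = []                     # min-heap of the current top-k scores
        for doc in sorted(set().union(*(p.keys() for p in postings))):
            upper, score, pruned = sum(bounds), 0.0, False
            for t in reversed(order):                   # high-bound terms first
                upper -= bounds[t]                      # replace the bound by
                score += postings[t].get(doc, 0.0)      # the actual contribution
                if len(heap) == k and score + upper <= heap[0]:
                    pruned = True                       # cannot reach the top-k
                    break
            if pruned:
                continue
            if len(heap) < k:
                heapq.heappush(heap, score)
            elif score > heap[0]:
                heapq.heapreplace(heap, score)
        return sorted(heap, reverse=True)

The same bound bookkeeping is what narrows the gap between OR and AND semantics noted in the abstract: a disjunctive query is evaluated fully only for documents that still have a chance of ranking.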
APA, Harvard, Vancouver, ISO, and other styles
50

Mühleisen, Hannes [Verfasser]. "Architecture-independent distributed query processing / Hannes Mühleisen." Berlin : Freie Universität Berlin, 2013. http://d-nb.info/1031100261/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles