Dissertations / Theses on the topic 'Data models, storage and indexing'

To see the other types of publications on this topic, follow the link: Data models, storage and indexing.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 39 dissertations / theses for your research on the topic 'Data models, storage and indexing.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online, whenever these are available in the metadata.

Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.

1

Munishwar, Vikram P. "Storage and indexing issues in sensor networks." Diss., Online access via UMI, 2006.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
2

Ottoson, Patrik. "Geographic Indexing and Data Management for 3D-Visualisation." Doctoral thesis, Stockholm: Tekniska högskolan, 2001. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-3235.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Vasaitis, Vasileios. "Novel storage architectures and pointer-free search trees for database systems." Thesis, University of Edinburgh, 2012. http://hdl.handle.net/1842/6240.

Full text
Abstract:
Database systems research is an old and well-established field in computer science. Many of the key concepts appeared as early as the 60s, while the core of relational databases, which have dominated the database world for a while now, was solidified during the 80s. However, the underlying hardware has not displayed such stability in the same period, which means that a lot of assumptions that were made about the hardware by early database systems are not necessarily true for modern computer architectures. In particular, over the last few decades there have been two notable consistent trends in the evolution of computer hardware. The first is that the memory hierarchy of mainstream computer systems has been getting deeper, with its different levels moving away from each other, and new levels being added in between as a result, in particular cache memories. The second is that, when it comes to data transfers between any two adjacent levels of the memory hierarchy, access latencies have not been keeping up with transfer rates. The challenge is therefore to adapt database index structures so that they become immune to these two trends. The latter is addressed by gradually increasing the size of the data transfer unit; the former, by organizing the data so that it exhibits good locality for memory transfers across multiple memory boundaries. We have developed novel structures that facilitate both of these strategies. We started our investigation with the venerable B+-tree, which is the cornerstone order-preserving index of any database system, and we have developed a novel pointer-free tree structure for its pages that optimizes its cache performance and makes it immune to the page size. We then adapted our approach to the R-tree and the GiST, making it applicable to multi-dimensional data indexes as well as generalized indexes for any abstract data type. Finally, we have investigated our structure in the context of main memory alone, and have demonstrated its superiority over the established approaches in that setting too. While our research has its roots in data structures and algorithms theory, we have conducted it with a strong experimental focus, as the complex interactions within the memory hierarchy of a modern computer system can be quite challenging to model and theorize about effectively. Our findings are therefore backed by solid experimental results that verify our hypotheses and prove the superiority of our structures over competing approaches.
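To make the abstract's key idea concrete, here is a minimal sketch (in Python, not the dissertation's code) of a pointer-free search tree: the classic implicit (Eytzinger) layout, where the tree topology is implied by array positions so that search proceeds by index arithmetic alone, with no pointers to chase. It illustrates the general technique, not the specific page structure the thesis develops.

```python
def build_implicit(sorted_keys):
    """Place sorted keys into a 1-based array where the children of
    slot i are slots 2i and 2i+1 -- no pointers are stored."""
    n = len(sorted_keys)
    tree = [None] * (n + 1)
    it = iter(sorted_keys)

    def fill(i):
        if i <= n:
            fill(2 * i)        # left subtree receives the smaller keys
            tree[i] = next(it)
            fill(2 * i + 1)    # right subtree receives the larger keys

    fill(1)
    return tree

def search(tree, key):
    """Descend by index arithmetic alone."""
    i, n = 1, len(tree) - 1
    while i <= n:
        if tree[i] == key:
            return i
        i = 2 * i + (1 if key > tree[i] else 0)
    return -1

tree = build_implicit(list(range(0, 30, 2)))
print(search(tree, 14))   # slot of key 14 in the implicit tree
```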
APA, Harvard, Vancouver, ISO, and other styles
4

Jia, Yanan. "Generalized Bilinear Mixed-Effects Models for Multi-Indexed Multivariate Data." The Ohio State University, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=osu1469180629.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Habtu, Simon. "Indexing file metadata using a distributed search engine for searching files on a public cloud storage." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-232064.

Full text
Abstract:
Visma Labs AB or Visma wanted to conduct experiments to see if file metadata could be indexed for searching files on a public cloud storage. Given that storing files in a public cloud storage is cheaper than the current storage solution, the implementation could save Visma money otherwise spent on expensive storage costs. The thesis is therefore to find and evaluate an approach for indexing file metadata and searching files on a public cloud storage with the distributed search engine Elasticsearch. The architecture of the proposed solution is similar to a file service and was implemented using several containerized services. The results show that the file service solution is indeed feasible but would need further tuning and more resources to function according to the demands of Visma.
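As an illustration of the approach the abstract describes, here is a minimal sketch of indexing and searching file metadata with the official Elasticsearch Python client. The index name and field names are invented, and the keyword arguments follow the 8.x client (older clients use body=); none of these details come from the thesis.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # assumed local node

metadata = {
    "name": "invoice-2018-03.pdf",
    "path": "bucket/customer-42/invoice-2018-03.pdf",  # object key in cloud storage
    "size_bytes": 482133,
    "content_type": "application/pdf",
    "modified": "2018-03-14T09:21:00",
}
es.index(index="file-metadata", document=metadata)

# Search by file name; the hit's path points back to the blob in cloud storage.
resp = es.search(index="file-metadata", query={"match": {"name": "invoice"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["path"])
```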
APA, Harvard, Vancouver, ISO, and other styles
6

Singh, Aameek. "Secure Management of Networked Storage Services: Models and Techniques." Diss., Available online, Georgia Institute of Technology, 2007. http://etd.gatech.edu/theses/available/etd-04092007-004039/.

Full text
Abstract:
Thesis (Ph.D.)--Computing, Georgia Institute of Technology, 2008.
Liu, Ling, Committee Chair; Aberer, Karl, Committee Member; Ahamad, Mustaque, Committee Member; Blough, Douglas, Committee Member; Pu, Calton, Committee Member; Voruganti, Kaladhar, Committee Member.
APA, Harvard, Vancouver, ISO, and other styles
7

Paul, Arnab Kumar. "An Application-Attuned Framework for Optimizing HPC Storage Systems." Diss., Virginia Tech, 2020. http://hdl.handle.net/10919/99793.

Full text
Abstract:
High performance computing (HPC) is routinely employed in diverse domains, such as life sciences and geology, to simulate and understand the behavior of complex phenomena. Big data driven scientific simulations are resource intensive and require both computing and I/O capabilities at scale. There is a crucial need to revisit the HPC I/O subsystem to better optimize for, and manage, the increased pressure on the underlying storage systems from big data processing. Extant HPC storage systems are designed and tuned for a specific set of applications targeting a range of workload characteristics, but they lack the flexibility to adapt to ever-changing application behaviors. The complex nature of modern HPC storage systems, along with these ever-changing application behaviors, presents unique opportunities and engineering challenges. In this dissertation, we design and develop a framework for optimizing HPC storage systems by making them application-attuned. We select three different kinds of HPC storage systems: in-memory data analytics frameworks, parallel file systems, and object storage. We first analyze HPC application I/O behavior by studying real-world I/O traces. Next we optimize parallelism for applications running in-memory, then we design data management techniques for HPC storage systems, and finally focus on low-level I/O load balance for improving the efficiency of modern HPC storage systems.
Doctor of Philosophy
Clusters of multiple computers connected through the Internet are often deployed in industry and laboratories for large-scale data processing or computation that cannot be handled by standalone computers. In such a cluster, resources such as CPU, memory, and disks are integrated to work together. With the increase in popularity of applications that read and write tremendous amounts of data, we need a large number of disks that can interact effectively in such clusters. These form part of HPC storage systems. Such HPC storage systems are used by a diverse set of applications from organizations in a vast range of domains, from earth sciences, financial services, and telecommunications to life sciences. Therefore, an HPC storage system should perform well under the different read and write (I/O) requirements of all these different sets of applications. But current HPC storage systems do not cater to such varied I/O requirements. To this end, this dissertation designs and develops a framework for HPC storage systems that is application-attuned and thus provides much better performance than state-of-the-art HPC storage systems without such optimizations.
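As a toy illustration of the low-level I/O load balancing mentioned above (not the dissertation's actual algorithm), a greedy least-loaded placement of files onto storage targets might look like this:

```python
import heapq

def place_files(file_sizes, n_targets):
    """Greedily place each file on the currently least-loaded target."""
    heap = [(0, t) for t in range(n_targets)]   # (accumulated load, target id)
    heapq.heapify(heap)
    placement = {}
    for fid, size in enumerate(file_sizes):
        load, target = heapq.heappop(heap)      # least-loaded target so far
        placement[fid] = target
        heapq.heappush(heap, (load + size, target))
    return placement

print(place_files([700, 200, 400, 300, 900], n_targets=3))
```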
APA, Harvard, Vancouver, ISO, and other styles
8

Regin, Måns, and Emil Gunnarsson. "Refactoring Existing Database Layers for Improved Performance, Readability and Simplicity." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-105277.

Full text
Abstract:
Since the late 90s, support and services at SAAB have produced and maintained a product called ELDIS. ELDIS is an application used by the Swedish Armed Forces and by flight technicians at air bases in Sweden. It displays electrical information, wire diagrams, and detailed information for cables, electrical equipment, and other electrical devices. The main problem for ELDIS is that, when drawing wire diagrams in the application, the stored procedures take too long to retrieve information from the database. There are two significant areas in this project: analyzing and optimizing stored procedures, and implementing a client-side solution. This project aims to guide SAAB in choosing the right approach for solving the performance issue of the application, and to illustrate some of the problems slow stored procedures can cause for companies in general. This project has optimized the most used stored procedure at SAAB and compared it to a client-side solution and the original application. The result of this project is that both the optimized stored procedure implementation and the client-side implementation are faster options than the original implementation. It also highlights that, when trying to optimize stored procedures, indexing on the database should be considered for increasing the performance of a stored procedure.
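To make the indexing point concrete, a minimal sketch using SQLite for brevity (ELDIS does not use SQLite; the table and column names are invented) shows how adding an index turns a full table scan into an index search:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE wire (id INTEGER, diagram_id INTEGER, label TEXT)")
con.executemany("INSERT INTO wire VALUES (?, ?, ?)",
                [(i, i % 500, f"w{i}") for i in range(50_000)])

query = "SELECT * FROM wire WHERE diagram_id = 42"
print(con.execute("EXPLAIN QUERY PLAN " + query).fetchall())  # full table scan

con.execute("CREATE INDEX idx_wire_diagram ON wire (diagram_id)")
print(con.execute("EXPLAIN QUERY PLAN " + query).fetchall())  # index search
```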
APA, Harvard, Vancouver, ISO, and other styles
9

Chan, Wing Sze. "Semantic search of multimedia data objects through collaborative intelligence." HKBU Institutional Repository, 2010. http://repository.hkbu.edu.hk/etd_ra/1171.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Tandon, Ashish. "Analysis and optimization of data storage using enhanced object models in the .NET framework." Thesis, Edinburgh Napier University, 2007. http://researchrepository.napier.ac.uk/Output/4047.

Full text
Abstract:
The purpose of this thesis is to benchmark the database to examine and analyze its performance using Microsoft COM+, the component framework most commonly used for developing component-based applications. A prototype application written in Microsoft Visual C#.NET was used to benchmark database performance on Microsoft .NET Framework 2.0 and 3.0, using data volumes ranging from low (100 rows) to high (10,000 rows), with five or ten user connections. Different types of application were used (COM+, non-COM+ and plain .NET) to show their performance on the different volumes of data, with the specified numbers of users, on .NET Framework 2.0 and 3.0. The results were collected and analyzed using operating-system performance counters and Microsoft .NET class libraries, which also help in collecting system-level performance information. This can be beneficial to developers, stakeholders and management in deciding the right technology to be used in conjunction with a database. The experiments conducted in this project resulted in substantial gains in the performance, scalability and availability of component-based applications using Microsoft COM+ features such as object pooling, application pooling, role-based security, transaction isolation and enabled constructors. The outcome of this project is that Microsoft COM+ component-based applications provide optimized database performance using SQL Server. There is a performance gain of at least 10% in the COM+ based application compared to the non-COM+ based application. COM+ services features come at a performance penalty: applications using role-based security, enabled constructors and transaction isolation were slower than the plain COM+ based application by around 15%, 20% and 35% respectively. The COM+ based application shows performance gains of around 15% and 45% on low and medium volumes of data on .NET Framework 2.0 in comparison to 3.0. There is a significant gain for the COM+ server-based application on .NET Framework 3.0, of around 10%, on high volumes of data, which indicates that high-volume applications work better with Framework 3.0 than with 2.0 on SQL Server. The application-type results show that COM+ component-based applications provide better performance than non-COM+ and plain .NET applications: the difference on low and medium volumes of data was around 20% and 30%, while .NET based applications perform better on high volumes of data, with a gain of around 10%. Much the same results were obtained in the tests conducted on MS Access, where the COM+ based application running under .NET Framework 2.0 performed better than the non-COM+ and .NET based applications on low and medium volumes of data, and the .NET Framework 3.0 based COM+ application performed better on high volumes of data.
APA, Harvard, Vancouver, ISO, and other styles
11

Fritz, Eric Ryan. "Relational database models and other software and their importance in data analysis, storage, and communication." [Ames, Iowa : Iowa State University], 2009. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:1468081.

Full text
APA, Harvard, Vancouver, ISO, and other styles
12

Caliguri, Ryan P. "Comparison of Sensible Water Cooling, Ice Building, and Phase Change Material in Thermal Energy Storage Tank Charging: Analytical Models and Experimental Data." University of Cincinnati / OhioLINK, 2021. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1627666292483648.

Full text
APA, Harvard, Vancouver, ISO, and other styles
13

Wu, Bruce Jiinpo. "The effects of data models and conceptual models of the structured query language on the task of query writing by end users." Thesis, University of North Texas, 1991. https://digital.library.unt.edu/ark:/67531/metadc332680/.

Full text
Abstract:
This research is an empirical investigation of human factors in the use of database systems. The problem motivating the study is the difficulty encountered by end users in retrieving data from a database.
APA, Harvard, Vancouver, ISO, and other styles
14

Nobles, Royce Anthony. "Evaluation of spelling correction and concept-based searching models in a data entry application." View electronic thesis (PDF), 2009. http://dl.uncw.edu/etd/2009-2/noblesr/roycenobles.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Slabber, Frans Bresler. "Semi-automated extraction of structural orientation data from aerospace imagery combined with digital elevation models." Thesis, Rhodes University, 1996. http://hdl.handle.net/10962/d1005614.

Full text
Abstract:
A computer-based method for determining the orientation of planar geological structures from remotely sensed images, utilizing digital geological images and digital elevation models (DEMs), is developed and assessed. The method relies on operator skill and experience to recognize geological structure traces on images, and then employs software routines (GEOSTRUC©) to calculate the orientation of selected structures. The operator selects three points on the trace of a planar geological feature as seen on a digital geological image that is co-registered with a DEM of the same area. The orientation of the plane that contains the three points is determined using vector algebra equations. The program generates an ASCII data file which contains the orientation data as well as the geographical location of the measurements. This ASCII file can then be utilized in further analysis of the orientation data. The software development kit (SDK) for TNTmips v5.00, from MicroImages Inc. and operating in the X Windows environment, was employed to construct the software. The Watcom C/C++ Development Environment was used to generate the executable program, GEOSTRUC©. GEOSTRUC© was tested in two case studies. The case studies utilized digital data derived from different techniques and from different sources, varying in scale and resolution. This was done to illustrate the versatility of the program and its application to a wide range of data types. On the whole, the results obtained using the GEOSTRUC© analyses compare favourably to field data from each test area. Use of the method to determine the orientation of axial planes in the case study revealed its usefulness as a powerful analytic tool on a macroscopic scale. The method should not be applied in areas with low variation in relief, as it proved to be less accurate there. Advancements in imaging technology will serve to create images with better resolution, which will, in turn, improve the overall accuracy of the method.
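The vector-algebra step the abstract describes can be sketched as follows, assuming x = east, y = north, z = up (GEOSTRUC's exact conventions and equations may differ): the normal of the plane through the three points gives the dip direction and dip angle.

```python
import numpy as np

def plane_orientation(p1, p2, p3):
    """Dip direction (azimuth, degrees) and dip angle of the plane
    through three (x, y, z) points."""
    p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p1, p2, p3))
    n = np.cross(p2 - p1, p3 - p1)          # normal of the plane
    if n[2] < 0:
        n = -n                              # force the normal to point upward
    dip = np.degrees(np.arctan2(np.hypot(n[0], n[1]), n[2]))
    dip_direction = np.degrees(np.arctan2(n[0], n[1])) % 360.0
    return dip_direction, dip

# Three points picked along a structure trace on a co-registered image/DEM:
print(plane_orientation((0, 0, 100), (100, 0, 90), (0, 100, 80)))
```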
APA, Harvard, Vancouver, ISO, and other styles
16

Maples, Glenn (Glenn Edward). "Information System Quality: An Examination of Service-Based Models and Alternatives." Thesis, University of North Texas, 1997. https://digital.library.unt.edu/ark:/67531/metadc277952/.

Full text
Abstract:
Service quality as a component of overall Information Systems quality is examined. Three related studies test the SERVQUAL and related instruments (SERVPERF and Importance-weighted SERVPERF) using Information System users. SERVPERF outperformed SERVQUAL in all three studies.
APA, Harvard, Vancouver, ISO, and other styles
17

Munalula, Themba. "Measuring the applicability of Open Data Standards to a single distributed organisation: an application to the COMESA Secretariat." Thesis, University of Cape Town, 2008. http://pubs.cs.uct.ac.za/archive/00000461/.

Full text
Abstract:
Open data standardization has many known benefits, including the availability of tools for standard encoding formats, interoperability among systems and long term preservation of data. Mark-up languages and their use on the World Wide Web have implied further ease for data sharing. The Extensible Markup Language (XML), in particular, has succeeded due to its simplicity and ease of use. Its primary purpose is to facilitate the sharing of data across different information systems, particularly systems connected via the Internet. Whether open and standardized or not, organizations generate data daily. Offline exchange of documents and data is undertaken using existing formats that are typically defined by the organizations that generate the data in the documents. With the Internet, the realization of data exchange has had a direct implication on the need for interoperability and comparability. As much as standardization is the accepted approach for online data exchange, little is understood about how a specific organization's data "fits" a given data standard. This dissertation develops data metrics that represent the extent to which data standards can be applied to an organization's data. The research identified key issues that affect data interoperability or the feasibility of a move towards interoperability, and tested the unwritten rule that organizations tend to design data requirements more around internal needs than interoperability needs. Essentially, by generating metrics over a number of data attributes, the research quantified the extent of the gap that exists between organizational data and data standards. Key data attributes, i.e. completeness, concise representation, relevance and complexity, were selected and used as the basis for metric generation. In addition to the attribute-based metrics, hybrid metrics representing a measure of the "goodness of fit" of the source data to standard data were generated. Regarding the completeness attribute, it was found that most Common Market for Eastern and Southern Africa (COMESA) head office data clusters had lower than desired metrics, matching the gap highlighted above. The same applied to the concise representation attribute: most data clusters had more concise representation in the COMESA data than in the data standard. The complexity metrics generated confirmed that the number of data elements is a key determinant in any move towards the adoption of data standards. This was also borne out by the magnitude of the hybrid metrics, which to some extent depended on the complexity metrics. An additional contribution of the research was the inclusion of expert users' weights on the data elements and the recalculation of all metrics. A comparison with the unweighted metrics yielded a mixed picture. Among the completeness metrics, and for the data retention rate in particular, increases were recorded for data clusters in which greater weight was allocated to mapped elements than to unmapped ones. The same applied to the relative elements ratio. The complexity metrics showed general declines when user-weighted elements were used in the computation as opposed to unweighted elements. This again is because these metrics depend on the number of elements: in the unweighted case the weights were evenly distributed, while in the weighted case some elements were given lower weights by the expert users, leading to an overall decline in the metric.
A number of implications emerged for COMESA. COMESA would have to determine the extent to which its source data rely on data sources for which international standards are being promoted. Secondly, an inventory of users and collectors of the data COMESA uses is necessary in order to determine who would benefit from a standards-based information system. Thirdly, and from an organizational perspective, COMESA needs to designate a team to guide the creation of such a standards-based information system. Lastly, there is a need for involvement in the consortia responsible for these data standards, which has implications for organizational resources. In totality, this research provides a methodology for determining the feasibility of a move towards standardization, and hence makes it possible to answer the critical first-stage questions such a move raises.
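One plausible, simplified reading of the completeness metric discussed above, including the expert-weighted variant (the dissertation's exact formulas may differ):

```python
def completeness(source_elements, standard_elements):
    """Share of the standard's data elements covered by the source."""
    mapped = set(source_elements) & set(standard_elements)
    return len(mapped) / len(standard_elements)

def weighted_completeness(source_elements, standard_weights):
    """Expert-weighted variant: standard_weights maps element -> weight."""
    total = sum(standard_weights.values())
    covered = sum(w for e, w in standard_weights.items() if e in source_elements)
    return covered / total

src = {"country", "year", "trade_value"}                 # hypothetical source fields
std = {"country", "year", "trade_value", "hs_code", "partner"}
print(completeness(src, std))                            # 0.6
print(weighted_completeness(src, {e: 1.0 for e in std})) # 0.6 with flat weights
```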
APA, Harvard, Vancouver, ISO, and other styles
18

Dawson, Linda Louise 1954. "An investigation of the use of object-oriented models in requirements engineering practice." Monash University, School of Information Management and Systems, 2001. http://arrow.monash.edu.au/hdl/1959.1/8031.

Full text
APA, Harvard, Vancouver, ISO, and other styles
19

Xiong, Li. "Resilient Reputation and Trust Management: Models and Techniques." Diss., Georgia Institute of Technology, 2005. http://hdl.handle.net/1853/7483.

Full text
Abstract:
The continued advances in service-oriented computing and global communications have created a strong technology push for online information sharing and business transactions among enterprises, organizations and individuals. While these communities offer enormous opportunities, they also present potential threats due to a lack of trust. Reputation systems provide a way for building trust through social control by harnessing the community knowledge in the form of feedback. Although feedback-based reputation systems help community participants decide who to trust and encourage trustworthy behavior, they also introduce vulnerabilities due to potential manipulations by dishonest or malicious players. Therefore, building an effective and resilient reputation system remains a big challenge for the wide deployment of service-oriented computing. This dissertation proposes a decentralized reputation based trust supporting framework called PeerTrust, focusing on models and techniques for resilient reputation management against feedback aggregation related vulnerabilities, especially feedback sparsity with potential feedback manipulation, feedback oscillation, and loss of feedback privacy. This dissertation research has made three unique contributions for building a resilient decentralized reputation system. First, we develop a core reputation model with important trust parameters and a coherent trust metric for quantifying and comparing the trustworthiness of participants. We develop decentralized strategies for implementing the trust model in an efficient and secure manner. Second, we develop techniques countering potential vulnerabilities associated with feedback aggregation, including a similarity inference scheme to counter feedback sparsity with potential feedback manipulations, and a novel metric based on the Proportional, Integral, and Derivative (PID) model to handle strategic oscillating behavior of participants. Third but not least, we develop privacy-conscious trust management models and techniques to address the loss of feedback privacy. We develop a set of novel probabilistic decentralized privacy-preserving computation protocols for important primitive operations. We show how feedback aggregation can be divided into individual steps that utilize the above primitive protocols through an example reputation algorithm based on kNN classification. We perform experimental evaluations for each of the schemes we proposed and show the feasibility, effectiveness, and cost of our approach. The PeerTrust framework presents an important step forward with respect to developing attack-resilient reputation trust systems.
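As a rough illustration of the PID idea named in the abstract (the gains and details below are invented; PeerTrust's actual metric is defined in the dissertation), a participant's score can combine the current feedback (P), its running history (I) and its trend (D), so that sudden behaviour changes are penalized quickly:

```python
def pid_reputation(feedback, kp=0.6, ki=0.3, kd=0.1):
    """Score a participant from a sequence of feedback values in [0, 1]."""
    integral, prev, scores = 0.0, feedback[0], []
    for t, f in enumerate(feedback, start=1):
        integral += f
        p = f                  # proportional: latest feedback
        i = integral / t       # integral: long-run average
        d = f - prev           # derivative: sudden behaviour change
        scores.append(kp * p + ki * i + kd * d)
        prev = f
    return scores

# An oscillating participant: builds trust, then milks it.
print(pid_reputation([1, 1, 1, 1, 0, 0, 1, 1, 0, 0]))
```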
APA, Harvard, Vancouver, ISO, and other styles
20

Mohammed, Jafaru. "Impact of Solar Resource and Atmospheric Constituents on Energy Yield Models for Concentrated Photovoltaic Systems." Thèse, Université d'Ottawa / University of Ottawa, 2013. http://hdl.handle.net/10393/24342.

Full text
Abstract:
Global economic trends suggest that there is a need to generate sustainable renewable energy to meet growing global energy demands. Solar energy harnessed by concentrated photovoltaic (CPV) systems has the potential to contribute strongly to future energy supplies. However, as a relatively new technology, there is still a need for considerable research into the relationship between the technology and the solar resource. Research into CPV systems was carried out at the University of Ottawa's Solar Cells and Nanostructured Device Laboratory (SUNLAB), focusing on the acquisition and assessment of meteorological and local solar resource datasets as inputs to more complex system (cell) models for energy yield assessment. An algorithm aimed at estimating the spectral profile of direct normal irradiance (DNI) was created. The algorithm was designed to use easily sourced low-resolution meteorological datasets, temporal band-pass filter measurements and an atmospheric radiative transfer model to determine a location-specific solar spectrum. Its core design involved the use of an optical depth parameterization algorithm based on a published objective regression algorithm. Initial results showed a spectral agreement that corresponds to a 0.56% photo-current difference in a modeled CPV cell when compared to the measured spectrum. The common procedures and datasets used for long-term CPV energy yield assessment were also investigated, with the aim of quantitatively de-convoluting the various factors, especially meteorological factors, responsible for error bias in CPV energy yield evaluation. Over the time period from June 2011 to August 2012, the analysis found that neglecting spectral variations resulted in a ~2% overestimation of energy yields. It was shown that clouds have the dominant impact on CPV energy yields, at the 60% level.
APA, Harvard, Vancouver, ISO, and other styles
21

Camacho, Rodriguez Jesus. "Efficient techniques for large-scale Web data management." Thesis, Paris 11, 2014. http://www.theses.fr/2014PA112229/document.

Full text
Abstract:
The recent development of commercial cloud computing environments has strongly impacted research and development in distributed software platforms. Cloud providers offer a distributed, shared-nothing infrastructure that may be used for data storage and processing. In parallel with the development of cloud platforms, programming models that seamlessly parallelize the execution of data-intensive tasks over large clusters of commodity machines have received significant attention, starting with the MapReduce model, very well known by now, and continuing through other novel and more expressive frameworks. As these models are increasingly used to express analytical-style data processing tasks, the need arises for higher-level languages that ease the burden of writing complex queries for these systems. This thesis investigates the efficient management of Web data on large-scale infrastructures. In particular, we study the performance and cost of exploiting cloud services to build Web data warehouses, and the parallelization and optimization of query languages that are tailored towards querying Web data declaratively. First, we present AMADA, an architecture for warehousing large-scale Web data in commercial cloud platforms. AMADA operates in a Software as a Service (SaaS) approach, allowing users to upload, store, and query large volumes of Web data. Since cloud users support monetary costs directly connected to their consumption of resources, our focus is not only on query performance from an execution time perspective, but also on the monetary costs associated to this processing. In particular, we study the applicability of several content indexing strategies, and show that they lead not only to reducing query evaluation time, but also, importantly, to reducing the monetary costs associated with the exploitation of the cloud-based warehouse. Second, we consider the efficient parallelization of the execution of complex queries over XML documents, implemented within our system PAXQuery. We provide novel algorithms showing how to translate such queries into plans expressed in the PArallelization ConTracts (PACT) programming model. These plans are then optimized and executed in parallel by the Stratosphere system. We demonstrate the efficiency and scalability of our approach through experiments on hundreds of GB of XML data. Finally, we present a novel approach for identifying and reusing common subexpressions occurring in Pig Latin scripts. In particular, we lay the foundation of our reuse-based algorithms by formalizing the semantics of the Pig Latin query language with extended nested relational algebra for bags. Our algorithm, named PigReuse, operates on the algebraic representations of Pig Latin scripts, identifies subexpression merging opportunities, selects the best ones to execute based on a cost function, and merges other equivalent expressions to share their results. We bring several extensions to the algorithm to improve its performance. Our experimental results demonstrate the efficiency and effectiveness of our reuse-based algorithms and optimization strategies.
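The subexpression-identification step can be sketched as follows: hash each subtree of the scripts' algebraic plans and report subtrees occurring under more than one root. The plan encoding is invented for the example, and the cost-based selection and merging that PigReuse also performs are omitted.

```python
from collections import defaultdict

def common_subexpressions(roots):
    seen = defaultdict(set)                 # canonical subtree -> root ids

    def canon(node, rid):
        if isinstance(node, str):           # leaf: a relation name
            return node
        op, *children = node
        key = (op, tuple(canon(c, rid) for c in children))
        seen[key].add(rid)
        return key

    for rid, root in enumerate(roots):
        canon(root, rid)
    return [k for k, rids in seen.items() if len(rids) > 1]

# Two "scripts" sharing a LOAD+FILTER prefix:
s1 = ("JOIN", ("FILTER", ("LOAD", "logs")), ("LOAD", "users"))
s2 = ("GROUP", ("FILTER", ("LOAD", "logs")))
print(common_subexpressions([s1, s2]))
```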
APA, Harvard, Vancouver, ISO, and other styles
22

Černý, Petr. "Vyhledávání ve videu." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2012. http://www.nusl.cz/ntk/nusl-236590.

Full text
Abstract:
This thesis summarizes information retrieval theory and the basics of the relational model, and focuses on data indexing in relational database systems, with an emphasis on searching multimedia data. It includes a description of automatic extraction of multimedia content and of multimedia data indexing. The practical part discusses the design and implementation of a solution for improving the efficiency of similarity queries over the multidimensional vectors that describe multimedia data. The final part discusses experiments with this solution.
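The kind of query such a system accelerates, shown here in brute-force form (an index structure would prune this linear scan; the descriptor dimensionality is arbitrary):

```python
import numpy as np

def top_k(query, vectors, k=3):
    """Rank stored feature vectors by cosine similarity to a query vector."""
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = v @ q
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 128))     # 1000 objects, 128-d descriptors
print(top_k(rng.normal(size=128), db))
```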
APA, Harvard, Vancouver, ISO, and other styles
23

Zampetakis, Stamatis. "Scalable algorithms for cloud-based Semantic Web data management." Thesis, Paris 11, 2015. http://www.theses.fr/2015PA112199/document.

Full text
Abstract:
In order to build smart systems, where machines are able to reason much like humans, data with semantics is a major requirement. This need led to the advent of the Semantic Web, which proposes standard ways for representing and querying data with semantics. RDF is the prevalent data model used to describe web resources, and SPARQL is the query language that allows expressing queries over RDF data. Being able to store and query data with semantics triggered the development of many RDF data management systems. The rapid evolution of the Semantic Web provoked the shift from centralized data management systems to distributed ones. The first systems to appear relied on P2P and client-server architectures, while recently the focus moved to cloud computing. Cloud computing environments have strongly impacted research and development in distributed software platforms. Cloud providers offer distributed, shared-nothing infrastructures that may be used for data storage and processing. The main features of cloud computing involve scalability, fault tolerance, and elastic allocation of computing and storage resources following the needs of the users. This thesis investigates the design and implementation of scalable algorithms and systems for cloud-based Semantic Web data management. In particular, we study the performance and cost of exploiting commercial cloud infrastructures to build Semantic Web data repositories, and the optimization of SPARQL queries for massively parallel frameworks. First, we introduce the basic concepts around the Semantic Web and the main components and frameworks interacting in massively parallel cloud-based systems. In addition, we provide an extended overview of existing RDF data management systems in the centralized and distributed settings, emphasizing the critical concepts of storage, indexing, query optimization, and infrastructure. Second, we present AMADA, an architecture for RDF data management using public cloud infrastructures. We follow the Software as a Service (SaaS) model, where the complete platform is running in the cloud and appropriate APIs are provided to the end users for storing and retrieving RDF data. We explore various storage and querying strategies, revealing pros and cons with respect to performance and also to monetary cost, which is an important new dimension to consider in public cloud services. Finally, we present CliqueSquare, a distributed RDF data management system built on top of Hadoop, incorporating a novel optimization algorithm that is able to produce massively parallel plans for SPARQL queries. We present a family of optimization algorithms, relying on n-ary (star) equality joins to build flat plans, and compare their ability to find the flattest plans possible. Inspired by existing partitioning and indexing techniques, we present a generic storage strategy suitable for storing RDF data in HDFS (the Hadoop Distributed File System). Our experimental results validate the efficiency and effectiveness of the optimization algorithm, and demonstrate the overall performance of the system.
APA, Harvard, Vancouver, ISO, and other styles
24

Amaral, Simone Silmara Werner Gurgel do. "Modelos lineares mistos para análise de dados longitudinais bivariados provenientes de ensaios agropecuários." Universidade de São Paulo, 2013. http://www.teses.usp.br/teses/disponiveis/11/11134/tde-22112013-105455/.

Full text
Abstract:
In longitudinal studies, repeated measurements of a response variable are taken on the same experimental unit over time. Since different observations are measured on the same experimental unit, it is expected that there is correlation among the repeated measurements and heterogeneity of variances across occasions. Multivariate longitudinal data are obtained when a number of different response variables are measured on the same experimental unit repeatedly over time; in this case, we should also observe a correlation between the different response variables. One way to analyze bivariate longitudinal data is to use a mixed model for each of the response variables and unite them in a bivariate mixed model by specifying the joint distribution of the random effects. Parameter estimates of this common distribution may be used to evaluate the relationship between the different responses. As an example of the use of the technique, UHT milk storage data were used. Models were fitted using SAS software and the graphical analysis was done with the R software. For model selection, the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) were used, and the likelihood ratio test was used to compare nested models. The use of the bivariate linear mixed model allowed modeling the heterogeneity of variances across occasions, the correlation between measurements on the same experimental unit, and also the correlation between the different response variables.
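In generic notation, a sketch of such a bivariate model (assuming one random effect vector per response and Gaussian errors; the thesis's exact parameterization may differ) couples the two responses through the joint distribution of their random effects:

```latex
% Bivariate linear mixed model: one mixed model per response, joined
% through a common distribution for the random effects of unit i.
\begin{align*}
y_{1ij} &= x_{1ij}^{\top}\beta_1 + z_{1ij}^{\top}b_{1i} + \varepsilon_{1ij},\\
y_{2ij} &= x_{2ij}^{\top}\beta_2 + z_{2ij}^{\top}b_{2i} + \varepsilon_{2ij},\qquad
\begin{pmatrix} b_{1i}\\ b_{2i} \end{pmatrix}
 \sim N\!\left(\mathbf{0},\;
 \begin{pmatrix} \sigma_{b_1}^{2} & \sigma_{b_1 b_2}\\
                 \sigma_{b_1 b_2} & \sigma_{b_2}^{2} \end{pmatrix}\right).
\end{align*}
% The covariance sigma_{b_1 b_2} carries the association between the two
% responses measured on the same experimental unit i at occasion j.
```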
APA, Harvard, Vancouver, ISO, and other styles
25

Douieb, Karim. "Hotlinks and dictionaries." Doctoral thesis, Universite Libre de Bruxelles, 2008. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/210471.

Full text
Abstract:
Knowledge has always been a decisive factor in humankind's social evolution. Collecting the world's knowledge is one of the greatest challenges of our civilization. Knowledge involves the use of information, but information is not knowledge: it is a way of acquiring and understanding information. Improving the visibility and the accessibility of information requires organizing it efficiently. This thesis focuses on this general purpose.

A fundamental objective of computer science is to store and retrieve information efficiently. This is known as the dictionary problem. A dictionary asks for a data structure which essentially allows the search operation. In general, information that is important and popular at a given time has to be accessed faster than less relevant information. This can be achieved by periodically reorganizing the data structure so that relevant information is located closer to the search starting point. The second part of this thesis is devoted to the development and the understanding of self-adjusting dictionaries in various models of computation. In particular, we focus our attention on dictionaries which do not have any knowledge of the future accesses. Those dictionaries have to adapt themselves to be competitive with dictionaries specifically tuned for a given access sequence.

This approach, which transforms the information structure, is not always feasible; for instance, the structure may be based on the semantics of the information, such as a categorization. In this context, the search procedure is linked to the structure itself, and modifying the structure affects how a search is performed. A solution developed to improve search in static structures is hotlink assignment: a way to enhance a structure without altering its original design, speeding up search by creating shortcuts in the structure. The first part of this thesis is devoted to this approach.
Doctorate in Sciences

APA, Harvard, Vancouver, ISO, and other styles
26

Ton, That Dai Hai. "Gestion efficace et partage sécurisé des traces de mobilité." Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLV003/document.

Full text
Abstract:
Nowadays, advances in the development of mobile devices and embedded sensors have enabled an unprecedented number of services for the user. At the same time, most mobile devices continuously generate, store and communicate a large amount of personal information. While managing personal information on mobile devices is still a big challenge, sharing and accessing this information in a safe and secure way remains an open and hot topic. Personal mobile devices come in various form factors such as mobile phones, smart devices, stick computers and secure tokens, and can be used to record, sense and store data about the user's context or surrounding environment. The most common contextual information is the user's location. Personal data generated and stored on these devices is valuable for many applications and services, but it is sensitive and needs to be protected in order to ensure individual privacy. In particular, most mobile applications have access to accurate and real-time location information, raising serious privacy concerns for their users. In this dissertation, we dedicate two parts to managing location traces, i.e. spatio-temporal data, on mobile devices. In particular, we offer an extension of spatio-temporal data types and operators for embedded environments. These data types reconcile the features of spatio-temporal data with embedded requirements by offering an optimal data representation, called the spatio-temporal object (STOB), dedicated to embedded devices. More importantly, in order to optimize query processing, we also propose an efficient indexing technique for spatio-temporal data called TRIFL, designed for flash storage. TRIFL stands for TRajectory Index for FLash memory. It exploits unique properties of trajectory insertion and optimizes the data structure for the behavior of flash and the buffer cache. These ideas allow TRIFL to achieve much better performance on both flash and magnetic storage than its competitors. Additionally, in the remaining part of this thesis we investigate protecting the user's sensitive information by offering a privacy-aware protocol for participatory sensing applications called PAMPAS. PAMPAS relies on secure hardware solutions and proposes a user-centric privacy-aware protocol that fully protects personal data while taking advantage of distributed computing. To this end, we also propose a partitioning algorithm and an aggregation algorithm in PAMPAS. This combination drastically reduces the overall costs, making it possible to run the protocol in near real-time with a large number of participants, without any personal information leakage.
APA, Harvard, Vancouver, ISO, and other styles
27

Pacheco, Urubatan Rocha. "Análise de redes sociais em dados bibliográficos." [s.n.], 2010. http://repositorio.unicamp.br/jspui/handle/REPOSIP/275784.

Full text
Abstract:
Advisor: Ricardo de Oliveira Anido
Dissertation (Master's) - Universidade Estadual de Campinas, Instituto de Computação
Abstract: This work performs social network analysis of scientific collaborations extracted from bibliographic databases. The analysis also includes the authors' affiliation with scientific institutions, and the relation of their publications with the main scientific venues and with research-subject ontologies. We studied and applied methods that use social network analysis to solve or mitigate the problem of ambiguity in researchers' identities, and applied the same methods to names of institutions, scientific meeting venues, country/state names, etc. Another subject of study was measuring the quality of the results. Finally, we developed metrics and implemented tools that allow comparing the scientific production of institutions, research groups, research fields, countries, etc. The tools also produced a ranking of universities based on the prestige of their researchers in the co-authorship social network. These results demonstrate that the prestige structural information was properly captured, as shown by its correlation with other rankings that assess the quality of universities' scientific production using similar criteria.
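The abstract does not spell out its prestige measure; weighted PageRank on the co-authorship graph is one standard choice, sketched here with networkx and invented author names.

```python
import networkx as nx

# Co-authorship graph: nodes are researchers, edge weights count joint papers.
G = nx.Graph()
G.add_weighted_edges_from([
    ("a_silva", "b_souza", 3),
    ("a_silva", "c_lima", 1),
    ("b_souza", "c_lima", 2),
    ("c_lima", "d_rocha", 1),
])

# PageRank as one possible prestige score on the co-authorship network.
prestige = nx.pagerank(G, weight="weight")
for author, score in sorted(prestige.items(), key=lambda kv: -kv[1]):
    print(f"{author}: {score:.3f}")
```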
Master's
Methodology and Techniques of Computing
Master in Computer Science
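The abstract does not name the prestige measure used; a PageRank-style centrality over the co-authorship graph is one standard way to capture "prestige", so the following sketch uses it purely as an illustration, on an invented toy graph. University scores could then be aggregated from the scores of their affiliated authors:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Illustrative PageRank-style prestige scores on an undirected
    co-authorship graph given as {author: [coauthors, ...]}.
    Assumes every author has at least one coauthor."""
    n = len(graph)
    rank = {a: 1.0 / n for a in graph}
    for _ in range(iterations):
        new_rank = {}
        for a in graph:
            # An author inherits prestige from each coauthor, split
            # evenly over that coauthor's own collaborations.
            incoming = sum(rank[c] / len(graph[c]) for c in graph[a])
            new_rank[a] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank

coauthors = {
    "alice": ["bob", "carol"],
    "bob":   ["alice"],
    "carol": ["alice", "dave"],
    "dave":  ["carol"],
}
print(sorted(pagerank(coauthors).items(), key=lambda kv: -kv[1]))
```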
APA, Harvard, Vancouver, ISO, and other styles
28

De, Vega Rodrigo Miguel. "Modeling future all-optical networks without buffering capabilities." Doctoral thesis, Universite Libre de Bruxelles, 2008. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/210455.

Full text
Abstract:
In this thesis we provide a model for bufferless optical burst switching (OBS) and optical packet switching (OPS) networks. The thesis is divided into three parts.

In the first part we introduce the basic functionality and structure of OBS and OPS networks. We identify the blocking probability as the main performance parameter of interest.

In the second part we study the statistical properties of the traffic that will likely run through these networks. For this purpose we use a set of traffic traces obtained from the Universitat Politècnica de Catalunya. Our conclusion is that traffic entering the optical domain in future OBS/OPS networks will be long-range dependent (LRD).

In the third part we present the model for bufferless OBS/OPS networks. The model incorporates the results from the second part concerning the LRD nature of traffic, as well as specific issues in the functionality of a typical bufferless packet-switching network. Since the resulting model presents scalability problems, we propose an approximate method to compute the blocking probability from it, and we empirically evaluate both its accuracy and its scalability.
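For context, the classical baseline for the blocking probability of a bufferless loss system is the Erlang-B formula, which assumes Poisson arrivals, precisely the assumption this thesis shows fails for LRD traffic. A sketch of that baseline (not the thesis's own model):

```python
def erlang_b(offered_load, servers):
    """Erlang-B blocking probability for a loss system with `servers`
    channels (e.g., wavelengths) carrying Poisson traffic of
    `offered_load` Erlangs, computed via the standard recurrence
    B(E, k) = E*B(E, k-1) / (k + E*B(E, k-1))."""
    b = 1.0  # B(E, 0) = 1: with no channels, every burst is blocked
    for k in range(1, servers + 1):
        b = offered_load * b / (k + offered_load * b)
    return b

# e.g., 32 wavelengths offered 24 Erlangs of load
print(f"{erlang_b(24.0, 32):.4%}")
```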
Doctorate in Engineering Sciences
info:eu-repo/semantics/nonPublished

APA, Harvard, Vancouver, ISO, and other styles
29

Fan, Yang, Hidehiko Masuhara, Tomoyuki Aotani, Flemming Nielson, and Hanne Riis Nielson. "AspectKE*: Security aspects with program analysis for distributed systems." Universität Potsdam, 2010. http://opus.kobv.de/ubp/volltexte/2010/4136/.

Full text
Abstract:
Enforcing security policies in distributed systems is difficult, in particular when a system contains untrusted components. We designed AspectKE*, a distributed AOP language based on a tuple space, to tackle this issue. In AspectKE*, aspects can enforce access-control policies that depend on the future behavior of running processes. One of the key language features is a set of predicates and functions that extract the results of static program analysis, which are useful for defining security aspects that must know about a program's future behavior. AspectKE* also provides a novel variable-binding mechanism for pointcuts, so that pointcuts can uniformly specify join points based on both static and dynamic information about the program. Our implementation strategy performs the fundamental static analysis at load time, so as to keep runtime overhead minimal. We implemented a compiler for AspectKE* and demonstrate its usefulness through a security aspect for a distributed chat system.
APA, Harvard, Vancouver, ISO, and other styles
30

Samoladas, Vasilis. "On indexing large databases for advanced data models." 2001. http://hdl.handle.net/2152/10823.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

"Redundancy on content-based indexing." 1997. http://library.cuhk.edu.hk/record=b5889125.

Full text
Abstract:
by Cheung King Lum Kingly.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1997.
Includes bibliographical references (leaves 108-110).
Abstract --- p.ii
Acknowledgement --- p.iii
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Motivation --- p.1
Chapter 1.2 --- Problems in Content-Based Indexing --- p.2
Chapter 1.3 --- Contributions --- p.3
Chapter 1.4 --- Thesis Organization --- p.4
Chapter 2 --- Content-Based Indexing Structures --- p.5
Chapter 2.1 --- R-Tree --- p.6
Chapter 2.2 --- R+-Tree --- p.8
Chapter 2.3 --- R*-Tree --- p.11
Chapter 3 --- Searching in Both R-Tree and R*-Tree --- p.15
Chapter 3.1 --- Exact Search --- p.15
Chapter 3.2 --- Nearest Neighbor Search --- p.19
Chapter 3.2.1 --- Definition of Searching Metrics --- p.19
Chapter 3.2.2 --- Pruning Heuristics --- p.21
Chapter 3.2.3 --- Nearest Neighbor Search Algorithm --- p.24
Chapter 3.2.4 --- Generalization to N-Nearest Neighbor Search --- p.25
Chapter 4 --- An Improved Nearest Neighbor Search Algorithm for R-Tree --- p.29
Chapter 4.1 --- Introduction --- p.29
Chapter 4.2 --- New Pruning Heuristics --- p.31
Chapter 4.3 --- An Improved Nearest Neighbor Search Algorithm --- p.34
Chapter 4.4 --- Replacing Heuristics --- p.36
Chapter 4.5 --- N-Nearest Neighbor Search --- p.41
Chapter 4.6 --- Performance Evaluation --- p.45
Chapter 5 --- Overlapping Nodes in R-Tree and R*-Tree --- p.53
Chapter 5.1 --- Overlapping Nodes --- p.54
Chapter 5.2 --- Problem Induced By Overlapping Nodes --- p.57
Chapter 5.2.1 --- Backtracking --- p.57
Chapter 5.2.2 --- Inefficient Exact Search --- p.57
Chapter 5.2.3 --- Inefficient Nearest Neighbor Search --- p.60
Chapter 6 --- Redundancy On R-Tree --- p.64
Chapter 6.1 --- Motivation --- p.64
Chapter 6.2 --- Adding Redundancy on Index Tree --- p.65
Chapter 6.3 --- R-Tree with Redundancy --- p.66
Chapter 6.3.1 --- Previous Models of R-Tree with Redundancy --- p.66
Chapter 6.3.2 --- Redundant R-Tree --- p.70
Chapter 6.3.3 --- Level List --- p.71
Chapter 6.3.4 --- Inserting Redundancy to R-Tree --- p.72
Chapter 6.3.5 --- Properties of Redundant R-Tree --- p.77
Chapter 7 --- Searching in Redundant R-Tree --- p.82
Chapter 7.1 --- Exact Search --- p.82
Chapter 7.2 --- Nearest Neighbor Search --- p.86
Chapter 7.3 --- Avoidance of Multiple Accesses --- p.89
Chapter 8 --- Experiment --- p.90
Chapter 8.1 --- Experimental Setup --- p.90
Chapter 8.2 --- Exact Search --- p.91
Chapter 8.2.1 --- Clustered Data --- p.91
Chapter 8.2.2 --- Real Data --- p.93
Chapter 8.3 --- Nearest Neighbor Search --- p.95
Chapter 8.3.1 --- Clustered Data --- p.95
Chapter 8.3.2 --- Uniform Data --- p.98
Chapter 8.3.3 --- Real Data --- p.100
Chapter 8.4 --- Discussion --- p.102
Chapter 9 --- Conclusions and Future Research --- p.105
Chapter 9.1 --- Conclusions --- p.105
Chapter 9.2 --- Future Research --- p.106
Bibliography --- p.108
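The outline above revolves around branch-and-bound nearest-neighbor search over R-trees. The classic baseline pruning rule, which discards any subtree whose minimum possible distance to the query (MINDIST) already exceeds the best candidate found, and which the thesis's chapters 4 and 7 set out to improve, can be sketched as follows; the in-memory node layout is a simplification of ours, not the thesis's structure:

```python
import heapq

def mindist(point, rect):
    """Minimum possible squared distance from a query point to any
    point inside an axis-aligned rectangle given as (lo, hi) per axis."""
    return sum(
        (lo - p) ** 2 if p < lo else (p - hi) ** 2 if p > hi else 0.0
        for p, (lo, hi) in zip(point, rect)
    )

def nearest(root, query):
    """Best-first branch-and-bound NN search. Nodes are dicts with a
    'rect' and either child 'entries' or leaf 'points' (a simplified,
    hypothetical layout)."""
    best_dist, best = float("inf"), None
    heap = [(mindist(query, root["rect"]), id(root), root)]
    while heap:
        d, _, node = heapq.heappop(heap)
        if d >= best_dist:        # prune: nothing in here can be closer;
            break                 # the heap is ordered, so stop entirely
        if "points" in node:      # leaf node
            for p in node["points"]:
                dist = sum((a - b) ** 2 for a, b in zip(p, query))
                if dist < best_dist:
                    best_dist, best = dist, p
        else:                     # internal node
            for child in node["entries"]:
                md = mindist(query, child["rect"])
                if md < best_dist:
                    heapq.heappush(heap, (md, id(child), child))
    return best
```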
APA, Harvard, Vancouver, ISO, and other styles
32

Wang, Chun-Jen, and 王俊仁. "Chinese Speech Information Retrieval--Data-Driven and Predefined Indexing Features, Different Retrieval Models and Improved Approaches." Thesis, 2002. http://ndltd.ncl.edu.tw/handle/24116093457135065144.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Sadoghi, Hamedani Mohammad. "An Efficient, Extensible, Hardware-aware Indexing Kernel." Thesis, 2013. http://hdl.handle.net/1807/65515.

Full text
Abstract:
Modern hardware has the potential to play a central role in scalable data management systems. A realization of this potential arises in the context of indexing queries, a recurring theme in real-time data analytics, targeted advertising, algorithmic trading, and data-centric workflows, and of indexing data, a challenge in multi-version analytical query processing. To enhance query and data indexing, in this thesis we present an efficient, extensible, and hardware-aware indexing kernel. This indexing kernel rests upon novel data structures and (parallel) algorithms that utilize the capabilities offered by modern hardware, especially the abundance of main memory, multi-core architectures, hardware accelerators, and solid state drives. This thesis focuses on presenting our query indexing techniques for processing queries in data-intensive applications that are subject to ever-increasing data volume and velocity. At the core of our query indexing kernel lies the BE-Tree family of memory-resident indexing structures, which scales by overcoming the curse of dimensionality through a novel two-phase space-cutting technique, effective top-k processing, and adaptive parallel algorithms that operate directly on compressed data (exploiting the multi-core architecture). Furthermore, we achieve line-rate processing by harnessing the unprecedented degrees of parallelism and pipelining only available through low-level logic design using FPGAs. Finally, we present a comprehensive evaluation that establishes the superiority of BE-Tree in comparison with state-of-the-art algorithms. In this thesis, we further expand the scope of our indexing kernel and describe how to accelerate analytical queries on (multi-version) databases by enabling indexes on the most recent data. Our goal is to reduce the overhead of index maintenance, so that indexes can be used effectively for analytical queries without being a heavy burden on transaction throughput. To achieve this end, we re-design the data structures in the storage hierarchy to employ an extra level of indirection over solid state drives. This indirection layer dramatically reduces the number of magnetic disk I/Os needed for updating indexes and localizes index maintenance. As a result, by rethinking how data is indexed, we eliminate the dilemma between update and query performance and substantially reduce index maintenance and query processing costs.
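The indirection idea in the closing sentences is concrete enough to sketch: indexes store stable logical record IDs, and only a small SSD-resident mapping changes when a record version moves, so no index needs rewriting. A minimal sketch under those assumptions (the names and layout are ours, not the thesis's design):

```python
class IndirectionLayer:
    """Sketch of index indirection: indexes map keys to stable logical
    IDs; an SSD-resident table maps logical IDs to physical locations.
    Moving a record rewrites one mapping entry instead of every index
    that references the record."""

    def __init__(self):
        self.index = {}     # key -> logical record ID (never rewritten on moves)
        self.mapping = {}   # logical ID -> physical page/slot (SSD-resident)
        self.next_lid = 0

    def insert(self, key, physical_location):
        lid = self.next_lid
        self.next_lid += 1
        self.index[key] = lid
        self.mapping[lid] = physical_location
        return lid

    def update_location(self, lid, new_physical_location):
        # A new record version lands elsewhere; only the mapping moves,
        # localizing index maintenance to this one table.
        self.mapping[lid] = new_physical_location

    def lookup(self, key):
        return self.mapping[self.index[key]]
```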
APA, Harvard, Vancouver, ISO, and other styles
34

Jiang, Hou-Sian, and 江侯弦. "The Study of Wireless Sensor Network Data Storage and Web Models for Presentation." Thesis, 2010. http://ndltd.ncl.edu.tw/handle/27803366827990914919.

Full text
Abstract:
Master's
National Taiwan Ocean University
Department of Systems Engineering and Naval Architecture
98
In this study, we aim to build a web-page model for WSN data presentation and to reduce query time over massive WSN data, improving operational efficiency. The system combines two parts: a data-storing program that supports a fast query mechanism, and a web-page model for presentation. Using Visual Basic .NET as the scripting language, we developed both the data-storing program and the web-page model. The data-storing program integrates database indexing and database partitioning, two performance-tuning techniques. It automatically analyzes and coordinates data from the remote end, stores the data in the database, and interacts with the database server regularly to maintain the data structure. In the web-page model, users can design a web page with simple functions and check the location and status of sensors in real time through an intuitive graphical interface and scalable visual graphics. Furthermore, the "Add sensor information" function can store a sensor's location information in any field covered by the wireless network. The model reduces cognitive differences between in-field workers and web-page developers, including paperwork mistakes; the system thereby realizes fast display-interface building.
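The abstract names two standard performance-tuning techniques, indexing and partitioning. A minimal illustration of both for time-stamped sensor readings, with SQLite standing in for whatever DBMS the thesis used, and with the table layout, column names and monthly partitioning scheme all assumed for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Partitioning simulated as one table per month: queries for a given
# period touch only the relevant table(s) instead of the full history.
for month in ("2010_01", "2010_02"):
    conn.execute(f"""
        CREATE TABLE readings_{month} (
            sensor_id INTEGER,
            ts        TEXT,    -- ISO-8601 timestamp
            value     REAL
        )""")
    # An index on (sensor_id, ts) lets per-sensor time-range queries
    # seek directly instead of scanning the whole partition.
    conn.execute(f"""
        CREATE INDEX idx_readings_{month}_sensor_ts
        ON readings_{month} (sensor_id, ts)""")

conn.execute("INSERT INTO readings_2010_01 VALUES (7, '2010-01-15T12:00:00', 21.5)")
rows = conn.execute("""
    SELECT ts, value FROM readings_2010_01
    WHERE sensor_id = 7 AND ts BETWEEN '2010-01-01' AND '2010-01-31'
""").fetchall()
print(rows)
```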
APA, Harvard, Vancouver, ISO, and other styles
35

"ACTION: automatic classification for Chinese documents." Chinese University of Hong Kong, 1994. http://library.cuhk.edu.hk/record=b5895378.

Full text
Abstract:
by Jacqueline Wai-ting Wong.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1994.
Includes bibliographical references (p. 107-109).
Abstract --- p.i
Acknowledgement --- p.iii
List of Tables --- p.viii
List of Figures --- p.ix
Chapter 1 --- Introduction --- p.1
Chapter 2 --- Chinese Information Processing --- p.6
Chapter 2.1 --- Chinese Word Segmentation --- p.7
Chapter 2.1.1 --- Statistical Method --- p.8
Chapter 2.1.2 --- Probabilistic Method --- p.9
Chapter 2.1.3 --- Linguistic Method --- p.10
Chapter 2.2 --- Automatic Indexing --- p.10
Chapter 2.2.1 --- Title Indexing --- p.11
Chapter 2.2.2 --- Free-Text Searching --- p.11
Chapter 2.2.3 --- Citation Indexing --- p.12
Chapter 2.3 --- Information Retrieval Systems --- p.13
Chapter 2.3.1 --- Users' Assessment of IRS --- p.13
Chapter 2.4 --- Concluding Remarks --- p.15
Chapter 3 --- Survey on Classification --- p.16
Chapter 3.1 --- Text Classification --- p.17
Chapter 3.2 --- Survey on Classification Schemes --- p.18
Chapter 3.2.1 --- Commonly Used Classification Systems --- p.18
Chapter 3.2.2 --- Classification of Newspapers --- p.31
Chapter 3.3 --- Concluding Remarks --- p.37
Chapter 4 --- System Models and the ACTION Algorithm --- p.38
Chapter 4.1 --- Factors Affecting Systems Performance --- p.38
Chapter 4.1.1 --- Specificity --- p.39
Chapter 4.1.2 --- Exhaustivity --- p.40
Chapter 4.2 --- Assumptions and Scope --- p.42
Chapter 4.2.1 --- Assumptions --- p.42
Chapter 4.2.2 --- System Scope -- Data Flow Diagrams --- p.44
Chapter 4.3 --- System Models --- p.48
Chapter 4.3.1 --- Article --- p.48
Chapter 4.3.2 --- Matching Table --- p.49
Chapter 4.3.3 --- Forest --- p.51
Chapter 4.3.4 --- Matching --- p.53
Chapter 4.4 --- Classification Rules --- p.54
Chapter 4.5 --- The ACTION Algorithm --- p.56
Chapter 4.5.1 --- Algorithm Design Objectives --- p.56
Chapter 4.5.2 --- Measuring Node Significance --- p.56
Chapter 4.5.3 --- Pseudocodes --- p.61
Chapter 4.6 --- Concluding Remarks --- p.64
Chapter 5 --- Analysis of Results and Validation --- p.66
Chapter 5.1 --- Seeking for Exhaustivity Rather Than Specificity --- p.67
Chapter 5.1.1 --- The News Article --- p.67
Chapter 5.1.2 --- The Matching Results --- p.68
Chapter 5.1.3 --- The Keyword Values --- p.68
Chapter 5.1.4 --- Analysis of Classification Results --- p.71
Chapter 5.2 --- Catering for Hierarchical Relationships Between Classes and Subclasses --- p.72
Chapter 5.2.1 --- The News Article --- p.72
Chapter 5.2.2 --- The Matching Results --- p.73
Chapter 5.2.3 --- The Keyword Values --- p.74
Chapter 5.2.4 --- Analysis of Classification Results --- p.75
Chapter 5.3 --- A Representative With Zero Occurrence --- p.78
Chapter 5.3.1 --- The News Article --- p.78
Chapter 5.3.2 --- The Matching Results --- p.79
Chapter 5.3.3 --- The Keyword Values --- p.80
Chapter 5.3.4 --- Analysis of Classification Results --- p.81
Chapter 5.4 --- Statistical Analysis --- p.83
Chapter 5.4.1 --- Classification Results with Highest Occurrence Frequency --- p.83
Chapter 5.4.2 --- Classification Results with Zero Occurrence Frequency --- p.85
Chapter 5.4.3 --- Distribution of Classification Results on Level Numbers --- p.86
Chapter 5.5 --- Concluding Remarks --- p.87
Chapter 5.5.1 --- Advantageous Characteristics of ACTION --- p.88
Chapter 6 --- Conclusion --- p.93
Chapter 6.1 --- Perspectives in Document Representation --- p.93
Chapter 6.2 --- Classification Schemes --- p.95
Chapter 6.3 --- Classification System Model --- p.95
Chapter 6.4 --- The ACTION Algorithm --- p.96
Chapter 6.5 --- Advantageous Characteristics of the ACTION Algorithm --- p.96
Chapter 6.6 --- Testing and Validating the ACTION algorithm --- p.98
Chapter 6.7 --- Future Work --- p.99
Chapter 6.8 --- A Final Remark --- p.100
Chapter A --- System Models --- p.102
Chapter B --- Classification Rules --- p.104
Chapter C --- Node Significance Definitions --- p.105
References --- p.107
APA, Harvard, Vancouver, ISO, and other styles
36

Du, Lan. "Non-parametric bayesian methods for structured topic models." Phd thesis, 2011. http://hdl.handle.net/1885/149800.

Full text
Abstract:
The proliferation of large electronic document archives requires new techniques for automatically analysing large collections, which has posed several new and interesting research challenges. Topic modelling, as a promising statistical technique, has gained significant momentum in recent years in information retrieval, sentiment analysis, image processing, etc. Beyond existing topic models, the field still needs to be explored with more powerful tools. One potentially useful direction is to directly consider document structure, ranging from semantically high-level segments (e.g., chapters, sections, or paragraphs) to low-level segments (e.g., sentences or words), in topic modelling. This thesis introduces a family of structured topic models for statistically modelling text documents together with their intrinsic document structures. These models take advantage of non-parametric Bayesian techniques (e.g., the two-parameter Poisson-Dirichlet process (PDP)) and Markov chain Monte Carlo methods. Two preliminary contributions of this thesis are: (1) the Compound Poisson-Dirichlet process (CPDP), an extension of the PDP that can be applied to multiple input distributions; and (2) two Gibbs sampling algorithms for the PDP in a finite state space, both based on the Chinese restaurant process, which provides an elegant analogy of incremental sampling for the PDP. The first, a two-stage Gibbs sampler, arises from a table multiplicity representation for the PDP; the second is built on top of a table indicator representation. In a simple controlled environment of multinomial sampling, the two new samplers converge quickly. These support the major contribution of this thesis, a set of structured topic models: the Segmented Topic Model (STM), which models a simple document structure with a four-level hierarchy by mapping the document layout to a hierarchical subject structure, and which performs significantly better than the latent Dirichlet allocation model and other segmented models at predicting unseen words; Sequential Latent Dirichlet Allocation (SeqLDA), which is motivated by topical correlations among adjacent segments (i.e., the sequential document structure), uses the PDP and a simple first-order Markov chain to link a set of LDAs together, and provides a novel approach for exploring topic evolution within each individual document; and the Adaptive Topic Model (AdaTM), which embeds the CPDP in a simple directed acyclic graph to jointly model both hierarchical and sequential document structures, and which demonstrates, in terms of per-word predictive accuracy and topic distribution profile analysis, that it is beneficial to consider both forms of structure in topic modelling. -- provided by candidate.
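The two-parameter Poisson-Dirichlet process has a well-known Chinese-restaurant sampling view that underlies the samplers mentioned above: with discount a, concentration b, K occupied tables and c_k customers at table k, the next customer joins table k with probability proportional to (c_k - a) and a new table with probability proportional to (b + aK). A minimal simulation of that seating process (an illustration of the prior, not the thesis's posterior samplers):

```python
import random

def pitman_yor_crp(n_customers, discount, concentration, seed=0):
    """Simulate table seating under the two-parameter Poisson-Dirichlet
    ("Pitman-Yor") Chinese restaurant process; returns table occupancies.
    Requires 0 <= discount < 1 and concentration > -discount."""
    rng = random.Random(seed)
    tables = []  # tables[k] = number of customers seated at table k
    for n in range(n_customers):
        # Occupied table k attracts weight (c_k - discount); a new table
        # attracts (concentration + discount * K); total = n + concentration.
        weights = [c - discount for c in tables]
        weights.append(concentration + discount * len(tables))
        r = rng.random() * (n + concentration)
        k = 0
        for k, w in enumerate(weights):
            r -= w
            if r < 0:
                break
        if k == len(tables):
            tables.append(1)   # open a new table (a new topic/atom)
        else:
            tables[k] += 1
    return sorted(tables, reverse=True)

print(pitman_yor_crp(1000, discount=0.5, concentration=1.0))
```

The power-law profile of the resulting occupancies (a few large tables, many small ones) is exactly the behaviour that makes the PDP attractive for modelling word and topic frequencies.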
APA, Harvard, Vancouver, ISO, and other styles
37

"Unsupervised extraction and normalization of product attributes from web pages." 2010. http://library.cuhk.edu.hk/record=b5894490.

Full text
Abstract:
Xiong, Jiani.
"July 2010."
Thesis (M.Phil.)--Chinese University of Hong Kong, 2010.
Includes bibliographical references (p. 59-63).
Abstracts in English and Chinese.
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Background --- p.1
Chapter 1.2 --- Motivation --- p.4
Chapter 1.3 --- Our Approach --- p.8
Chapter 1.4 --- Potential Applications --- p.12
Chapter 1.5 --- Research Contributions --- p.13
Chapter 1.6 --- Thesis Organization --- p.15
Chapter 2 --- Literature Survey --- p.16
Chapter 2.1 --- Supervised Extraction Approaches --- p.16
Chapter 2.2 --- Unsupervised Extraction Approaches --- p.19
Chapter 2.3 --- Attribute Normalization --- p.21
Chapter 2.4 --- Integrated Approaches --- p.22
Chapter 3 --- Problem Definition and Preliminaries --- p.24
Chapter 3.1 --- Problem Definition --- p.24
Chapter 3.2 --- Preliminaries --- p.27
Chapter 3.2.1 --- Web Pre-processing --- p.27
Chapter 3.2.2 --- Overview of Our Framework --- p.31
Chapter 3.2.3 --- Background of Graphical Models --- p.32
Chapter 4 --- Our Proposed Framework --- p.36
Chapter 4.1 --- Our Proposed Graphical Model --- p.36
Chapter 4.2 --- Inference --- p.41
Chapter 4.3 --- Product Attribute Information Determination --- p.47
Chapter 5 --- Experiments and Results --- p.49
Chapter 6 --- Conclusion --- p.57
Bibliography --- p.59
Chapter A --- Dirichlet Process --- p.64
Chapter B --- Hidden Markov Models --- p.68
APA, Harvard, Vancouver, ISO, and other styles
38

"Parameter free document stream classification." Thesis, 2006. http://library.cuhk.edu.hk/record=b6074286.

Full text
Abstract:
In this century of information overload, information becomes ever more pervasive, and a new class of data-intensive applications has arisen in which data is best modeled as an open-ended stream. We call such data a data stream. A document stream is a variation of a data stream consisting of a sequence of chronologically ordered documents. A fundamental problem in mining document streams is to extract meaningful structure from them, so as to organize their contents systematically. This dissertation focuses on that problem. Specifically, it studies two sub-problems: identifying the bursty topics in a document stream, and constructing classifiers for those bursty topics. A bursty topic is a topic in the document stream to which a large number of documents relate during a bounded time interval.
In this dissertation, two heuristics, PFreeBT and PNLH, are proposed to tackle these problems. PFreeBT aims at identifying the bursty topics in a document stream, whereas PNLH aims at constructing a reliable classifier for a given bursty topic. It is worth noting that both heuristics are parameter free: users do not need to provide any parameter explicitly, as all required variables can be computed automatically from the given document stream.
For the problem of bursty topic identification, PFreeBT adopts what we term a feature-pivot clustering approach. Given a document stream, PFreeBT first identifies a set of bursty features, based on computing probability distributions. From the patterns of the bursty features and two newly defined concepts (equivalent and map-to), a set of bursty topics can be extracted.
For the problem of constructing a reliable classifier, we formulate it as a partially supervised classification problem, in which only a few training examples are labeled positive (P) while all other training examples (U) remain unlabeled. Here, U mixes negative examples (N) with some additional positive examples (P'). Existing techniques that tackle this problem all focus on finding N from U; none attempts to extract P' from U, which is difficult because the topics in U are diverse and its features sparse. In this dissertation, PNLH is proposed for extracting high-quality P' and N from U.
Extensive experiments evaluate the effectiveness of PFreeBT and PNLH using a two-year stream of news stories and three benchmarks. The results show that the patterns of the bursty features and the bursty topics identified by PFreeBT match our expectations, while PNLH demonstrates significant improvements over all existing heuristics. These favorable results indicate that both PFreeBT and PNLH are highly effective and feasible.
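The abstract says bursty features are identified by "computing probability distributions" without fixing the test. One simple formulation, ours rather than necessarily PFreeBT's, flags a feature as bursty in a time window when its document frequency is improbably high under a binomial model of its global rate:

```python
from math import lgamma, log, exp

def binom_sf(k, n, p):
    """P[X >= k] for X ~ Binomial(n, p), summed in log space for stability."""
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log(p) + (n - i) * log(1 - p))
        total += exp(log_pmf)
    return total

def bursty_features(windows, alpha=1e-4):
    """windows: list of (docs_in_window, {feature: doc_frequency}).
    Flags (feature, window) pairs whose frequency is improbably high
    under the feature's global rate; a simple illustrative test only."""
    total_docs = sum(n for n, _ in windows)
    global_freq = {}
    for _, freqs in windows:
        for f, c in freqs.items():
            global_freq[f] = global_freq.get(f, 0) + c
    bursts = []
    for w, (n, freqs) in enumerate(windows):
        for f, c in freqs.items():
            p = global_freq[f] / total_docs
            if p < 1.0 and binom_sf(c, n, p) < alpha:
                bursts.append((f, w))
    return bursts

# Toy stream: "election" spikes in window 2 of three 100-document windows.
windows = [
    (100, {"election": 2, "market": 10}),
    (100, {"election": 1, "market": 12}),
    (100, {"election": 30, "market": 11}),
]
print(bursty_features(windows))   # -> [('election', 2)]
```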
Fung Pui Cheong Gabriel.
"August 2006."
Adviser: Jeffrey Xu Yu.
Source: Dissertation Abstracts International, Volume: 68-03, Section: B, page: 1720.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2006.
Includes bibliographical references (p. 122-130).
Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Abstracts in English and Chinese.
School code: 1307.
APA, Harvard, Vancouver, ISO, and other styles
39

"Data organization for routing on the multi-modal public transportation system: a GIS-T prototype of Hong Kong Island." 2001. http://library.cuhk.edu.hk/record=b5890808.

Full text
Abstract:
Yu Hongbo.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2001.
Includes bibliographical references (leaves 130-138).
Abstracts in English and Chinese.
ABSTRACT IN ENGLISH --- p.i-ii
ABSTRACT IN CHINESE --- p.iii
ACKNOWLEDGEMENTS --- p.iv-v
TABLE OF CONTENTS --- p.vi-viii
LIST OF TABLES --- p.ix
LIST OF FIGURES --- p.x-xi
Chapter CHAPTER I --- INTRODUCTION
Chapter 1.1 --- Problem Statement --- p.1
Chapter 1.2 --- Research Purpose --- p.5
Chapter 1.3 --- Significance --- p.7
Chapter 1.4 --- Methodology --- p.8
Chapter 1.5 --- Outline of the Thesis --- p.9
Chapter CHAPTER II --- LITERATURE REVIEW
Chapter 2.1 --- Introduction --- p.12
Chapter 2.2 --- Origin of GIS --- p.12
Chapter 2.3 --- Development of GIS-T --- p.15
Chapter 2.4 --- Capabilities of GIS-T --- p.18
Chapter 2.5 --- Structure of a GIS-T --- p.19
Chapter 2.5.1 --- Data Models for GIS-T --- p.19
Chapter 2.5.2 --- Relational DBMS and Dueker-Butler's Data Model for Transportation --- p.22
Chapter 2.5.3 --- Objected-oriented Approach --- p.25
Chapter 2.6 --- Main Techniques of GIS-T --- p.26
Chapter 2.6.1 --- Linear Location Reference System --- p.26
Chapter 2.6.2 --- Dynamic Segmentation --- p.27
Chapter 2.6.3 --- Planar and Non-planar Networks --- p.28
Chapter 2.6.4 --- Turn-table --- p.28
Chapter 2.7 --- Algorithms for Finding Shortest Paths on a Network --- p.29
Chapter 2.7.1 --- Overview of Routing Algorithms --- p.29
Chapter 2.7.2 --- Dijkstra's Algorithm --- p.31
Chapter 2.7.3 --- Routing Models for the Multi-modal Network --- p.32
Chapter 2.8 --- Recent Researches on GIS Data Models for the Multi-modal Transportation System --- p.33
Chapter 2.9 --- Main Software Packages for GIS-T --- p.36
Chapter 2.10 --- Summary --- p.37
Chapter CHAPTER III --- MODELING THE MULTI-MODAL PUBLIC TRANSPORTATION SYSTEM
Chapter 3.1 --- Introduction --- p.40
Chapter 3.2 --- Elaborated Stages and Methods for GIS Modeling --- p.40
Chapter 3.3 --- Application Domain: The Multi-modal Public Transportation System --- p.43
Chapter 3.3.1 --- Definition of a Multi-modal Public Transportation System --- p.43
Chapter 3.3.2 --- Descriptions of the Multi-modal Public transportation System --- p.44
Chapter 3.3.3 --- Objective of the Modeling Work --- p.46
Chapter 3.4 --- A Layer-cake Based Application Domain Model for the Multi- modal Public Transportation System --- p.46
Chapter 3.5 --- A Conceptual Model for the Multi-modal Public Transportation System --- p.49
Chapter 3.6 --- Logical and Physical Implementation of the Data Model for the Multi-modal Public Transportation System --- p.54
Chapter 3.7 --- Criteria for Routing on the Multi-modal Public Transportation System --- p.57
Chapter 3.7.1 --- Least-time Routing --- p.58
Chapter 3.7.2 --- Least-fare Routing --- p.60
Chapter 3.7.3 --- Least-transfer Routing --- p.60
Chapter 3.8 --- Summary --- p.61
Chapter CHAPTER IV --- DATA PREPARATION FOR THE STUDY AREA
Chapter 4.1 --- Introduction --- p.53
Chapter 4.2 --- The Study Area: Hong Kong Island --- p.63
Chapter 4.2.1 --- General Information of the Transportation System on Hong Kong Island --- p.63
Chapter 4.2.2 --- Reasons for Choosing Hong Kong Island as the Study Area --- p.66
Chapter 4.2.3 --- Mass Transit Routes Selected for the Prototype --- p.67
Chapter 4.3 --- Data Source and Data Collection --- p.67
Chapter 4.4 --- Geographical Data Preparation --- p.71
Chapter 4.4.1 --- Data Conversion --- p.73
Chapter 4.4.2 --- Geographical Data Input --- p.79
Chapter 4.5 --- Attribute Data Input --- p.86
Chapter 4.6 --- Summary --- p.88
Chapter CHAPTER V --- IMPLEMENTATION OF THE PROTOTYPE
Chapter 5.1 --- Introduction --- p.89
Chapter 5.2 --- Construction of the Route Service Network --- p.89
Chapter 5.2.1 --- Generation of the Geographical Network --- p.90
Chapter 5.2.2 --- Setting Attribute Data for the Route Service Network --- p.95
Chapter 5.3 --- A GIS-T Prototype for the Study Area --- p.102
Chapter 5.4 --- General GIS Functions of the Prototype --- p.104
Chapter 5.4.1 --- Information Retrieve --- p.104
Chapter 5.4.2 --- Display --- p.105
Chapter 5.4.3 --- Data Query --- p.105
Chapter 5.5 --- Routing in the Prototype --- p.105
Chapter 5.5.1 --- Routing Procedure --- p.108
Chapter 5.5.2 --- Examples and Results --- p.110
Chapter 5.5.3 --- Comparison and Analysis --- p.113
Chapter 5.6 --- Summary --- p.118
Chapter CHAPTER VI --- CONCLUSION
Chapter 6.1 --- Research Findings --- p.123
Chapter 6.2 --- Research Limitations --- p.126
Chapter 6.3 --- Direction of Further Studies --- p.128
BIBLIOGRAPHY --- p.130
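The outline builds toward shortest-path routing (Dijkstra's algorithm, section 2.7.2) on a multi-modal network under least-time, least-fare and least-transfer criteria. A common formulation adds a penalty whenever the travel mode changes; the sketch below shows that idea on an invented toy network, with costs and the penalty value chosen arbitrarily for illustration:

```python
import heapq
from itertools import count

def multimodal_dijkstra(graph, source, target, transfer_penalty=5.0):
    """Least-cost route on a multi-modal network. graph[u] is a list of
    (v, cost, mode) edges; changing mode at a node incurs a penalty.
    The search state is (node, arriving mode) so transfers are costed."""
    tie = count()   # tie-breaker so the heap never compares node payloads
    heap = [(0.0, next(tie), source, None)]
    settled = {}
    while heap:
        cost, _, node, mode = heapq.heappop(heap)
        if node == target:
            return cost
        if settled.get((node, mode), float("inf")) <= cost:
            continue
        settled[(node, mode)] = cost
        for nxt, edge_cost, edge_mode in graph.get(node, []):
            step = cost + edge_cost
            if mode is not None and edge_mode != mode:
                step += transfer_penalty   # e.g. walking/waiting time
            heapq.heappush(heap, (step, next(tie), nxt, edge_mode))
    return float("inf")

# Invented toy network, loosely styled on Hong Kong Island stops:
graph = {
    "Central":   [("Admiralty", 3, "MTR"), ("Admiralty", 6, "tram")],
    "Admiralty": [("Causeway Bay", 4, "MTR"), ("Causeway Bay", 9, "tram")],
}
print(multimodal_dijkstra(graph, "Central", "Causeway Bay"))   # -> 7.0
```

Varying the edge-cost field (travel time versus fare) or raising the penalty recovers the least-time, least-fare and least-transfer criteria discussed in chapter 3.7.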
APA, Harvard, Vancouver, ISO, and other styles
