Theses on the topic "Storage and indexing"

To see other types of publications on this topic, follow the link: Storage and indexing.

Create an accurate citation in APA, MLA, Chicago, Harvard, and other styles

Choose a source:

Consult the 39 best theses for your research on the topic "Storage and indexing".

Next to every source in the list of references there is an "Add to bibliography" button. Click on it, and we will automatically generate the bibliographic reference for the chosen source in the citation style you prefer: APA, MLA, Harvard, Vancouver, Chicago, etc.

You can also download the full text of the scholarly publication as a PDF and read its abstract online whenever these details are included in the metadata.

Browse theses on a wide variety of disciplines and organize your bibliography correctly.

1

Munishwar, Vikram P. "Storage and indexing issues in sensor networks". Diss., Online access via UMI:, 2006.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
2

Schmidt, Karsten [Verfasser]. "Self-Tuning Storage and Indexing for Native XML DBMSs / Karsten Schmidt". München : Verlag Dr. Hut, 2011. http://d-nb.info/1018981071/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Mick, Alan A. "Knowledge based text indexing and retrieval utilizing case based reasoning /". Online version of thesis, 1994. http://hdl.handle.net/1850/11715.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Habtu, Simon. "Indexing file metadata using a distributed search engine for searching files on a public cloud storage". Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-232064.

Full text
Abstract:
Visma Labs AB, or Visma, wanted to conduct experiments to see whether file metadata could be indexed for searching files on a public cloud storage. Given that storing files in a public cloud storage is cheaper than the current storage solution, the implementation could save Visma money otherwise spent on expensive storage. The aim of this thesis is therefore to find and evaluate an approach for indexing file metadata and searching files on a public cloud storage with the chosen distributed search engine, Elasticsearch. The architecture of the proposed solution resembles a file service and was implemented using several containerized services. The results show that the file service solution is indeed feasible, but that it would need further tuning and more resources to function according to Visma's demands.
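The indexing approach evaluated above is easy to picture with Elasticsearch's official Python client. This is a minimal sketch rather than Visma's implementation: the index name, metadata fields, and local endpoint are all invented for illustration, and a reachable Elasticsearch node is assumed.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local test node

    # Index one file's metadata; the document id can double as the storage key.
    es.index(index="file-metadata", id="reports/q3.pdf", document={
        "name": "q3.pdf",
        "path": "reports/q3.pdf",
        "owner": "alice",
        "size_bytes": 48213,
        "modified": "2018-05-02T10:15:00",
    })

    # Search by name, filtered by owner; only metadata is queried, never content.
    resp = es.search(index="file-metadata", query={
        "bool": {
            "must": [{"match": {"name": "q3"}}],
            "filter": [{"term": {"owner": "alice"}}],
        }
    })
    for hit in resp["hits"]["hits"]:
        print(hit["_id"], hit["_score"])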
APA, Harvard, Vancouver, ISO, and other styles
5

Teng, Shyh Wei 1973. "Image indexing and retrieval based on vector quantization". Monash University, Gippsland School of Computing and Information Technology, 2003. http://arrow.monash.edu.au/hdl/1959.1/5764.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Yapp, Lawrence. "Content-based indexing of MPEG video through the analysis of the accompanying audio /". Thesis, Connect to this title online; UW restricted, 1997. http://hdl.handle.net/1773/5835.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Tekli, Joe, Richard Chbeir, Agma J. M. Traina, Caetano Traina, Kokou Yetongnon, Carlos Raymundo Ibanez, Assad Marc Al and Christian Kallas. "Full-fledged semantic indexing and querying model designed for seamless integration in legacy RDBMS". Elsevier B.V., 2018. http://hdl.handle.net/10757/624626.

Full text
Abstract:
The full text of this work is not available in the UPC Academic Repository due to restrictions imposed by the publisher where it was published.
In the past decade, there has been an increasing need for semantic-aware data search and indexing in textual (structured and NoSQL) databases, as full-text search systems became available to non-experts. Users often have no knowledge about the data being searched and formulate query keywords different from those used by the authors when indexing relevant documents, thus producing noisy and sometimes irrelevant results. In this paper, we address the problem of semantic-aware querying and provide a general framework for modeling and processing semantic-based keyword queries in textual databases, i.e., considering the lexical and semantic similarities/disparities when matching user query and data index terms. To do so, we design and construct a semantic-aware inverted index structure called SemIndex, extending the standard inverted index by constructing a tightly coupled inverted index graph that combines two main resources: a semantic network and a standard inverted index on a collection of textual data. We then provide a general keyword query model with specially tailored query processing algorithms built on top of SemIndex, in order to produce semantic-aware results, allowing the user to choose the results' semantic coverage and expressiveness based on her needs. To investigate the practicality and effectiveness of SemIndex, we discuss its physical design within a standard commercial RDBMS, allowing us to create, store, and query its graph structure, thus enabling the system to scale up easily and handle large volumes of data. We have conducted a battery of experiments to test the performance of SemIndex, evaluating its construction time, storage size, query processing time, and result quality, in comparison with a legacy inverted index. The results highlight both the effectiveness and the scalability of our approach.
This study is partly funded by the National Council for Scientific Research - Lebanon (CNRS-L), by the Lebanese American University (LAU), and by the Research Support Foundation of the State of Sao Paulo (FAPESP). Appendix: SemIndex Weighting Scheme. We propose a set of weighting functions to assign weight scores to SemIndex entries, including index nodes, index edges, data nodes, and data edges. The weighting functions are used to select and rank semantically relevant results w.r.t. the user's query (cf. SemIndex query processing in Section 5). Other weight functions could later be added to cater to the index designer's needs.
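The coupling of a semantic network with an inverted index that the abstract describes can be conveyed with a deliberately tiny sketch. This shows only the intuition behind SemIndex, not its actual graph structure; the three documents and the synonym map standing in for the semantic network are invented.

    from collections import defaultdict

    docs = {
        1: "car engine repair",
        2: "automobile maintenance guide",
        3: "train schedule",
    }
    synonyms = {"car": {"automobile"}, "repair": {"maintenance"}}  # mini semantic network

    # Standard inverted index: term -> set of document ids.
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            inverted[term].add(doc_id)

    def semantic_search(keyword):
        # Expand the query term through the semantic network before lookup.
        terms = {keyword} | synonyms.get(keyword, set())
        return set().union(*(inverted[t] for t in terms))

    print(semantic_search("car"))  # {1, 2}: document 2 is reached via "automobile"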
Peer reviewed
APA, Harvard, Vancouver, ISO, and other styles
8

Liu, Hain-Ching. "Automatic scene detection in MPEG digital video for random access indexing and MPEG compression optimization /". Thesis, Connect to this title online; UW restricted, 1995. http://hdl.handle.net/1773/6001.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Vasaitis, Vasileios. "Novel storage architectures and pointer-free search trees for database systems". Thesis, University of Edinburgh, 2012. http://hdl.handle.net/1842/6240.

Full text
Abstract:
Database systems research is an old and well-established field in computer science. Many of the key concepts appeared as early as the 60s, while the core of relational databases, which have dominated the database world for a while now, was solidified during the 80s. However, the underlying hardware has not displayed such stability in the same period, which means that a lot of assumptions that were made about the hardware by early database systems are not necessarily true for modern computer architectures. In particular, over the last few decades there have been two notable consistent trends in the evolution of computer hardware. The first is that the memory hierarchy of mainstream computer systems has been getting deeper, with its different levels moving away from each other, and new levels being added in between as a result, in particular cache memories. The second is that, when it comes to data transfers between any two adjacent levels of the memory hierarchy, access latencies have not been keeping up with transfer rates. The challenge is therefore to adapt database index structures so that they become immune to these two trends. The latter is addressed by gradually increasing the size of the data transfer unit; the former, by organizing the data so that it exhibits good locality for memory transfers across multiple memory boundaries. We have developed novel structures that facilitate both of these strategies. We started our investigation with the venerable B+-tree, which is the cornerstone order-preserving index of any database system, and we have developed a novel pointer-free tree structure for its pages that optimizes its cache performance and makes it immune to the page size. We then adapted our approach to the R-tree and the GiST, making it applicable to multi-dimensional data indexes as well as generalized indexes for any abstract data type. Finally, we have investigated our structure in the context of main memory alone, and have demonstrated its superiority over the established approaches in that setting too. While our research has its roots in data structures and algorithms theory, we have conducted it with a strong experimental focus, as the complex interactions within the memory hierarchy of a modern computer system can be quite challenging to model and theorize about effectively. Our findings are therefore backed by solid experimental results that verify our hypotheses and prove the superiority of our structures over competing approaches.
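The thesis's page-internal layout is its own contribution, but the general idea of a pointer-free search tree can be shown with a textbook implicit layout: keys are stored in an array in BFS (Eytzinger) order, and a node's children are located by index arithmetic instead of stored pointers, which also tends to improve cache behaviour. A minimal sketch with invented data:

    def eytzinger(sorted_keys):
        # Slot 0 is unused; the children of slot k live at 2k and 2k + 1.
        out = [None] * (len(sorted_keys) + 1)
        it = iter(sorted_keys)

        def fill(k):
            if k < len(out):
                fill(2 * k)
                out[k] = next(it)
                fill(2 * k + 1)

        fill(1)
        return out

    def search(tree, key):
        # Descend purely by index arithmetic; no child pointers are stored.
        k = 1
        while k < len(tree):
            if tree[k] == key:
                return k
            k = 2 * k + (key > tree[k])
        return -1

    tree = eytzinger(list(range(1, 16)))
    print(search(tree, 11) != -1, search(tree, 99) != -1)  # True False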
APA, Harvard, Vancouver, ISO, and other styles
10

Paul, Arnab Kumar. "An Application-Attuned Framework for Optimizing HPC Storage Systems". Diss., Virginia Tech, 2020. http://hdl.handle.net/10919/99793.

Full text
Abstract:
High performance computing (HPC) is routinely employed in diverse domains such as the life sciences and geology to simulate and understand the behavior of complex phenomena. Big-data-driven scientific simulations are resource intensive and require both computing and I/O capabilities at scale. There is a crucial need to revisit the HPC I/O subsystem to better optimize for, and manage, the increased pressure on the underlying storage systems from big data processing. Extant HPC storage systems are designed and tuned for a specific set of applications targeting a range of workload characteristics, but they lack the flexibility to adapt to ever-changing application behaviors. The complex nature of modern HPC storage systems, along with these ever-changing application behaviors, presents unique opportunities and engineering challenges. In this dissertation, we design and develop a framework for optimizing HPC storage systems by making them application-attuned. We select three different kinds of HPC storage systems: in-memory data analytics frameworks, parallel file systems, and object storage. We first analyze HPC application I/O behavior by studying real-world I/O traces. Next we optimize parallelism for applications running in memory, then we design data management techniques for HPC storage systems, and finally we focus on low-level I/O load balance for improving the efficiency of modern HPC storage systems.
Doctor of Philosophy
Clusters of multiple computers connected through the Internet are often deployed in industry and laboratories for large-scale data processing or computation that cannot be handled by standalone computers. In such a cluster, resources such as CPUs, memory, and disks are integrated to work together. With the increasing popularity of applications that read and write tremendous amounts of data, we need a large number of disks that can interact effectively in such clusters. These form part of high performance computing (HPC) storage systems. Such HPC storage systems are used by a diverse set of applications coming from organizations in a vast range of domains, from earth sciences, financial services, and telecommunications to the life sciences. The HPC storage system should therefore be efficient and perform well under the different read and write (I/O) requirements of all these different sets of applications. But current HPC storage systems do not cater to such varied I/O requirements. To this end, this dissertation designs and develops a framework for HPC storage systems that is application-attuned and thus provides much better performance than state-of-the-art HPC storage systems without such optimizations.
APA, Harvard, Vancouver, ISO, and other styles
11

Oukid, Ismail. "Architectural Principles for Database Systems on Storage-Class Memory". Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2018. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-232482.

Full text
Abstract:
Database systems have long been optimized to hide the higher latency of storage media, yielding complex persistence mechanisms. With the advent of large DRAM capacities, it became possible to keep a full copy of the data in DRAM. Systems that leverage this possibility, such as main-memory databases, keep two copies of the data in two different formats: one in main memory and the other one in storage. The two copies are kept synchronized using snapshotting and logging. This main-memory-centric architecture yields nearly two orders of magnitude faster analytical processing than traditional, disk-centric ones. The rise of Big Data emphasized the importance of such systems with an ever-increasing need for more main memory. However, DRAM is hitting its scalability limits: It is intrinsically hard to further increase its density. Storage-Class Memory (SCM) is a group of novel memory technologies that promise to alleviate DRAM’s scalability limits. They combine the non-volatility, density, and economic characteristics of storage media with the byte-addressability and a latency close to that of DRAM. Therefore, SCM can serve as persistent main memory, thereby bridging the gap between main memory and storage. In this dissertation, we explore the impact of SCM as persistent main memory on database systems. Assuming a hybrid SCM-DRAM hardware architecture, we propose a novel software architecture for database systems that places primary data in SCM and directly operates on it, eliminating the need for explicit IO. This architecture yields many benefits: First, it obviates the need to reload data from storage to main memory during recovery, as data is discovered and accessed directly in SCM. Second, it allows replacing the traditional logging infrastructure by fine-grained, cheap micro-logging at data-structure level. Third, secondary data can be stored in DRAM and reconstructed during recovery. Fourth, system runtime information can be stored in SCM to improve recovery time. Finally, the system may retain and continue in-flight transactions in case of system failures. However, SCM is no panacea as it raises unprecedented programming challenges. Given its byte-addressability and low latency, processors can access, read, modify, and persist data in SCM using load/store instructions at a CPU cache line granularity. The path from CPU registers to SCM is long and mostly volatile, including store buffers and CPU caches, leaving the programmer with little control over when data is persisted. Therefore, there is a need to enforce the order and durability of SCM writes using persistence primitives, such as cache line flushing instructions. This in turn creates new failure scenarios, such as missing or misplaced persistence primitives. We devise several building blocks to overcome these challenges. First, we identify the programming challenges of SCM and present a sound programming model that solves them. Then, we tackle memory management, as the first required building block to build a database system, by designing a highly scalable SCM allocator, named PAllocator, that fulfills the versatile needs of database systems. Thereafter, we propose the FPTree, a highly scalable hybrid SCM-DRAM persistent B+-Tree that bridges the gap between the performance of transient and persistent B+-Trees. Using these building blocks, we realize our envisioned database architecture in SOFORT, a hybrid SCM-DRAM columnar transactional engine. 
We propose an SCM-optimized MVCC scheme that eliminates write-ahead logging from the critical path of transactions. Since SCM-resident data is near-instantly available upon recovery, the new recovery bottleneck is rebuilding DRAM-based data. To alleviate this bottleneck, we propose a novel recovery technique that achieves nearly instant responsiveness of the database by accepting queries right after recovering SCM-based data, while rebuilding DRAM-based data in the background. Additionally, SCM brings new failure scenarios that existing testing tools cannot detect. Hence, we propose an online testing framework that is able to automatically simulate power failures and detect missing or misplaced persistence primitives. Finally, our proposed building blocks can serve to build more complex systems, paving the way for future database systems on SCM.
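One building block above deserves a concrete picture: the published FPTree design keeps a one-byte "fingerprint" per key in its persistent leaves, so full-key probes in slower SCM happen only on fingerprint hits. The sketch below illustrates that idea alone in plain Python; it is an assumption based on the published design, not code from the dissertation.

    import hashlib

    def fingerprint(key: bytes) -> int:
        return hashlib.sha1(key).digest()[0]  # one byte per key

    class Leaf:
        def __init__(self):
            self.fps, self.keys, self.vals = [], [], []

        def insert(self, key: bytes, val):
            self.fps.append(fingerprint(key))
            self.keys.append(key)
            self.vals.append(val)

        def get(self, key: bytes):
            fp = fingerprint(key)
            for i, f in enumerate(self.fps):         # cheap scan of one-byte hashes
                if f == fp and self.keys[i] == key:  # full compare only on a hit
                    return self.vals[i]
            return None

    leaf = Leaf()
    leaf.insert(b"user:42", "Ada")
    print(leaf.get(b"user:42"))  # Ada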
APA, Harvard, Vancouver, ISO, and other styles
12

Regin, Måns, and Gunnarsson Emil. "Refactoring Existing Database Layers for Improved Performance, Readability and Simplicity". Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-105277.

Full text
Abstract:
Since the late 90s, support and services at SAAB have produced and maintained a product called ELDIS. ELDIS is an application used by the Swedish armed forces and flight technicians at air bases in Sweden. It displays electrical information, wire diagrams, and detailed information for cables, electrical equipment, and other electrical devices. The main problem for ELDIS is that, when the application draws wire diagrams, the stored procedures take too long to retrieve information from the database. There are two significant areas in this project: analyzing and optimizing stored procedures, and implementing a client-side solution. This project aims to guide SAAB in choosing the right approach for solving the application's performance issue, and to illustrate some of the problems slow stored procedures can cause for companies in general. The project optimized the most used stored procedure at SAAB and compared it to a client-side solution and the original application. The result is that both the optimized stored procedure implementation and the client-side implementation are faster than the original implementation. It also highlights that, when trying to optimize stored procedures, indexing on the database should be considered for increasing their performance.
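The closing observation, that indexing should be considered when tuning stored procedures, is easy to reproduce in miniature. The sketch below uses SQLite and an invented table rather than SAAB's schema, which is not public:

    import sqlite3, time

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE cable (id INTEGER PRIMARY KEY, diagram_id INTEGER, label TEXT)")
    con.executemany("INSERT INTO cable (diagram_id, label) VALUES (?, ?)",
                    [(i % 1000, f"W{i}") for i in range(200_000)])

    def timed_count():
        t0 = time.perf_counter()
        con.execute("SELECT COUNT(*) FROM cable WHERE diagram_id = 7").fetchone()
        return time.perf_counter() - t0

    before = timed_count()                                   # full table scan
    con.execute("CREATE INDEX idx_cable_diagram ON cable(diagram_id)")
    after = timed_count()                                    # index seek
    print(f"scan: {before:.5f}s, indexed: {after:.5f}s")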
APA, Harvard, Vancouver, ISO, and other styles
13

Deschler, Kurt W. "MASS a multi-axis storage structure for large XML documents". Link to electronic thesis, 2002. http://www.wpi.edu/Pubs/ETD/Available/etd-0506102-113510.

Full text
APA, Harvard, Vancouver, ISO, and other styles
14

Chan, Wing Sze. "Semantic search of multimedia data objects through collaborative intelligence". HKBU Institutional Repository, 2010. http://repository.hkbu.edu.hk/etd_ra/1171.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Langley, Joseph R. "SCRIBE a clustering approach to semantic information retrieval /". Master's thesis, Mississippi State : Mississippi State University, 2006. http://sun.library.msstate.edu/ETD-db/ETD-browse/browse.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Neal, Diane Rasmussen. "News photography image retrieval practices: Locus of control in two contexts". Thesis, University of North Texas, 2006. https://digital.library.unt.edu/ark:/67531/metadc5591/.

Full text
Abstract:
This is the first known study to explore the image retrieval preferences of news photographers and news photo editors in work contexts. Survey participants (n=102) provided opinions regarding 11 photograph searching methods. The quantitative survey data were analyzed using descriptive statistics, while content analysis was used to evaluate the qualitative survey data. In addition, news photographers and news photo editors (n=11) participated in interviews. Data from the interviews were analyzed with phenomenography. The survey data demonstrated that most participants prefer searching by events taking place in the photograph, objects that exist in the photograph, photographer-provided keywords, and relevant metadata, such as the date the picture was taken. They also prefer browsing. Respondents had mixed opinions about searching by emotions elicited in a photograph, as well as the environmental conditions represented in a photograph. Participants' lowest-rated methods included color and light, lines and shapes, and depth, shadow, or perspective. They also expressed little interest in technical information about a photograph, such as shutter speed and aperture. Interview participants' opinions about the search methods reflected the survey respondents' views. They discussed other aspects of news photography as well, including the stories told by the pictures, technical concerns about digital photography, and digital archiving and preservation issues. These stated preferences for keyword searching, browsing, and photographer-provided keywords illustrate a desire for a strong internal locus of control in digital photograph archives. Such methods allow users more control over access to their photographs, while the methods deemed less favorable by survey participants offer less control. Participants believe they can best find their photographs if they can control how they index and search for them. Therefore, it would be useful to design online photograph archives that allow users to control representation and access. Future research possibilities include determining the preferences of other image retrieval system users, performing user studies with moving image information retrieval systems, and uniting content-based and concept-based image retrieval research.
APA, Harvard, Vancouver, ISO, and other styles
17

Weldeghebriel, Zemichael Fesahatsion. "Evaluating and comparing search engines in retrieving text information from the web". Thesis, Stellenbosch : Stellenbosch University, 2004. http://hdl.handle.net/10019.1/53740.

Full text
Abstract:
Thesis (MPhil)--Stellenbosch University, 2004
ENGLISH ABSTRACT: With the introduction of the Internet and the World Wide Web (www), information can be easily accessed and retrieved from the web using information retrieval systems such as web search engines, or simply search engines. A number of search engines have been developed to provide access to the resources available on the web and to help users retrieve relevant information from it. In particular, they are essential for finding text information on the web for academic purposes. But how effective and efficient are those search engines in retrieving the most relevant text information from the web? Which of the search engines are more effective and efficient? This study was conducted to see how effective and efficient search engines are, and which search engines are most effective and efficient, in retrieving the required text information from the web. It is very important to know the most effective and efficient search engines, because such search engines can be used to retrieve a higher number of the most relevant text web pages with minimum time and effort. The study was based on nine major search engines, four search queries, and relevancy judgments of relevant/partly-relevant/non-relevant. Precision and recall were calculated based on the experimental or test results, and these were used as the basis for the statistical evaluation and comparison of the retrieval effectiveness of the nine search engines. Duplicated items and broken links were also recorded, examined separately, and used as an additional measure of search engine effectiveness. Response time was also recorded and used as the basis for the statistical evaluation and comparison of the retrieval efficiency of the nine search engines. Additionally, since search engines involve indexing and searching in the information retrieval process on the web, this study first discusses, from a theoretical point of view, how the indexing and searching processes are performed in an information retrieval environment. It also discusses the influence of indexing and searching processes on the effectiveness and efficiency of information retrieval systems in general, and of search engines in particular, in retrieving the most relevant text information from the web.
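The study's two effectiveness measures combine as follows; the retrieved and relevant sets below are invented for a single hypothetical query:

    retrieved = {"d1", "d2", "d3", "d4", "d5"}   # pages a search engine returned
    relevant = {"d2", "d4", "d6", "d7"}          # pages judged relevant

    precision = len(retrieved & relevant) / len(retrieved)  # 2/5 = 0.40
    recall = len(retrieved & relevant) / len(relevant)      # 2/4 = 0.50
    print(f"precision={precision:.2f}, recall={recall:.2f}")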
APA, Harvard, Vancouver, ISO, and other styles
18

Molinari, Alberto Heitor. "Indexação de acórdãos por meio de uma ontologia jurisprudencial populada a partir de um corpus jurídico real". Universidade Tecnológica Federal do Paraná, 2011. http://repositorio.utfpr.edu.br/jspui/handle/1/360.

Full text
Abstract:
Aiming at improving the quality and accuracy of the jurisprudential search services of the courts of justice, this dissertation proposes the automatic identification of relevant terms, or sentences, in a corpus of jurisprudential documents. Each identified term should be related to a concept that represents its meaning in the domain of jurisprudence, thus forming an individual of that domain. At the end, each individual must be stored in a knowledge base, thus enabling semantic queries over the documents. To achieve these objectives, methods were proposed for the extraction of relevant sentences and for building an application ontology for the representation of judgments. Furthermore, ontology navigation and search methods were proposed. For the extraction of sentences, several Text Mining techniques were used, such as Sentence Extraction, Analysis of Regular Expressions, Stemming, Stop-words and Controlled Vocabularies. The construction of the ontology followed the OTKM methodology and used knowledge representation languages such as DL, RDF and OWL. The Jena framework was applied for navigation in the ontology, and the SPARQL query language for searching it. To validate the methods proposed here, an application ontology for the domain of judgments was built, as well as a knowledge management application for judgments based on that ontology. The application includes routines for knowledge extraction from a corpus of judgments, for populating the ontology with the extracted knowledge, and for semantic search over the populated ontology. The ontology and the knowledge extracted from 50 judgments were submitted to criticism by experts in jurisprudence. Finally, the semantic search routine was tested with the ontology populated from 15,000 judgments, all extracted from the actual jurisprudence base of the Court of Justice of the State of Paraná, Brazil. The results obtained in the experiments demonstrated that the approach was satisfactory both in indexing the documents and in semantic search, showing that the developed ontology meets the application's requirements.
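The dissertation navigates its populated ontology with the Jena framework and queries it in SPARQL. An analogous sketch can be written in Python with rdflib; the miniature judgment ontology and its namespace below are invented for illustration:

    from rdflib import Graph, Literal, Namespace, RDF

    JUR = Namespace("http://example.org/jur#")   # hypothetical namespace
    g = Graph()
    g.add((JUR.judgment42, RDF.type, JUR.Judgment))
    g.add((JUR.judgment42, JUR.subject, Literal("consumer law")))
    g.add((JUR.judgment42, JUR.rapporteur, Literal("Des. Silva")))

    # Semantic search: every judgment indexed under a given subject.
    q = """
    PREFIX jur: <http://example.org/jur#>
    SELECT ?j WHERE { ?j a jur:Judgment ; jur:subject "consumer law" . }
    """
    for row in g.query(q):
        print(row.j)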
APA, Harvard, Vancouver, ISO, and other styles
19

Mohamed, Aamer S. S. "From content-based to semantic image retrieval. Low level feature extraction, classification using image processing and neural networks, content based image retrieval, hybrid low level and high level based image retrieval in the compressed DCT domain". Thesis, University of Bradford, 2010. http://hdl.handle.net/10454/4438.

Full text
Abstract:
Digital image archiving urgently requires advanced techniques for more efficient storage and retrieval because of the increasing amount of digital images. Although JPEG provides systems to compress image data efficiently, the problems of how to organize the image database structure for efficient indexing and retrieval, how to index and retrieve image data from the DCT compressed domain, and how to interpret image data semantically are major obstacles to the further development of digital image database systems. In content-based image retrieval, image analysis is the primary step to extract useful information from image databases. The difficulty in content-based image retrieval is how to summarize the low-level features into high-level or semantic descriptors to facilitate the retrieval procedure. Such a shift toward semantic visual data learning or detection of semantic objects generates an urgent need to link the low-level features with semantic understanding of the observed visual information. To solve this 'semantic gap' problem, an efficient way is to develop a number of classifiers to identify the presence of semantic image components that can be connected to semantic descriptors. Among various semantic objects, the human face is a very important example, which is usually also the most significant element in many images and photos. The presence of faces can usually be correlated to specific scenes with semantic inference according to a given ontology. Therefore, face detection can be an efficient tool to annotate images for semantic descriptors. In this thesis, a paradigm to process, analyze and interpret digital images is proposed. In order to speed up access to desired images, image features are presented for analysis after the image data is accessed. This analysis gives not only a structure for content-based image retrieval but also the basic units for high-level semantic image interpretation. Finally, images are interpreted and classified into semantic categories by a semantic object detection and categorization algorithm.
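The low-level feature stage described above can be sketched in a few lines: each image is reduced to a colour histogram, and candidates are ranked by histogram intersection. This is a toy example with random placeholder images, not the thesis's DCT-domain features:

    import numpy as np

    def histogram(img, bins=8):
        # Per-channel colour histogram, normalised so images of any size compare.
        h = np.concatenate([np.histogram(img[..., c], bins=bins, range=(0, 256))[0]
                            for c in range(3)]).astype(float)
        return h / h.sum()

    def intersection(h1, h2):
        return np.minimum(h1, h2).sum()   # 1.0 means identical distributions

    rng = np.random.default_rng(0)
    database = [rng.integers(0, 256, (64, 64, 3)) for _ in range(5)]
    query = database[3].copy()            # query with a copy of one stored image

    scores = [intersection(histogram(query), histogram(im)) for im in database]
    print(int(np.argmax(scores)))         # 3: the matching image ranks first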
APA, Harvard, Vancouver, ISO, and other styles
20

Mohamed, Aamer Saleh Sahel. "From content-based to semantic image retrieval: low level feature extraction, classification using image processing and neural networks, content based image retrieval, hybrid low level and high level based image retrieval in the compressed DCT domain". Thesis, University of Bradford, 2010. http://hdl.handle.net/10454/4438.

Full text
Abstract:
Digital image archiving urgently requires advanced techniques for more efficient storage and retrieval because of the increasing amount of digital images. Although JPEG provides systems to compress image data efficiently, the problems of how to organize the image database structure for efficient indexing and retrieval, how to index and retrieve image data from the DCT compressed domain, and how to interpret image data semantically are major obstacles to the further development of digital image database systems. In content-based image retrieval, image analysis is the primary step to extract useful information from image databases. The difficulty in content-based image retrieval is how to summarize the low-level features into high-level or semantic descriptors to facilitate the retrieval procedure. Such a shift toward semantic visual data learning or detection of semantic objects generates an urgent need to link the low-level features with semantic understanding of the observed visual information. To solve this 'semantic gap' problem, an efficient way is to develop a number of classifiers to identify the presence of semantic image components that can be connected to semantic descriptors. Among various semantic objects, the human face is a very important example, which is usually also the most significant element in many images and photos. The presence of faces can usually be correlated to specific scenes with semantic inference according to a given ontology. Therefore, face detection can be an efficient tool to annotate images for semantic descriptors. In this thesis, a paradigm to process, analyze and interpret digital images is proposed. In order to speed up access to desired images, image features are presented for analysis after the image data is accessed. This analysis gives not only a structure for content-based image retrieval but also the basic units for high-level semantic image interpretation. Finally, images are interpreted and classified into semantic categories by a semantic object detection and categorization algorithm.
APA, Harvard, Vancouver, ISO, and other styles
21

Ton That, Dai Hai. "Gestion efficace et partage sécurisé des traces de mobilité". Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLV003/document.

Full text
Abstract:
Nowadays, advances in the development of mobile devices and embedded sensors have permitted an unprecedented number of services to the user. At the same time, most mobile devices generate, store and communicate a large amount of personal information continuously. While managing personal information on mobile devices is still a big challenge, sharing and accessing this information in a safe and secure way is always an open and hot topic. Personal mobile devices may have various form factors, such as mobile phones, smart devices, stick computers or secure tokens. They can be used to record, sense and store data about the user's context or surrounding environment. The most common contextual information is the user's location. Personal data generated and stored on these devices is valuable for many applications or services, but it is sensitive and needs to be protected in order to ensure individual privacy. In particular, most mobile applications have access to accurate and real-time location information, raising serious privacy concerns for their users. In this dissertation, we dedicate two parts to managing location traces, i.e. spatio-temporal data, on mobile devices. In particular, we offer an extension of spatio-temporal data types and operators for embedded environments. These data types reconcile the features of spatio-temporal data with embedded requirements by offering an optimal data representation, called the Spatio-temporal object (STOB), dedicated to embedded devices. More importantly, in order to optimize query processing, we also propose an efficient indexing technique for spatio-temporal data called TRIFL, designed for flash storage. TRIFL stands for TRajectory Index for Flash memory. It exploits unique properties of trajectory insertion and optimizes the data structure for the behavior of flash and the buffer cache. These ideas allow TRIFL to achieve much better performance on both flash and magnetic storage than its competitors. Additionally, in the remaining part of this thesis we investigate the protection of the user's sensitive information by offering a privacy-aware protocol for participatory sensing applications called PAMPAS. PAMPAS relies on secure hardware solutions and proposes a user-centric, privacy-aware protocol that fully protects personal data while taking advantage of distributed computing. To this end, we also propose a partitioning algorithm and an aggregation algorithm in PAMPAS. This combination drastically reduces the overall costs, making it possible to run the protocol in near real time with a large number of participants, without any personal information leakage.
APA, Harvard, Vancouver, ISO, and other styles
22

Douieb, Karim. "Hotlinks and dictionaries". Doctoral thesis, Universite Libre de Bruxelles, 2008. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/210471.

Full text
Abstract:
Knowledge has always been a decisive factor in humankind's social evolution. Collecting the world's knowledge is one of the greatest challenges of our civilization. Knowledge involves the use of information, but information is not knowledge: knowledge is a way of acquiring and understanding information. Improving the visibility and accessibility of information requires organizing it efficiently. This thesis focuses on this general purpose.

A fundamental objective of computer science is to store and retrieve information efficiently. This is known as the dictionary problem. A dictionary asks for a data structure which essentially supports the search operation. In general, information that is important and popular at a given time has to be accessed faster than less relevant information. This can be achieved by dynamically managing the data structure periodically, so that relevant information is located closer to the search starting point. The second part of this thesis is devoted to the development and understanding of self-adjusting dictionaries in various models of computation. In particular, we focus our attention on dictionaries which do not have any knowledge of future accesses. Those dictionaries have to adapt themselves automatically in order to remain competitive with dictionaries specifically tuned for a given access sequence.

This approach, which transforms the information structure, is not always feasible. One reason can be that the structure is based on the semantics of the information, such as a categorization. In this context, the search procedure is linked to the structure itself, and modifying the structure will affect how a search is performed. A solution developed to improve search in static structures is hotlink assignment. It is a way to enhance a structure without altering its original design. This approach speeds up the search by creating shortcuts in the structure. The first part of this thesis is devoted to this approach.
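The simplest self-adjusting dictionary, useful here only for intuition since the thesis studies far richer structures and models, is the move-to-front list: every successful search promotes its key, so currently popular items drift toward the search starting point without any knowledge of future accesses. A sketch:

    class MTFDict:
        def __init__(self):
            self.items = []   # (key, value) pairs, hottest first

        def insert(self, key, value):
            self.items.insert(0, (key, value))

        def search(self, key):
            for i, (k, v) in enumerate(self.items):
                if k == key:
                    # Promote the accessed entry to the front of the list.
                    self.items.insert(0, self.items.pop(i))
                    return v
            return None

    d = MTFDict()
    for k in "abcde":
        d.insert(k, k.upper())
    d.search("a"); d.search("a")
    print(d.items[0])  # ('a', 'A'): the repeatedly accessed key now sits in front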
Doctorat en Sciences

APA, Harvard, Vancouver, ISO, and other styles
23

Camacho, Rodriguez Jesus. "Efficient techniques for large-scale Web data management". Thesis, Paris 11, 2014. http://www.theses.fr/2014PA112229/document.

Full text
Abstract:
The recent development of commercial cloud computing environments has strongly impacted research and development in distributed software platforms. Cloud providers offer a distributed, shared-nothing infrastructure that may be used for data storage and processing. In parallel with the development of cloud platforms, programming models that seamlessly parallelize the execution of data-intensive tasks over large clusters of commodity machines have received significant attention, starting with the by-now well-known MapReduce model and continuing through other novel and more expressive frameworks. As these models are increasingly used to express analytical-style data processing tasks, the need arises for higher-level languages that ease the burden of writing complex queries for these systems. This thesis investigates the efficient management of Web data on large-scale infrastructures. In particular, we study the performance and cost of exploiting cloud services to build Web data warehouses, and the parallelization and optimization of query languages tailored towards querying Web data declaratively. First, we present AMADA, an architecture for warehousing large-scale Web data in commercial cloud platforms. AMADA operates in a Software as a Service (SaaS) approach, allowing users to upload, store, and query large volumes of Web data. Since cloud users support monetary costs directly connected to their consumption of resources, our focus is not only on query performance from an execution time perspective, but also on the monetary costs associated with this processing. In particular, we study the applicability of several content indexing strategies, and show that they lead not only to reduced query evaluation time, but also, importantly, to reduced monetary costs associated with the exploitation of the cloud-based warehouse. Second, we consider the efficient parallelization of the execution of complex queries over XML documents, implemented within our system PAXQuery. We provide novel algorithms showing how to translate such queries into plans expressed in the PArallelization ConTracts (PACT) programming model. These plans are then optimized and executed in parallel by the Stratosphere system. We demonstrate the efficiency and scalability of our approach through experiments on hundreds of GB of XML data. Finally, we present a novel approach for identifying and reusing common subexpressions occurring in Pig Latin scripts. In particular, we lay the foundation of our reuse-based algorithms by formalizing the semantics of the Pig Latin query language with extended nested relational algebra for bags. Our algorithm, named PigReuse, operates on the algebraic representations of Pig Latin scripts, identifies subexpression merging opportunities, selects the best ones to execute based on a cost function, and merges other equivalent expressions to share their results. We bring several extensions to the algorithm to improve its performance. Our experimental results demonstrate the efficiency and effectiveness of our reuse-based algorithms and optimization strategies.
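The heart of the reuse idea can be shown with a toy version: expression trees are canonicalised to strings, and any canonical form met twice marks a subexpression whose result could be computed once and shared. PigReuse itself operates on Pig Latin's nested algebra with cost-based selection, which this sketch omits; the tuple encoding below is invented.

    def canon(expr):
        # expr is a nested tuple such as ('join', ('load', 'a.csv'), ...).
        if isinstance(expr, tuple):
            return "(" + " ".join(canon(e) for e in expr) + ")"
        return str(expr)

    def shared_subexpressions(scripts):
        seen, shared = set(), set()

        def visit(expr):
            if isinstance(expr, tuple):
                key = canon(expr)
                if key in seen:
                    shared.add(key)    # same subplan seen twice: reuse candidate
                seen.add(key)
                for child in expr[1:]:
                    visit(child)

        for script in scripts:
            visit(script)
        return shared

    load = ("load", "logs.csv")
    s1 = ("join", ("filter", load, "status=200"), ("load", "users.csv"))
    s2 = ("group", ("filter", load, "status=200"))
    print(shared_subexpressions([s1, s2]))  # the shared filter and its load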
APA, Harvard, Vancouver, ISO, and other styles
24

Zampetakis, Stamatis. "Scalable algorithms for cloud-based Semantic Web data management". Thesis, Paris 11, 2015. http://www.theses.fr/2015PA112199/document.

Full text
Abstract:
In order to build smart systems, where machines are able to reason exactly like humans, data with semantics is a major requirement. This need led to the advent of the Semantic Web, which proposes standard ways of representing and querying data with semantics. RDF is the prevalent data model used to describe web resources, and SPARQL is the query language that allows expressing queries over RDF data. Being able to store and query data with semantics triggered the development of many RDF data management systems. The rapid evolution of the Semantic Web provoked the shift from centralized data management systems to distributed ones. The first systems to appear relied on P2P and client-server architectures, while recently the focus has moved to cloud computing. Cloud computing environments have strongly impacted research and development in distributed software platforms. Cloud providers offer distributed, shared-nothing infrastructures that may be used for data storage and processing. The main features of cloud computing involve scalability, fault tolerance, and elastic allocation of computing and storage resources following the needs of the users. This thesis investigates the design and implementation of scalable algorithms and systems for cloud-based Semantic Web data management. In particular, we study the performance and cost of exploiting commercial cloud infrastructures to build Semantic Web data repositories, and the optimization of SPARQL queries for massively parallel frameworks. First, we introduce the basic concepts around the Semantic Web and the main components and frameworks interacting in massively parallel cloud-based systems. In addition, we provide an extended overview of existing RDF data management systems in the centralized and distributed settings, emphasizing the critical concepts of storage, indexing, query optimization, and infrastructure. Second, we present AMADA, an architecture for RDF data management using public cloud infrastructures. We follow the Software as a Service (SaaS) model, where the complete platform is running in the cloud and appropriate APIs are provided to the end users for storing and retrieving RDF data. We explore various storage and querying strategies, revealing pros and cons with respect to performance and also to monetary cost, which is an important new dimension to consider in public cloud services. Finally, we present CliqueSquare, a distributed RDF data management system built on top of Hadoop, incorporating a novel optimization algorithm that is able to produce massively parallel plans for SPARQL queries. We present a family of optimization algorithms, relying on n-ary (star) equality joins to build flat plans, and compare their ability to find the flattest plans possible. Inspired by existing partitioning and indexing techniques, we present a generic storage strategy suitable for storing RDF data in HDFS (Hadoop's Distributed File System). Our experimental results validate the efficiency and effectiveness of the optimization algorithm, and demonstrate the overall performance of the system.
Styles APA, Harvard, Vancouver, ISO, etc.
25

« Redundancy on content-based indexing ». 1997. http://library.cuhk.edu.hk/record=b5889125.

Texte intégral
Résumé :
by Cheung King Lum Kingly.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1997.
Includes bibliographical references (leaves 108-110).
Abstract --- p.ii
Acknowledgement --- p.iii
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Motivation --- p.1
Chapter 1.2 --- Problems in Content-Based Indexing --- p.2
Chapter 1.3 --- Contributions --- p.3
Chapter 1.4 --- Thesis Organization --- p.4
Chapter 2 --- Content-Based Indexing Structures --- p.5
Chapter 2.1 --- R-Tree --- p.6
Chapter 2.2 --- R+-Tree --- p.8
Chapter 2.3 --- R*-Tree --- p.11
Chapter 3 --- Searching in Both R-Tree and R*-Tree --- p.15
Chapter 3.1 --- Exact Search --- p.15
Chapter 3.2 --- Nearest Neighbor Search --- p.19
Chapter 3.2.1 --- Definition of Searching Metrics --- p.19
Chapter 3.2.2 --- Pruning Heuristics --- p.21
Chapter 3.2.3 --- Nearest Neighbor Search Algorithm --- p.24
Chapter 3.2.4 --- Generalization to N-Nearest Neighbor Search --- p.25
Chapter 4 --- An Improved Nearest Neighbor Search Algorithm for R-Tree --- p.29
Chapter 4.1 --- Introduction --- p.29
Chapter 4.2 --- New Pruning Heuristics --- p.31
Chapter 4.3 --- An Improved Nearest Neighbor Search Algorithm --- p.34
Chapter 4.4 --- Replacing Heuristics --- p.36
Chapter 4.5 --- N-Nearest Neighbor Search --- p.41
Chapter 4.6 --- Performance Evaluation --- p.45
Chapter 5 --- Overlapping Nodes in R-Tree and R*-Tree --- p.53
Chapter 5.1 --- Overlapping Nodes --- p.54
Chapter 5.2 --- Problem Induced By Overlapping Nodes --- p.57
Chapter 5.2.1 --- Backtracking --- p.57
Chapter 5.2.2 --- Inefficient Exact Search --- p.57
Chapter 5.2.3 --- Inefficient Nearest Neighbor Search --- p.60
Chapter 6 --- Redundancy On R-Tree --- p.64
Chapter 6.1 --- Motivation --- p.64
Chapter 6.2 --- Adding Redundancy on Index Tree --- p.65
Chapter 6.3 --- R-Tree with Redundancy --- p.66
Chapter 6.3.1 --- Previous Models of R-Tree with Redundancy --- p.66
Chapter 6.3.2 --- Redundant R-Tree --- p.70
Chapter 6.3.3 --- Level List --- p.71
Chapter 6.3.4 --- Inserting Redundancy to R-Tree --- p.72
Chapter 6.3.5 --- Properties of Redundant R-Tree --- p.77
Chapter 7 --- Searching in Redundant R-Tree --- p.82
Chapter 7.1 --- Exact Search --- p.82
Chapter 7.2 --- Nearest Neighbor Search --- p.86
Chapter 7.3 --- Avoidance of Multiple Accesses --- p.89
Chapter 8 --- Experiment --- p.90
Chapter 8.1 --- Experimental Setup --- p.90
Chapter 8.2 --- Exact Search --- p.91
Chapter 8.2.1 --- Clustered Data --- p.91
Chapter 8.2.2 --- Real Data --- p.93
Chapter 8.3 --- Nearest Neighbor Search --- p.95
Chapter 8.3.1 --- Clustered Data --- p.95
Chapter 8.3.2 --- Uniform Data --- p.98
Chapter 8.3.3 --- Real Data --- p.100
Chapter 8.4 --- Discussion --- p.102
Chapter 9 --- Conclusions and Future Research --- p.105
Chapter 9.1 --- Conclusions --- p.105
Chapter 9.2 --- Future Research --- p.106
Bibliography --- p.108
Styles APA, Harvard, Vancouver, ISO, etc.
26

« Feature-based indexing in visual information systems ». 1997. http://library.cuhk.edu.hk/record=b6073015.

Texte intégral
Résumé :
by Donald Asogu Adjeroh.
Thesis (Ph.D.)--Chinese University of Hong Kong, 1997.
Includes bibliographical references (p. 202-216).
Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012]. System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Mode of access: World Wide Web.
Styles APA, Harvard, Vancouver, ISO, etc.
27

Chou, Yun-Huan, et 周韻寰. « A Study on Storage Organizations : Cartesian Product File Optimization and Iconic Indexing ». Thesis, 1996. http://ndltd.ncl.edu.tw/handle/84386591036483765078.

Texte intégral
Résumé :
Ph.D.
National Chiao Tung University
Department of Computer and Information Science
84
Users' queries determine what data will be retrieved; therefore, a good storage organization must be based on queries. In general, queries are distinguished into partial match queries and orthogonal range queries. A partial match query (PMQ) is a query in which fields are restricted to single values. If fields are restricted to a range of values rather than a single value, then this kind of query is called an orthogonal range query (ORQ). Because index structures are only built on some attributes of the data, it is not guaranteed that those index structures are helpful for queries which involve more than one attribute of the data. Thus, two kinds of data structures have been proposed to support systems where these types of queries are common: the partitioned hashing file and the search tree structure. The Cartesian product file (CPF) is a typical hashed file built by partitioned hashing; it is commonly used for such queries and can be regarded as a good storage organization for PMQs. Hence our study intends to focus on the design and construction of optimal CPFs for ORQs. We intend to explore this problem in various directions: find the optimal two-attribute CPF for ORQs, extract properties of the optimal N-attribute CPF for ORQs, and develop a very fast algorithm to construct optimal CPFs for ORQs. The relative relationships among objects in an image can be organized into a kind of index for images. This kind of index, called an iconic index, not only speeds up image retrieval but also supports querying in a more flexible way. The most famous iconic indices proposed include the 2D-string and its variants. Iconic indices are stored in main memory, thus how to design a space-efficient iconic index is very important. In this dissertation, we also propose a new kind of iconic index, which saves more space and can be used to retrieve image data more efficiently.
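To make the bucket-addressing idea behind Cartesian product files concrete, here is a toy sketch in Python; the partition counts, hash choice, and records are assumptions for illustration, not the dissertation's actual design, which studies how to pick such parameters optimally for ORQs.

```python
# A toy Cartesian product file: each attribute is hashed into a few
# partitions, and a record's bucket address is the tuple of partition
# numbers. Partition counts and records are invented for illustration.
from itertools import product
from collections import defaultdict

PARTITIONS = (4, 3)  # two attributes, hashed into 4 and 3 partitions

def bucket(record):
    return tuple(hash(v) % m for v, m in zip(record, PARTITIONS))

file_buckets = defaultdict(list)
for rec in [("ann", 10), ("bob", 20), ("ann", 30)]:
    file_buckets[bucket(rec)].append(rec)

def partial_match(query):
    """query gives a value or None per attribute; None means 'any', so
    the search visits the Cartesian product of candidate partitions."""
    axes = [
        [hash(v) % m] if v is not None else range(m)
        for v, m in zip(query, PARTITIONS)
    ]
    for addr in product(*axes):
        for rec in file_buckets.get(addr, []):
            if all(q is None or q == r for q, r in zip(query, rec)):
                yield rec

print(list(partial_match(("ann", None))))  # match on the first attribute only
```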
Styles APA, Harvard, Vancouver, ISO, etc.
28

« Fuzzy clustering for content-based indexing in multimedia databases ». 2001. http://library.cuhk.edu.hk/record=b5890806.

Texte intégral
Résumé :
Yue Ho-Yin.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2001.
Includes bibliographical references (leaves 129-137).
Abstracts in English and Chinese.
Abstract --- p.i
Acknowledgement --- p.iv
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Problem Definition --- p.7
Chapter 1.2 --- Contributions --- p.8
Chapter 1.3 --- Thesis Organization --- p.10
Chapter 2 --- Literature Review --- p.11
Chapter 2.1 --- Content-based Retrieval, Background and Indexing Problem --- p.11
Chapter 2.1.1 --- Feature Extraction --- p.12
Chapter 2.1.2 --- Nearest-neighbor Search --- p.13
Chapter 2.1.3 --- Content-based Indexing Methods --- p.15
Chapter 2.2 --- Indexing Problems --- p.25
Chapter 2.3 --- Data Clustering Methods for Indexing --- p.26
Chapter 2.3.1 --- Probabilistic Clustering --- p.27
Chapter 2.3.2 --- Possibilistic Clustering --- p.34
Chapter 3 --- Fuzzy Clustering Algorithms --- p.37
Chapter 3.1 --- Fuzzy Competitive Clustering --- p.38
Chapter 3.2 --- Sequential Fuzzy Competitive Clustering --- p.40
Chapter 3.3 --- Experiments --- p.43
Chapter 3.3.1 --- Experiment 1: Data set with different number of samples --- p.44
Chapter 3.3.2 --- Experiment 2: Data set on different dimensionality --- p.46
Chapter 3.3.3 --- Experiment 3: Data set with different number of natural clusters inside --- p.55
Chapter 3.3.4 --- Experiment 4: Data set with different noise level --- p.56
Chapter 3.3.5 --- Experiment 5: Clusters with different geometry size --- p.60
Chapter 3.3.6 --- Experiment 6: Clusters with different number of data instances --- p.67
Chapter 3.3.7 --- Experiment 7: Performance on real data set --- p.71
Chapter 3.4 --- Discussion --- p.72
Chapter 3.4.1 --- Differences Between FCC, SFCC, and Other Clustering Algorithms --- p.72
Chapter 3.4.2 --- Variations on SFCC --- p.75
Chapter 3.4.3 --- Why SFCC? --- p.75
Chapter 4 --- Hierarchical Indexing based on Natural Clusters Information --- p.77
Chapter 4.1 --- The Hierarchical Approach --- p.77
Chapter 4.2 --- The Sequential Fuzzy Competitive Clustering Binary Tree (SFCC-b-tree) --- p.79
Chapter 4.2.1 --- Data Structure of SFCC-b-tree --- p.80
Chapter 4.2.2 --- Tree Building of SFCC-b-Tree --- p.82
Chapter 4.2.3 --- Insertion of SFCC-b-tree --- p.83
Chapter 4.2.4 --- Deletion of SFCC-b-Tree --- p.84
Chapter 4.2.5 --- Searching in SFCC-b-Tree --- p.84
Chapter 4.3 --- Experiments --- p.88
Chapter 4.3.1 --- Experimental Setting --- p.88
Chapter 4.3.2 --- Experiment 8: Test for different leaf node sizes --- p.90
Chapter 4.3.3 --- Experiment 9: Test for different dimensionality --- p.97
Chapter 4.3.4 --- Experiment 10: Test for different sizes of data sets --- p.104
Chapter 4.3.5 --- Experiment 11: Test for different data distributions --- p.109
Chapter 4.4 --- Summary --- p.113
Chapter 5 --- A Case Study on SFCC-b-tree --- p.114
Chapter 5.1 --- Introduction --- p.114
Chapter 5.2 --- Data Collection --- p.115
Chapter 5.3 --- Data Pre-processing --- p.116
Chapter 5.4 --- Experimental Results --- p.119
Chapter 5.5 --- Summary --- p.121
Chapter 6 --- Conclusion --- p.122
Chapter 6.1 --- An Efficiency Formula --- p.122
Chapter 6.1.1 --- Motivation --- p.122
Chapter 6.1.2 --- Regression Model --- p.123
Chapter 6.1.3 --- Discussion --- p.124
Chapter 6.2 --- Future Directions --- p.127
Chapter 6.3 --- Conclusion --- p.128
Bibliography --- p.129
Styles APA, Harvard, Vancouver, ISO, etc.
29

Ward, Douglas Ross. « Indexing information for knowledge building in a student-generated database ». 1996. http://books.google.com/books?id=kH3kAAAAMAAJ.

Texte intégral
Styles APA, Harvard, Vancouver, ISO, etc.
30

Kontostathis, April. « A term co-occurrence based framework for understanding LSI [i.e. latent semantic indexing] : theory and practice / ». Diss., 2003. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3117161.

Texte intégral
Styles APA, Harvard, Vancouver, ISO, etc.
31

Sadoghi, Hamedani Mohammad. « An Efficient, Extensible, Hardware-aware Indexing Kernel ». Thesis, 2013. http://hdl.handle.net/1807/65515.

Texte intégral
Résumé :
Modern hardware has the potential to play a central role in scalable data management systems. A realization of this potential arises in the context of indexing queries, a recurring theme in real-time data analytics, targeted advertising, algorithmic trading, and data-centric workflows, and of indexing data, a challenge in multi-version analytical query processing. To enhance query and data indexing, in this thesis we present an efficient, extensible, and hardware-aware indexing kernel. This indexing kernel rests upon novel data structures and (parallel) algorithms that utilize the capabilities offered by modern hardware, especially the abundance of main memory, multi-core architectures, hardware accelerators, and solid state drives. This thesis focuses on presenting our query-indexing techniques for processing queries in data-intensive applications that are subject to ever-increasing data volume and velocity. At the core of our query-indexing kernel lies the BE-Tree family of memory-resident indexing structures, which scales by overcoming the curse of dimensionality through a novel two-phase space-cutting technique, effective Top-k processing, and adaptive parallel algorithms that operate directly on compressed data and exploit the multi-core architecture. Furthermore, we achieve line-rate processing by harnessing the unprecedented degrees of parallelism and pipelining only available through low-level logic design using FPGAs. Finally, we present a comprehensive evaluation that establishes the superiority of BE-Tree in comparison with state-of-the-art algorithms. In this thesis, we further expand the scope of our indexing kernel and describe how to accelerate analytical queries on (multi-version) databases by enabling indexes on the most recent data. Our goal is to reduce the overhead of index maintenance, so that indexes can be used effectively for analytical queries without being a heavy burden on transaction throughput. To achieve this end, we re-design the data structures in the storage hierarchy to employ an extra level of indirection over solid state drives. This indirection layer dramatically reduces the number of magnetic disk I/Os needed for updating indexes and localizes the index maintenance. As a result, by rethinking how data is indexed, we eliminate the dilemma between update and query performance and substantially reduce index maintenance and query processing costs.
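For readers unfamiliar with the query-indexing setting, the sketch below shows the naive baseline that structures like BE-Tree are designed to beat: every incoming event is tested against every stored subscription. The subscription format and sample data are invented; BE-Tree's actual two-phase space-cutting replaces this linear scan with a recursive partitioning of the attribute space.

```python
# Naive baseline for matching events against indexed Boolean-expression
# subscriptions (conjunctions of attribute-interval predicates).
# The subscriptions and event below are invented examples.
subscriptions = {
    "s1": {"price": (10, 50), "qty": (1, 5)},  # attribute -> (lo, hi)
    "s2": {"price": (40, 90)},
    "s3": {"qty": (3, 8), "region": (1, 1)},
}

def matches(event, sub):
    # every predicate interval must contain the event's attribute value
    return all(attr in event and lo <= event[attr] <= hi
               for attr, (lo, hi) in sub.items())

def match_event(event):
    # linear scan over all subscriptions: exactly what a query index avoids
    return [sid for sid, sub in subscriptions.items() if matches(event, sub)]

print(match_event({"price": 45, "qty": 4}))  # -> ['s1', 's2']
```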
Styles APA, Harvard, Vancouver, ISO, etc.
32

Oukid, Ismail. « Architectural Principles for Database Systems on Storage-Class Memory ». Doctoral thesis, 2017. https://tud.qucosa.de/id/qucosa%3A30750.

Texte intégral
Résumé :
Database systems have long been optimized to hide the higher latency of storage media, yielding complex persistence mechanisms. With the advent of large DRAM capacities, it became possible to keep a full copy of the data in DRAM. Systems that leverage this possibility, such as main-memory databases, keep two copies of the data in two different formats: one in main memory and the other one in storage. The two copies are kept synchronized using snapshotting and logging. This main-memory-centric architecture yields nearly two orders of magnitude faster analytical processing than traditional, disk-centric ones. The rise of Big Data emphasized the importance of such systems with an ever-increasing need for more main memory. However, DRAM is hitting its scalability limits: It is intrinsically hard to further increase its density. Storage-Class Memory (SCM) is a group of novel memory technologies that promise to alleviate DRAM’s scalability limits. They combine the non-volatility, density, and economic characteristics of storage media with the byte-addressability and a latency close to that of DRAM. Therefore, SCM can serve as persistent main memory, thereby bridging the gap between main memory and storage. In this dissertation, we explore the impact of SCM as persistent main memory on database systems. Assuming a hybrid SCM-DRAM hardware architecture, we propose a novel software architecture for database systems that places primary data in SCM and directly operates on it, eliminating the need for explicit IO. This architecture yields many benefits: First, it obviates the need to reload data from storage to main memory during recovery, as data is discovered and accessed directly in SCM. Second, it allows replacing the traditional logging infrastructure by fine-grained, cheap micro-logging at data-structure level. Third, secondary data can be stored in DRAM and reconstructed during recovery. Fourth, system runtime information can be stored in SCM to improve recovery time. Finally, the system may retain and continue in-flight transactions in case of system failures. However, SCM is no panacea as it raises unprecedented programming challenges. Given its byte-addressability and low latency, processors can access, read, modify, and persist data in SCM using load/store instructions at a CPU cache line granularity. The path from CPU registers to SCM is long and mostly volatile, including store buffers and CPU caches, leaving the programmer with little control over when data is persisted. Therefore, there is a need to enforce the order and durability of SCM writes using persistence primitives, such as cache line flushing instructions. This in turn creates new failure scenarios, such as missing or misplaced persistence primitives. We devise several building blocks to overcome these challenges. First, we identify the programming challenges of SCM and present a sound programming model that solves them. Then, we tackle memory management, as the first required building block to build a database system, by designing a highly scalable SCM allocator, named PAllocator, that fulfills the versatile needs of database systems. Thereafter, we propose the FPTree, a highly scalable hybrid SCM-DRAM persistent B+-Tree that bridges the gap between the performance of transient and persistent B+-Trees. Using these building blocks, we realize our envisioned database architecture in SOFORT, a hybrid SCM-DRAM columnar transactional engine. 
We propose an SCM-optimized MVCC scheme that eliminates write-ahead logging from the critical path of transactions. Since SCM-resident data is near-instantly available upon recovery, the new recovery bottleneck is rebuilding DRAM-based data. To alleviate this bottleneck, we propose a novel recovery technique that achieves nearly instant responsiveness of the database by accepting queries right after recovering SCM-based data, while rebuilding DRAM-based data in the background. Additionally, SCM brings new failure scenarios that existing testing tools cannot detect. Hence, we propose an online testing framework that is able to automatically simulate power failures and detect missing or misplaced persistence primitives. Finally, our proposed building blocks can serve to build more complex systems, paving the way for future database systems on SCM.
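The write-ordering discipline described above can be illustrated with ordinary file I/O, where fsync plays the role that cache-line flushes and memory fences play on SCM; the log format below is a made-up stand-in, not SOFORT's micro-logging, and only demonstrates the "payload durable before validity flag" pattern.

```python
# Sketch of ordered durable writes: a record only becomes valid after its
# payload is durably written. fsync stands in for SCM cache-line flush +
# fence; the file name and record layout are invented for illustration.
import os
import struct

def persist(f):
    f.flush()
    os.fsync(f.fileno())  # durability barrier

def append_record(path, payload: bytes):
    with open(path, "ab") as f:
        f.write(struct.pack("<I", len(payload)))  # 4-byte length prefix
        f.write(payload)
        persist(f)        # 1) make the payload durable first...
        f.write(b"\x01")  # 2) ...then write the one-byte valid flag
        persist(f)        # a crash between the two barriers leaves the
                          # record safely invalid (flag never written)

append_record("micro.log", b"update:row42")
```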
Styles APA, Harvard, Vancouver, ISO, etc.
33

Zhang, Ruofei. « Semantics-oriented modeling and retrieval in image databases ». 2005. http://wwwlib.umi.com/cr/binghamton/main.

Texte intégral
Styles APA, Harvard, Vancouver, ISO, etc.
34

« ACTION : automatic classification for Chinese documents ». Chinese University of Hong Kong, 1994. http://library.cuhk.edu.hk/record=b5895378.

Texte intégral
Résumé :
by Jacqueline Wai-ting Wong.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1994.
Includes bibliographical references (p. 107-109).
Abstract --- p.i
Acknowledgement --- p.iii
List of Tables --- p.viii
List of Figures --- p.ix
Chapter 1 --- Introduction --- p.1
Chapter 2 --- Chinese Information Processing --- p.6
Chapter 2.1 --- Chinese Word Segmentation --- p.7
Chapter 2.1.1 --- Statistical Method --- p.8
Chapter 2.1.2 --- Probabilistic Method --- p.9
Chapter 2.1.3 --- Linguistic Method --- p.10
Chapter 2.2 --- Automatic Indexing --- p.10
Chapter 2.2.1 --- Title Indexing --- p.11
Chapter 2.2.2 --- Free-Text Searching --- p.11
Chapter 2.2.3 --- Citation Indexing --- p.12
Chapter 2.3 --- Information Retrieval Systems --- p.13
Chapter 2.3.1 --- Users' Assessment of IRS --- p.13
Chapter 2.4 --- Concluding Remarks --- p.15
Chapter 3 --- Survey on Classification --- p.16
Chapter 3.1 --- Text Classification --- p.17
Chapter 3.2 --- Survey on Classification Schemes --- p.18
Chapter 3.2.1 --- Commonly Used Classification Systems --- p.18
Chapter 3.2.2 --- Classification of Newspapers --- p.31
Chapter 3.3 --- Concluding Remarks --- p.37
Chapter 4 --- System Models and the ACTION Algorithm --- p.38
Chapter 4.1 --- Factors Affecting Systems Performance --- p.38
Chapter 4.1.1 --- Specificity --- p.39
Chapter 4.1.2 --- Exhaustivity --- p.40
Chapter 4.2 --- Assumptions and Scope --- p.42
Chapter 4.2.1 --- Assumptions --- p.42
Chapter 4.2.2 --- System Scope - Data Flow Diagrams --- p.44
Chapter 4.3 --- System Models --- p.48
Chapter 4.3.1 --- Article --- p.48
Chapter 4.3.2 --- Matching Table --- p.49
Chapter 4.3.3 --- Forest --- p.51
Chapter 4.3.4 --- Matching --- p.53
Chapter 4.4 --- Classification Rules --- p.54
Chapter 4.5 --- The ACTION Algorithm --- p.56
Chapter 4.5.1 --- Algorithm Design Objectives --- p.56
Chapter 4.5.2 --- Measuring Node Significance --- p.56
Chapter 4.5.3 --- Pseudocodes --- p.61
Chapter 4.6 --- Concluding Remarks --- p.64
Chapter 5 --- Analysis of Results and Validation --- p.66
Chapter 5.1 --- Seeking for Exhaustivity Rather Than Specificity --- p.67
Chapter 5.1.1 --- The News Article --- p.67
Chapter 5.1.2 --- The Matching Results --- p.68
Chapter 5.1.3 --- The Keyword Values --- p.68
Chapter 5.1.4 --- Analysis of Classification Results --- p.71
Chapter 5.2 --- Catering for Hierarchical Relationships Between Classes and Subclasses --- p.72
Chapter 5.2.1 --- The News Article --- p.72
Chapter 5.2.2 --- The Matching Results --- p.73
Chapter 5.2.3 --- The Keyword Values --- p.74
Chapter 5.2.4 --- Analysis of Classification Results --- p.75
Chapter 5.3 --- A Representative With Zero Occurrence --- p.78
Chapter 5.3.1 --- The News Article --- p.78
Chapter 5.3.2 --- The Matching Results --- p.79
Chapter 5.3.3 --- The Keyword Values --- p.80
Chapter 5.3.4 --- Analysis of Classification Results --- p.81
Chapter 5.4 --- Statistical Analysis --- p.83
Chapter 5.4.1 --- Classification Results with Highest Occurrence Frequency --- p.83
Chapter 5.4.2 --- Classification Results with Zero Occurrence Frequency --- p.85
Chapter 5.4.3 --- Distribution of Classification Results on Level Numbers --- p.86
Chapter 5.5 --- Concluding Remarks --- p.87
Chapter 5.5.1 --- Advantageous Characteristics of ACTION --- p.88
Chapter 6 --- Conclusion --- p.93
Chapter 6.1 --- Perspectives in Document Representation --- p.93
Chapter 6.2 --- Classification Schemes --- p.95
Chapter 6.3 --- Classification System Model --- p.95
Chapter 6.4 --- The ACTION Algorithm --- p.96
Chapter 6.5 --- Advantageous Characteristics of the ACTION Algorithm --- p.96
Chapter 6.6 --- Testing and Validating the ACTION algorithm --- p.98
Chapter 6.7 --- Future Work --- p.99
Chapter 6.8 --- A Final Remark --- p.100
Chapter A --- System Models --- p.102
Chapter B --- Classification Rules --- p.104
Chapter C --- Node Significance Definitions --- p.105
References --- p.107
Styles APA, Harvard, Vancouver, ISO, etc.
35

« Automatic index generation for the free-text based database ». Chinese University of Hong Kong, 1992. http://library.cuhk.edu.hk/record=b5887040.

Texte intégral
Résumé :
by Leung Chi Hong.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1992.
Includes bibliographical references (leaves 183-184).
Chapter Chapter one: --- Introduction --- p.1
Chapter Chapter two: --- Background knowledge and linguistic approaches of automatic indexing --- p.5
Chapter 2.1 --- Definition of index and indexing --- p.5
Chapter 2.2 --- Indexing methods and problems --- p.7
Chapter 2.3 --- Automatic indexing and human indexing --- p.8
Chapter 2.4 --- Different approaches of automatic indexing --- p.10
Chapter 2.5 --- Example of semantic approach --- p.11
Chapter 2.6 --- Example of syntactic approach --- p.14
Chapter 2.7 --- Comments on semantic and syntactic approaches --- p.18
Chapter Chapter three: --- Rationale and methodology of automatic index generation --- p.19
Chapter 3.1 --- Problems caused by natural language --- p.19
Chapter 3.2 --- Usage of word frequencies --- p.20
Chapter 3.3 --- Brief description of rationale --- p.24
Chapter 3.4 --- Automatic index generation --- p.27
Chapter 3.4.1 --- Training phase --- p.27
Chapter 3.4.1.1 --- Selection of training documents --- p.28
Chapter 3.4.1.2 --- Control and standardization of variants of words --- p.28
Chapter 3.4.1.3 --- Calculation of associations between words and indexes --- p.30
Chapter 3.4.1.4 --- Discarding false associations --- p.33
Chapter 3.4.2 --- Indexing phase --- p.38
Chapter 3.4.3 --- Example of automatic indexing --- p.41
Chapter 3.5 --- Related researches --- p.44
Chapter 3.6 --- Word diversity and its effect on automatic indexing --- p.46
Chapter 3.7 --- Factors affecting performance of automatic indexing --- p.60
Chapter 3.8 --- Application of semantic representation --- p.61
Chapter 3.8.1 --- Problem of natural language --- p.61
Chapter 3.8.2 --- Use of concept headings --- p.62
Chapter 3.8.3 --- Example of using concept headings in automatic indexing --- p.65
Chapter 3.8.4 --- Advantages of concept headings --- p.68
Chapter 3.8.5 --- Disadvantages of concept headings --- p.69
Chapter 3.9 --- Correctness prediction for proposed indexes --- p.78
Chapter 3.9.1 --- Example of using index proposing rate --- p.80
Chapter 3.10 --- Effect of subject matter on automatic indexing --- p.83
Chapter 3.11 --- Comparison with other indexing methods --- p.85
Chapter 3.12 --- Proposal for applying Chinese medical knowledge --- p.90
Chapter Chapter four: --- Simulations of automatic index generation --- p.93
Chapter 4.1 --- Training phase simulations --- p.93
Chapter 4.1.1 --- Simulation of association calculation (word diversity uncontrolled) --- p.94
Chapter 4.1.2 --- Simulation of association calculation (word diversity controlled) --- p.102
Chapter 4.1.3 --- Simulation of discarding false associations --- p.107
Chapter 4.2 --- Indexing phase simulation --- p.115
Chapter 4.3 --- Simulation of using concept headings --- p.120
Chapter 4.4 --- Simulation for testing performance of predicting index correctness --- p.125
Chapter 4.5 --- Summary --- p.128
Chapter Chapter five: --- Real case study in database of Chinese Medicinal Material Research Center --- p.130
Chapter 5.1 --- Selection of real documents --- p.130
Chapter 5.2 --- Case study one: Overall performance using real data --- p.132
Chapter 5.2.1 --- Sample results of automatic indexing for real documents --- p.138
Chapter 5.3 --- Case study two: Using multi-word terms --- p.148
Chapter 5.4 --- Case study three: Using concept headings --- p.152
Chapter 5.5 --- Case study four: Prediction of proposed index correctness --- p.156
Chapter 5.6 --- Case study five: Use of (Σ ΔR_ij) F_i to determine false association --- p.159
Chapter 5.7 --- Case study six: Effect of word diversity --- p.162
Chapter 5.8 --- Summary --- p.166
Chapter Chapter six: --- Conclusion --- p.168
Appendix A: List of stopwords --- p.173
Appendix B: Index terms used in case studies --- p.174
References --- p.183
Styles APA, Harvard, Vancouver, ISO, etc.
36

Salvador, Tiago Emanuel Almeida. « Content Management Mobile Application using a Metadata Cloud Server ». Master's thesis, 2015. http://hdl.handle.net/10316/97333.

Texte intégral
Résumé :
Master's dissertation in Informatics Engineering presented to the Faculdade de Ciências e Tecnologia of the Universidade de Coimbra.
The proposed subject of this dissertation is the implementation of a Metadata Cloud Server that can communicate with clients, process metadata from given contents, and present this information in innovative, state-of-the-art ways. The contents can be stored in different kinds of services, with a focus on Dropbox for this project. These types of content are central to people's cultural and entertainment lives, so how they are managed matters. By extracting metadata that already exists in the contents, or by processing it in more complex ways such as smart albums for photos or document indexing, it is possible to present the content in ways more useful than simple lists or views. By building a server application that communicates with the metadata database and is accessible through a REST API, any kind of client application can query the server and obtain the location of the content. Cloud storage services do not usually support smart features or fast content browsing beyond a regular operating-system-like folder structure. By building a system as a complement to the cloud storage service, users can access their data in more advanced ways.
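The following is a minimal sketch of the kind of REST metadata lookup described here, written with Flask; the route, fields, and catalog entries are assumptions for illustration, not the dissertation's actual API or data model.

```python
# Minimal sketch of a REST metadata-lookup endpoint using Flask.
# Route name, fields, and catalog entries are invented examples.
from flask import Flask, jsonify, request

app = Flask(__name__)

# metadata extracted from contents stored elsewhere (e.g. Dropbox)
CATALOG = [
    {"path": "/photos/2014/rome.jpg", "kind": "photo", "tags": ["rome", "travel"]},
    {"path": "/docs/report.pdf", "kind": "document", "tags": ["work"]},
]

@app.route("/search")
def search():
    tag = request.args.get("tag")
    hits = [entry for entry in CATALOG if tag in entry["tags"]]
    return jsonify(hits)  # clients resolve paths against the storage service

if __name__ == "__main__":
    app.run()
```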
Styles APA, Harvard, Vancouver, ISO, etc.
37

Bhavsar, Rajul D. « Search-Optimized Disk Layouts For Suffix-Tree Genomic Indexes ». Thesis, 2011. https://etd.iisc.ac.in/handle/2005/2124.

Texte intégral
Résumé :
Over the last decade, biological sequence repositories have been growing at an exponential rate. Sophisticated indexing techniques are required to facilitate efficient searching through these humongous genetic repositories. A particularly attractive index structure for such sequence processing is the classical suffix-tree, a vertically compressed trie structure built over the set of all suffixes of a sequence. Its attractiveness stems from its linearity properties -- suffix-tree construction times are linear in the size of the indexed sequences, while search times are linear in the size of the query strings. In practice, however, the promise of suffix-trees is not realized for extremely long sequences, such as the human genome, that run into the billions of characters. This is because suffix-trees, which are typically an order of magnitude larger than the indexed sequence, necessarily have to be disk-resident for such elongated sequences, and their traditional construction and traversal algorithms result in random disk accesses. We investigate, in this thesis, post-construction techniques for disk-based suffix-tree storage optimization, with the objective of maximizing disk-reference locality during query processing. We begin by focusing on the layout reorganization in which the node-to-block assignments and sequence of blocks are reworked. Our proposed algorithm is based on combining the breadth-first layout approach advocated in the recent literature with probabilistic techniques for minimizing the physical distance between successive block accesses, based on an analysis of node traversal patterns. In our next step, we consider techniques for reducing the space overheads incurred by suffix-trees. In particular, we propose an embedding strategy whereby leaf nodes can be completely represented within their parent internal nodes, without requiring any space extension of the parent node's structure. To quantitatively evaluate the benefits of our reorganized and restructured layouts, we have conducted extensive experiments on complete human genome sequences, with complex and computationally expensive user queries that involve finding the maximal common substring matches of the query strings. We show, for the first time, that the layout reorganization approach can be scaled to entire genomes, including the human genome. In the layout reorganization, with careful choice of the node-to-block assignment condition and an optimized sequence of blocks, search-time improvements ranging from 25% to 75% can be achieved with respect to the construction layouts on such genomes. While the layout reorganization does take considerable time, it is a one-time process, whereas searches will be repeatedly invoked on this index. The internalization of leaf nodes results in a 25% reduction in the suffix-tree space occupancy. More importantly, when applied to the construction layout, it provides search-time improvements ranging from 25% to 85%, and in conjunction with the reorganized layout, searches are sped up by 50% to 90%. Overall, our study and experimental results indicate that through careful choice of node implementations and layouts, the disk access locality of suffix-trees can be improved to the extent that improvements of up to an order of magnitude in search times may result relative to the classical implementations.
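As a toy illustration of suffix-based substring matching, the operation the thesis accelerates at genome scale, the sketch below builds a suffix array by direct sorting and finds the longest prefix of a query occurring in the text. The text and query are invented; this in-memory approach deliberately ignores the disk-layout issues the thesis studies, and the key= parameter of bisect_left requires Python 3.10 or later.

```python
# Toy suffix-based substring search via a suffix array built by direct
# sorting (fine for short strings; genome-scale work needs disk-resident
# suffix structures, which is the thesis's setting).
from bisect import bisect_left

TEXT = "GATTACAGATTTC"
SA = sorted(range(len(TEXT)), key=lambda i: TEXT[i:])  # suffix array

def longest_match(query):
    """Length of the longest prefix of `query` occurring in TEXT."""
    best = 0
    # locate the query among the sorted suffixes (Python 3.10+ for key=)
    pos = bisect_left(SA, query, key=lambda i: TEXT[i:])
    for i in (pos - 1, pos):  # the lexicographic neighbors maximize the
        if 0 <= i < len(SA):  # common prefix with the query
            suffix = TEXT[SA[i]:]
            k = 0
            while k < min(len(suffix), len(query)) and suffix[k] == query[k]:
                k += 1
            best = max(best, k)
    return best

print(longest_match("TTAC"))  # -> 4 ("TTAC" occurs at position 2)
```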
Styles APA, Harvard, Vancouver, ISO, etc.
38

Bhavsar, Rajul D. « Search-Optimized Disk Layouts For Suffix-Tree Genomic Indexes ». Thesis, 2011. http://etd.iisc.ernet.in/handle/2005/2124.

Texte intégral
Résumé :
Over the last decade, biological sequence repositories have been growing at an exponential rate. Sophisticated indexing techniques are required to facilitate efficient searching through these humongous genetic repositories. A particularly attractive index structure for such sequence processing is the classical suffix-tree, a vertically compressed trie structure built over the set of all suffixes of a sequence. Its attractiveness stems from its linearity properties -- suffix-tree construction times are linear in the size of the indexed sequences, while search times are linear in the size of the query strings. In practice, however, the promise of suffix-trees is not realized for extremely long sequences, such as the human genome, that run into the billions of characters. This is because suffix-trees, which are typically an order of magnitude larger than the indexed sequence, necessarily have to be disk-resident for such elongated sequences, and their traditional construction and traversal algorithms result in random disk accesses. We investigate, in this thesis, post-construction techniques for disk-based suffix-tree storage optimization, with the objective of maximizing disk-reference locality during query processing. We begin by focusing on the layout reorganization in which the node-to-block assignments and sequence of blocks are reworked. Our proposed algorithm is based on combining the breadth-first layout approach advocated in the recent literature with probabilistic techniques for minimizing the physical distance between successive block accesses, based on an analysis of node traversal patterns. In our next step, we consider techniques for reducing the space overheads incurred by suffix-trees. In particular, we propose an embedding strategy whereby leaf nodes can be completely represented within their parent internal nodes, without requiring any space extension of the parent node's structure. To quantitatively evaluate the benefits of our reorganized and restructured layouts, we have conducted extensive experiments on complete human genome sequences, with complex and computationally expensive user queries that involve finding the maximal common substring matches of the query strings. We show, for the first time, that the layout reorganization approach can be scaled to entire genomes, including the human genome. In the layout reorganization, with careful choice of the node-to-block assignment condition and an optimized sequence of blocks, search-time improvements ranging from 25% to 75% can be achieved with respect to the construction layouts on such genomes. While the layout reorganization does take considerable time, it is a one-time process, whereas searches will be repeatedly invoked on this index. The internalization of leaf nodes results in a 25% reduction in the suffix-tree space occupancy. More importantly, when applied to the construction layout, it provides search-time improvements ranging from 25% to 85%, and in conjunction with the reorganized layout, searches are sped up by 50% to 90%. Overall, our study and experimental results indicate that through careful choice of node implementations and layouts, the disk access locality of suffix-trees can be improved to the extent that improvements of up to an order of magnitude in search times may result relative to the classical implementations.
Styles APA, Harvard, Vancouver, ISO, etc.
39

Ricardo, André Parreira. « Building a scalable index and a web search engine for music on the Internet using Open Source software ». Master's thesis, 2010. http://hdl.handle.net/10071/2871.

Texte intégral
Résumé :
The Internet has made accessible thousands of freely available music tracks with Creative Commons or Public Domain licenses, and this number keeps growing every year. In practical terms, it is very difficult to browse this music collection, because it is vast and dispersed across hundreds of websites. To put the music recommendation problem in context and identify the necessary building blocks, a case study of existing systems was carried out. This thesis focuses mainly on the problem of indexing this large collection of music: no database or index holds information about this material, which makes research on the subject extremely difficult. To determine what software could help solve this problem, the state of the art in Open Source tools for web crawling and indexing was assessed. Based on its conclusions, a prototype was developed and implemented using the most appropriate software framework. The resulting solution proved capable of crawling web pages while parsing and indexing MP3 files. The produced index is available through a web search engine interface that also returns results in XML format. The results obtained lead to the conclusion that it is feasible to build a scalable index and web search engine for music on the Internet using Open Source software. This is supported by the proof of concept achieved with the working prototype.
A Internet tornou possível o acesso a milhares de faixas musicais disponíveis gratuitamente segundo uma licença Creative Commons ou de Domínio Público. Na realidade, este número continua a aumentar em cada ano. Em termos práticos, é muito difícil navegar nesta colecção de música, pois a mesma é vasta e encontra-se dispersa em milhares de sites na Web. Para abordar o assunto da recomendação de música, um caso de estudo sobre sistemas de recomendação de música existentes foi elaborado, para contextualizar o problema e identificar os grandes blocos que os constituem. Esta tese foca-se na problemática da indexação de uma grande colecção de música, pela razão de que, não existe uma base de dados ou índice que contenha informação sobre este repositório musical, tornando muito difícil o estudo nesta matéria. De forma a compreender que software poderia ajudar a resolver o problema, foi avaliado o estado da arte em ferramentas de rastreio de conteúdos web e indexação de código aberto. Com base nas conclusões do estado da arte, o protótipo foi desenvolvido e implementado, utilizando o software mais apropriado para a tarefa. A solução criada provou que era possível percorrer as páginas Web, enquanto se analisavam e indexavam MP3. O índice produzido encontra-se disponível através de um motor de busca online e também com resultados no formato XML. Os resultados obtidos levam a concluir que é possível, construir um índice escalável e motor de busca na web para música na Internet utilizando software Open Source. Estes resultados são fundamentados pela prova de conceito obtida com o protótipo funcional.
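To suggest the shape of the crawl-and-index pipeline described above, here is a stdlib-only Python sketch; the seed URL and HTML snippet are invented, and the thesis's actual prototype was built on an Open Source crawling and indexing framework rather than hand-rolled code like this.

```python
# Bare-bones crawl-and-index step: extract links from a fetched page,
# record MP3 URLs in an index, and return the remaining links as the
# crawl frontier. Seed URL and HTML are invented examples.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    def __init__(self, base):
        super().__init__()
        self.base, self.links = base, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base, value))

def index_page(url, html, index):
    parser = LinkParser(url)
    parser.feed(html)
    for link in parser.links:
        if link.lower().endswith(".mp3"):
            index.setdefault(link, {"source": url})  # room for ID3 metadata
    return [l for l in parser.links if not l.lower().endswith(".mp3")]

index = {}
frontier = index_page(
    "http://example.org/music/",
    '<a href="track01.mp3">track</a> <a href="more.html">more</a>',
    index,
)
print(index, frontier)
```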
Styles APA, Harvard, Vancouver, ISO, etc.