Dissertations / Theses on the topic 'Distributed information retrieval'

Consult the top 50 dissertations / theses for your research on the topic 'Distributed information retrieval.'


You can also download the full text of each publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

Craswell, Nicholas Eric. "Methods for Distributed Information Retrieval." The Australian National University, Faculty of Engineering and Information Technology, 2001. http://thesis.anu.edu.au./public/adt-ANU20020315.142540.

Full text
Abstract:
Published methods for distributed information retrieval generally rely on cooperation from search servers. But most real servers, particularly the tens of thousands available on the Web, are not engineered for such cooperation. This means that the majority of methods proposed, and evaluated in simulated environments of homogeneous cooperating servers, are never applied in practice.

This thesis introduces new methods for server selection and results merging. The methods do not require search servers to cooperate, yet are as effective as the best methods which do. Two large experiments evaluate the new methods against many previously published methods. In contrast to previous experiments they simulate a Web-like environment, where servers employ varied retrieval algorithms and tend not to sub-partition documents from a single source.

The server selection experiment uses pages from 956 real Web servers, three different retrieval systems and TREC ad hoc topics. Results show that a broker using queries to sample servers’ documents can perform selection over non-cooperating servers without loss of effectiveness. However, using the same queries to estimate the effectiveness of servers, in order to favour servers with high quality retrieval systems, did not consistently improve selection effectiveness.

The results merging experiment uses documents from five TREC sub-collections, five different retrieval systems and TREC ad hoc topics. Results show that a broker using a reference set of collection statistics, rather than relying on cooperation to collate true statistics, can perform merging without loss of effectiveness. Since application of the reference statistics method requires that the broker download the documents to be merged, experiments were also conducted on effective merging based on partial documents. The new ranking method developed was not highly effective on partial documents, but showed some promise on fully downloaded documents.

Using the new methods, an effective search broker can be built, capable of addressing any given set of available search servers, without their cooperation.
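The reference-statistics merging idea can be sketched in a few lines. This is a minimal illustration, not the thesis's actual broker: it assumes the broker has already downloaded the candidate documents and holds document-frequency statistics (`ref_df`, `ref_n`, both hypothetical names) from a fixed reference collection, so that TF-IDF scores computed for documents returned by different, uncooperative servers are directly comparable.

```python
import math

def rescore_with_reference(doc_terms, query_terms, ref_df, ref_n):
    """Score a downloaded document against the query using TF-IDF weights
    computed from a shared reference collection (ref_df: term -> document
    frequency, ref_n: number of documents in the reference collection)."""
    score = 0.0
    for t in query_terms:
        tf = doc_terms.count(t)
        if tf == 0:
            continue
        idf = math.log((ref_n + 1) / (ref_df.get(t, 0) + 1))
        score += (1 + math.log(tf)) * idf
    return score

def merge_by_reference(results, query_terms, ref_df, ref_n):
    """results: list of (doc_id, doc_terms) pooled from all selected servers.
    Returns one list ranked by the comparable reference-based scores."""
    scored = [(doc_id, rescore_with_reference(terms, query_terms, ref_df, ref_n))
              for doc_id, terms in results]
    return sorted(scored, key=lambda x: -x[1])
```

Because every score comes from the same reference statistics, no cooperation from the originating servers is needed at merge time.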
APA, Harvard, Vancouver, ISO, and other styles
2

Powell, Allison L. "Database selection in distributed information retrieval: a study of multi-collection information retrieval." Full text, Acrobat Reader required, 2001. http://viva.lib.virginia.edu/etd/diss/SEAS/ComputerScience/2001/Powell/etd.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Baumgarten, Christoph. "Probabilistic information retrieval in a distributed heterogeneous environment." Doctoral thesis, [S.l. : s.n.], 1999. http://deposit.ddb.de/cgi-bin/dokserv?idn=963555316.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Baumgarten, Christoph. "Probabilistic information retrieval in a distributed heterogeneous environment." Doctoral thesis, Technische Universität Dresden, 1998. https://tud.qucosa.de/id/qucosa%3A24785.

Full text
Abstract:
This thesis describes a probabilistic model for optimum information retrieval in a distributed heterogeneous environment. The model assumes the collection of documents offered by the environment to be hierarchically partitioned into subcollections. Documents as well as subcollections have to be indexed, and indexing methods using different indexing vocabularies can be employed. A query provided by a user is answered in terms of a ranked list of documents. The model determines a procedure for ranking the documents that stems from the Probability Ranking Principle: for each subcollection the subcollection's elements are ranked; the resulting ranked lists are combined into a final ranked list of documents whose ordering is determined by the documents' probabilities of being relevant with respect to the user's query. Various probabilistic ranking methods may be involved in the distributed ranking process, and the underlying data volume is arbitrarily scalable. A criterion for effectively limiting the ranking process to a subset of subcollections extends the model. The model's applicability is experimentally confirmed: when the degrees of freedom provided by the model were exploited, experiments showed evidence that the model even outperforms comparable models for the non-distributed case with respect to retrieval effectiveness. An architecture for a distributed information retrieval system that realizes the probabilistic model is presented. The system provides access to an arbitrary number of dynamic multimedia databases.
APA, Harvard, Vancouver, ISO, and other styles
5

Yang, Hui. "Methodologies for information source selection under distributed information environments." Access electronically, 2005. http://www.library.uow.edu.au/adt-NWU/public/adt-NWU20060511.123303/index.html.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Liu, Yang. "A resource aware distributed LSI algorithm for scalable information retrieval." Thesis, Brunel University, 2011. http://bura.brunel.ac.uk/handle/2438/5559.

Full text
Abstract:
Latent Semantic Indexing (LSI) is one of the most popular techniques in the information retrieval field. Unlike traditional information retrieval techniques, LSI is not based simply on keyword matching; it uses statistical and algebraic computations. Based on Singular Value Decomposition (SVD), a high-dimensional term-document matrix is converted to a lower-dimensional approximate matrix from which noise can be filtered, and by examining how terms relate to documents, the issues of synonymy and polysemy that affect traditional techniques can be overcome. However, LSI suffers from a scalability issue due to the computational complexity of SVD. This thesis presents a resource-aware distributed LSI algorithm, MR-LSI, which solves the scalability issue using the Hadoop framework, based on the MapReduce distributed computing model. It also solves the overhead issue caused by the clustering algorithm involved. The evaluations indicate that MR-LSI gains significant enhancement compared to other strategies when processing large document collections. One remarkable property of Hadoop is that it supports heterogeneous computing environments, which highlights the issue of unbalanced load among nodes. Therefore, a load balancing algorithm based on a genetic algorithm for balancing load in a static environment is proposed. The results show that it can improve the performance of a cluster according to its heterogeneity level. For dynamic Hadoop environments, a dynamic load balancing strategy with a varying window size is proposed. The algorithm works by making data selection decisions and modeling Hadoop parameters and working mechanisms. Employing an improved genetic algorithm to achieve an optimized scheduler, it enhances the performance of a cluster with a given heterogeneity level.
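The SVD step at the heart of LSI can be illustrated in a few lines of NumPy. This is a toy sketch of plain (non-distributed) LSI, not the MR-LSI algorithm itself: a small term-document matrix is decomposed and then reconstructed from its k = 2 largest singular values, giving the low-rank approximation in which related terms such as "car" and "auto" end up with similar representations.

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents.
A = np.array([[1., 0., 1., 0.],   # "car"
              [1., 1., 0., 0.],   # "auto"
              [0., 1., 0., 1.],   # "engine"
              [0., 0., 1., 1.]])  # "truck"

# Full SVD, then keep only the k largest singular values: A ~ U_k S_k V_k^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Queries and documents are compared in the reduced k-dimensional
# "concept" space rather than in the raw keyword space.
doc_vectors = np.diag(s[:k]) @ Vt[:k, :]   # one k-dim column per document
```

The cost of computing this decomposition grows quickly with the size of A, which is exactly the scalability problem the thesis attacks by distributing the work with MapReduce.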
APA, Harvard, Vancouver, ISO, and other styles
7

Fu, R. "The quality of probabilistic search in unstructured distributed information retrieval systems." Thesis, University College London (University of London), 2012. http://discovery.ucl.ac.uk/1370031/.

Full text
Abstract:
Searching the web is critical to the Web's success. However, the frequency of searches together with the size of the index prohibit a single computer being able to cope with the computational load. Consequently, a variety of distributed architectures have been proposed. Commercial search engines such as Google usually use an architecture where the index is distributed but centrally managed over a number of disjoint partitions. This centralized architecture has a high capital and operating cost that presents a significant barrier preventing any new competitor from entering the search market. The dominance of a few Web search giants brings concerns about the objectivity of search results and the privacy of the user. A promising solution to eliminate the high cost of entry is to conduct the search on a peer-to-peer (P2P) architecture. Peer-to-peer architectures offer a more geographically dispersed arrangement of machines that are not centrally managed. This has the benefit of not requiring an expensive centralized server facility. However, the lack of centralized management can complicate the communication process, and the storage and computational capabilities of peers may be much less than for nodes in a commercial search engine. P2P architectures are commonly categorized into two broad classes, structured and unstructured. Structured architectures guarantee that the entire index is searched for a query, but suffer high communication cost during retrieval and maintenance. In comparison, unstructured architectures do not guarantee the entire index is searched, but require less maintenance cost and are more robust to attacks. In this thesis we study the quality of probabilistic search in an unstructured distributed network, since such a network has potential for developing a low-cost and robust large-scale information retrieval system.
Search in an unstructured distributed network is a challenge, since a single machine normally can only store a subset of documents, and a query is only sent to a subset of machines, due to limitations on computational and communication resources. Thus, IR systems built on such a network do not guarantee that a query finds the required documents in the collection, and the search has to be probabilistic and non-deterministic. The search quality is measured by a new metric called accuracy, defined as the fraction of documents retrieved by a constrained, probabilistic search compared with those that would have been retrieved by an exhaustive search. We propose a mathematical framework for modeling search in an unstructured distributed network, and present a non-deterministic distributed search architecture called Probably Approximately Correct (PAC) search. We provide formulas to estimate the search quality based on different system parameters, and show that PAC can achieve good performance when using the same amount of resources as a centrally managed deterministic distributed information retrieval system. We also study the effects of node selection in a centralized PAC architecture. We theoretically and empirically analyze the search performance across query iterations, and show that the search accuracy can be improved by caching good performing nodes in a centralized PAC architecture. Experiments on a real document collection and query log support our analysis. We then investigate the effects of different document replication policies in a PAC IR system. We show that the traditional square-root replication policy is not optimum for maximizing accuracy, and give an optimality criterion for accuracy. A non-uniform distribution of documents improves the retrieval performance of popular documents at the expense of less popular documents. To compensate for this, we propose a hybrid replication policy consisting of a combination of uniform and non-uniform distributions.
Theoretical and experimental results show that such an arrangement significantly improves the accuracy of less popular documents at the expense of only a small degradation in accuracy averaged over all queries. We finally explore the effects of query caching in the PAC architecture. We empirically analyze the search performance of queries being issued from a query log, and show that the search accuracy can be improved by caching the top-k documents on each node. Simulations on a real document collection and query log support our analysis.
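The accuracy trade-off described in this abstract can be approximated with a simple combinatorial sketch. Under the simplifying assumption (ours, not necessarily the thesis's exact model) that each document is replicated on r of K nodes uniformly at random and a query probes z randomly chosen nodes, the chance that the query reaches at least one copy of a given document is:

```python
from math import comb

def hit_probability(K, r, z):
    """Probability that a query probing z of K nodes reaches at least one
    of the r nodes holding a given document, assuming uniform random
    placement. A back-of-the-envelope sketch, not the thesis's analysis."""
    return 1.0 - comb(K - r, z) / comb(K, z)

# More probes or more replicas raise the expected accuracy;
# e.g. hit_probability(100, 10, 10) is roughly 0.67.
```

Averaging this probability over the documents relevant to a query gives an estimate of the accuracy metric: the fraction of an exhaustive search's results that the constrained probabilistic search is expected to recover.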
APA, Harvard, Vancouver, ISO, and other styles
8

Schatz, Bruce Raymond. "Interactive retrieval in information spaces distributed across a wide-area network." Diss., The University of Arizona, 1991. http://hdl.handle.net/10150/185363.

Full text
Abstract:
The potential to provide interactive data manipulation across high-speed nationwide networks is stimulating development of new database technology. An information space is a data model that can support rapid browsing of large amounts of information contained in a digital library physically distributed across many disparate sources. This dissertation discusses supporting interactive retrieval of objects inside an information space across the nationwide scientific network. Implementing such interactive retrieval requires designing caching policies that enable fetching requested objects into a local user workstation from a remote file server with sufficiently short response time to support effective browsing interaction. An adequate caching policy should utilize properties of user perception and data representation within an information space. This dissertation describes a series of new techniques for caching objects within an information space and gives measurements of their performance across the NSFNET. These policies take advantage of special features of interactive retrieval within information spaces, such as initially fetching only the subset of requested objects that will be immediately displayed and prefetching additional objects during idle time when the user is considering which command to issue next. A prototype built by the author, the Telesophy System, supports interactive retrieval for information spaces across local-area networks and serves as a basis for identification of special features. To consider additional needs for efficient implementation across wide-area networks, the significant parameters and policies in implementing caching are systematically identified. Specific values of these caching parameters are used to evaluate the performance of a range of caching policies under a variety of interactions relevant to browsing information spaces. 
Finally, an incremental caching policy is proposed, which combines many techniques taking advantage of special features of interacting with information spaces. Measurements of the performance of this policy under a variety of conditions demonstrate that interactive retrieval is possible across wide-area networks and that appropriate optimization of the caching policy can produce performance comparable to that across local-area networks.
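The caching behaviour this abstract describes, fetching only what is needed for immediate display and prefetching during user think time, can be sketched as a small LRU cache. The class and its API are hypothetical illustrations of the idea, not the Telesophy System's actual code.

```python
from collections import OrderedDict

class BrowsingCache:
    """LRU object cache sketch for information-space browsing: fetch only
    the objects needed for immediate display, and warm the cache with
    additional objects during idle time (hypothetical API)."""

    def __init__(self, capacity, fetch):
        self.capacity, self.fetch = capacity, fetch   # fetch: id -> object
        self.store = OrderedDict()

    def get(self, obj_id):
        if obj_id in self.store:
            self.store.move_to_end(obj_id)            # mark recently used
            return self.store[obj_id]
        return self._insert(obj_id, self.fetch(obj_id))

    def prefetch(self, obj_ids):
        """Called while the user is deciding on the next command."""
        for obj_id in obj_ids:
            if obj_id not in self.store:
                self._insert(obj_id, self.fetch(obj_id))

    def _insert(self, obj_id, obj):
        self.store[obj_id] = obj
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)            # evict LRU entry
        return obj
```

On a wide-area network the remote `fetch` dominates response time, so hits produced by prefetching during idle periods are what make browsing feel local.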
APA, Harvard, Vancouver, ISO, and other styles
9

Tran, Allen Quoc-Luan. "A network management facility for a fault-tolerant distributed information retrieval system." Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2000. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape3/PQDD_0010/MQ53394.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Macfarlane, Andrew. "Distributed inverted files and performance : a study of parallelism and data distribution methods in IR." Thesis, City University London, 2000. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.342722.

Full text
APA, Harvard, Vancouver, ISO, and other styles
11

Shokouhi, Milad. "Federated Text Retrieval from Independent Collections." RMIT University. Computer Science and Information Technology, 2008. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20080521.151632.

Full text
Abstract:
Federated information retrieval is a technique for searching multiple text collections simultaneously. Queries are submitted to a subset of collections that are most likely to return relevant answers. The results returned by selected collections are integrated and merged into a single list. Federated search is preferred over centralized search alternatives in many environments. For example, commercial search engines such as Google cannot index uncrawlable hidden web collections; federated information retrieval systems can search the contents of hidden web collections without crawling. In enterprise environments, where each organization maintains an independent search engine, federated search techniques can provide parallel search over multiple collections. There are three major challenges in federated search. For each query, a subset of collections that are most likely to return relevant documents is selected. This creates the collection selection problem. To be able to select suitable collections, federated information retrieval systems acquire some knowledge about the contents of each collection, creating the collection representation problem. The results returned from the selected collections are merged before the final presentation to the user. This final step is the result merging problem. In this thesis, we propose new approaches for each of these problems. Our suggested methods, for collection representation, collection selection, and result merging, outperform state-of-the-art techniques in most cases. We also propose novel methods for estimating the number of documents in collections, and for pruning unnecessary information from collection representation sets. Although management of document duplication has been cited as one of the major problems in federated search, prior research in this area often assumes that collections are free of overlap.
We investigate the effectiveness of federated search on overlapped collections, and propose new methods for maximizing the number of distinct relevant documents in the final merged results. In summary, this thesis introduces several new contributions to the field of federated information retrieval, including practical solutions to some historically unsolved problems in federated search, such as document duplication management. We test our techniques on multiple testbeds that simulate both hidden web and enterprise search environments.
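The collection selection step can be illustrated with a deliberately simplified scorer. It assumes each collection is represented by a small sample of its documents (as obtained, for instance, by query-based sampling) and ranks collections by how well the query terms match each sample; this is a stand-in for CORI-style selection, not the thesis's actual method, and all names in it are illustrative.

```python
def select_collections(query_terms, samples, top_n=2):
    """samples: collection name -> list of sampled documents (term lists).
    Scores each collection by the fraction of its sampled documents that
    contain each query term, then returns the top_n best collections."""
    scores = {}
    for name, docs in samples.items():
        score = 0.0
        for t in query_terms:
            df = sum(1 for d in docs if t in d)       # sample doc frequency
            score += df / (len(docs) or 1)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Only the selected collections receive the query, which is what keeps federated search cheap when hundreds of independent collections are available.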
APA, Harvard, Vancouver, ISO, and other styles
12

Scherle, Ryan. "Looking for a haystack: selecting data sources in a distributed retrieval system." [Bloomington, Ind.] : Indiana University, 2006. http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqdiss&rft_dat=xri:pqdiss:3240033.

Full text
Abstract:
Thesis (Ph.D.)--Indiana University, Dept. of Computer Science and Cognitive Science, 2006.
"Title from dissertation home page (viewed July 17, 2007)." Source: Dissertation Abstracts International, Volume: 67-10, Section: B, page: 5859. Advisers: David B. Leake; Michael Gasser.
APA, Harvard, Vancouver, ISO, and other styles
13

Al-Shakarchi, Ahmad. "Scalable audio processing across heterogeneous distributed resources : an investigation into distributed audio processing for Music Information Retrieval." Thesis, Cardiff University, 2013. http://orca.cf.ac.uk/47855/.

Full text
Abstract:
Audio analysis algorithms and frameworks for Music Information Retrieval (MIR) are expanding rapidly, providing new ways to discover non-trivial information from audio sources, beyond that which can be ascertained from unreliable metadata such as ID3 tags. MIR is a broad field and many aspects of the algorithms and analysis components that are used are more accurate given a larger dataset for analysis, and often require extensive computational resources. This thesis investigates if, through the use of modern distributed computing techniques, it is possible to design an MIR system that is scalable as the number of participants increases, which adheres to copyright laws and restrictions, whilst at the same time enabling access to a global database of music for MIR applications and research. A scalable platform for MIR analysis would be of benefit to the MIR and scientific community as a whole. A distributed MIR platform that encompasses the creation of MIR algorithms and workflows, their distribution, results collection and analysis, is presented in this thesis. The framework, called DART - Distributed Audio Retrieval using Triana - is designed to facilitate the submission of MIR algorithms and computational tasks against either remotely held music and audio content, or audio provided and distributed by the MIR researcher. Initially a detailed distributed DART architecture is presented, along with simulations to evaluate the validity and scalability of the architecture. The idea of a parameter sweep experiment to find the optimal parameters of the Sub-Harmonic Summation (SHS) algorithm is presented, in order to test the platform and use it to perform useful and real-world experiments that contribute new knowledge to the field. 
DART is tested on various pre-existing distributed computing platforms and the feasibility of creating a scalable infrastructure for workflow distribution is investigated throughout the thesis, along with the different workflow distribution platforms that could be integrated into the system. The DART parameter sweep experiments begin on a small scale, working up towards the goal of running experiments on thousands of nodes, in order to truly evaluate the scalability of the DART system. The result of this research is a functional and scalable distributed MIR research platform that is capable of performing real world MIR analysis, as demonstrated by the successful completion of several large scale SHS parameter sweep experiments across a variety of different input data - using various distribution methods - and through finding the optimal parameters of the implemented SHS algorithm. DART is shown to be highly adaptable both in terms of the distributed MIR analysis algorithm, as well as the distribution
APA, Harvard, Vancouver, ISO, and other styles
14

Stegmaier, Florian. "Unified Retrieval in Distributed and Heterogeneous Multimedia Information Systems." Supervisor: Harald Kosch. Passau : Universitätsbibliothek der Universität Passau, 2014. http://d-nb.info/1053119267/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Abusukhon, Ahmad Salameh. "An investigation into improving the load balance and query throughput of distributed information retrieval." Thesis, University of Bath, 2008. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.505715.

Full text
APA, Harvard, Vancouver, ISO, and other styles
16

Lu, Chengye. "Peer to peer English/Chinese cross-language information retrieval." Thesis, Queensland University of Technology, 2008. https://eprints.qut.edu.au/26444/1/Chengye_Lu_Thesis.pdf.

Full text
Abstract:
Peer to peer systems have been widely used on the internet. However, most peer to peer information systems still lack some important features, such as cross-language IR (Information Retrieval) and collection selection/fusion. Cross-language IR is a state-of-the-art research area in the IR community that has not yet been used in any real-world IR system. It allows a user to issue a query in one language and receive documents in other languages. In a typical peer to peer environment, users come from multiple countries and their collections are in multiple languages, so cross-language IR can help users find documents more easily. For example, many Chinese researchers search for research papers in both Chinese and English; with cross-language IR, they can issue one query in Chinese and get documents in both languages. The Out Of Vocabulary (OOV) problem is one of the key research areas in cross-language information retrieval. In recent years, web mining has been shown to be an effective approach to solving this problem. However, how to extract Multiword Lexical Units (MLUs) from web content, and how to select the correct translations from the extracted candidate MLUs, remain two difficult problems in web-mining-based automated translation. Discovering resource descriptions and merging results obtained from remote search engines are two key issues in distributed information retrieval studies. In uncooperative environments, query-based sampling and normalized-score-based merging are well-known approaches to these problems. However, such approaches only consider the content of the remote database and not the retrieval performance of the remote search engine. This thesis presents research on building a peer to peer IR system with cross-language IR and an advanced collection profiling technique for fusion.

In particular, the thesis first presents a new Chinese term measurement and a new Chinese MLU extraction process that work well on small corpora, along with an approach for selecting MLUs more accurately. It then proposes a collection profiling strategy that can discover not only the content of a collection but also the retrieval performance of the remote search engine. Based on collection profiling, a web-based query classification method and two collection fusion approaches are developed and presented. Our experiments show that the proposed strategies are effective in merging results in uncooperative peer to peer environments. Here, an uncooperative environment is one in which each peer is autonomous: peers are willing to share documents but not collection statistics, which is typical of peer to peer IR. Finally, all of these approaches are combined to build a secure peer to peer multilingual IR system that cooperates through X.509 and email.
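The normalized-score merging baseline mentioned in this abstract can be sketched directly. This is a generic min-max normalization merge, shown only to illustrate the baseline the thesis builds on, not its collection-fusion approaches; each engine's raw scores are rescaled to [0, 1] so that lists from engines with incompatible scoring scales can be interleaved.

```python
def normalize(results):
    """Min-max normalize one engine's (doc_id, score) list to [0, 1] so
    scores from different search engines become comparable."""
    scores = [s for _, s in results]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0            # avoid division by zero
    return [(doc, (s - lo) / span) for doc, s in results]

def merge_normalized(result_lists):
    """Merge several engines' result lists by normalized score."""
    pooled = [pair for rl in result_lists for pair in normalize(rl)]
    return sorted(pooled, key=lambda x: -x[1])
```

A weakness of this baseline, and the thesis's motivation for profiling retrieval performance, is that every engine's top document maps to the same normalized score regardless of how good that engine actually is.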
APA, Harvard, Vancouver, ISO, and other styles
17

Lu, Chengye. "Peer to peer English/Chinese cross-language information retrieval." Queensland University of Technology, 2008. http://eprints.qut.edu.au/26444/.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Li, Xiaodong. "RDSS: a reliable and efficient distributed storage system." Ohio University / OhioLINK, 2004. http://www.ohiolink.edu/etd/view.cgi?ohiou1103127547.

Full text
APA, Harvard, Vancouver, ISO, and other styles
19

Chilappagari, Sairam. "Role of web services for globally distributed information retrieval systems in a grid environment: implementation and performance analysis of a prototype." Fairfax, VA : George Mason University, 2008. http://hdl.handle.net/1920/3220.

Full text
Abstract:
Thesis (M.S.)--George Mason University, 2008.
Vita: p. 108. Thesis director: J. Mark Pullen. Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science. Title from PDF t.p. (viewed Aug. 28, 2008). Includes bibliographical references (p. 101-107). Also issued in print.
APA, Harvard, Vancouver, ISO, and other styles
20

Petratos, Panagiotis. "A heuristic information retrieval study : an investigation of methods for enhanced searching of distributed data objects exploiting bidirectional relevance feedback." Thesis, University of Bedfordshire, 2004. http://hdl.handle.net/10547/319931.

Full text
Abstract:
The primary aim of this research is to investigate methods of improving the effectiveness of current information retrieval systems. This aim can be achieved by accomplishing numerous supporting objectives. A foundational objective is to introduce a novel bidirectional, symmetrical fuzzy logic theory which may prove valuable to information retrieval, including internet searches of distributed data objects. A further objective is to design, implement and apply the novel theory to an experimental information retrieval system called ANACALYPSE, which automatically computes the relevance of a large number of unseen documents from expert relevance feedback on a small number of documents read. A further objective is to define a methodology used in this work as an experimental information retrieval framework consisting of multiple tables including various formulae which allow a plethora of syntheses of similarity functions, term weights, relative term frequencies, document weights, bidirectional relevance feedback and history-adjusted term weights. The evaluation of bidirectional relevance feedback reveals a better correspondence between system ranking of documents and users' preferences than feedback-free system ranking. The assessment of similarity functions reveals that the Cosine and Jaccard functions perform significantly better than the DotProduct and Overlap functions. The evaluation of history tracking of the documents visited from a root page reveals better system ranking of documents than tracking-free information retrieval. The assessment of stemming reveals that system information retrieval performance remains unaffected, while stop word removal does not appear to be beneficial and can sometimes be harmful.
The overall evaluation of the experimental information retrieval system, in comparison both to a leading-edge commercial information retrieval system and to the expert's gold standard of judged relevance according to established statistical correlation methods, reveals enhanced system information retrieval effectiveness.
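The four similarity functions this abstract compares are standard vector-space measures. A minimal sketch over term-weight dictionaries (the function definitions below are the textbook forms, assumed rather than taken from the thesis's actual formulae):

```python
def dot_product(a, b):
    # Sum of products of weights for terms shared by both vectors.
    return sum(a[t] * b[t] for t in a if t in b)

def cosine(a, b):
    # Dot product normalised by the vectors' Euclidean norms.
    norm = (sum(w * w for w in a.values()) ** 0.5) * (sum(w * w for w in b.values()) ** 0.5)
    return dot_product(a, b) / norm if norm else 0.0

def jaccard(a, b):
    # Shared weight relative to total squared weight minus the overlap.
    dp = dot_product(a, b)
    denom = sum(w * w for w in a.values()) + sum(w * w for w in b.values()) - dp
    return dp / denom if denom else 0.0

def overlap(a, b):
    # Shared weight relative to the smaller vector's squared weight.
    dp = dot_product(a, b)
    denom = min(sum(w * w for w in a.values()), sum(w * w for w in b.values()))
    return dp / denom if denom else 0.0
```

For a document `{"fuzzy": 1.0, "logic": 2.0}` and query `{"fuzzy": 1.0}`, Cosine and Jaccard penalise the unmatched document mass while Overlap does not, which is one reason such functions can rank the same collection quite differently.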
APA, Harvard, Vancouver, ISO, and other styles
21

Milliner, Stephen William. "Dynamic resolution of conceptual heterogeneity in large scale distributed information systems." Thesis, Queensland University of Technology, 2001.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
22

Koch, Douglas J. "Positioning the Reserve Headquarters Support (RHS) system for multi-layered enterprise use." Thesis, Monterey, California : Naval Postgraduate School, 2009. http://edocs.nps.edu/npspubs/scholarly/theses/2009/Sep/09Sep%5FKoch.pdf.

Full text
Abstract:
Thesis (M.S. in Information Technology Management)--Naval Postgraduate School, September 2009.
Thesis Advisor(s): Cook, Glenn. "September 2009." Description based on title screen as viewed on 6 November 2009. Author(s) subject terms: Enterprise architecture, project management, business process transformation, operating model, IT governance, IT systems, data quality, data migration, business operating model, personnel IT systems, HRM, ERP. Includes bibliographical references (p. 89-92). Also available in print.
APA, Harvard, Vancouver, ISO, and other styles
23

Miranda, Ackerman Eduardo Jacobo. "Extracting Causal Relations between News Topics from Distributed Sources." Doctoral thesis, Saechsische Landesbibliothek- Staats- und Universitaetsbibliothek Dresden, 2013. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-130066.

Full text
Abstract:
The overwhelming amount of online news presents a challenge called news information overload. To mitigate this challenge we propose a system to generate a causal network of news topics. To extract this information from distributed news sources, a system called Forest was developed. Forest retrieves documents that potentially contain causal information regarding a news topic. The documents are processed at the sentence level to extract causal relations and news topic references, the phrases used to refer to a news topic. Forest uses a machine learning approach to classify causal sentences, and then renders the potential cause and effect of each sentence. The potential cause and effect are then classified as news topic references, such as "The World Cup" or "The Financial Meltdown". Both classifiers use an algorithm developed within our working group; the algorithm performs better than several well-known classification algorithms for the aforementioned tasks. In our evaluations we found that participants consider causal information useful for understanding the news, and that while we cannot extract causal information for all news topics, it is highly likely that we can extract causal relations for the most popular news topics. To evaluate the accuracy of the extractions made by Forest, we completed a user survey. We found that by providing the top-ranked results, we obtained high accuracy in extracting causal relations between news topics.
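The step of rendering a causal sentence into a cause and an effect can be illustrated with a simple connective-pattern heuristic. Forest itself classifies sentences with machine learning; the patterns below are illustrative assumptions, shown only to make the cause/effect split concrete:

```python
import re

# Illustrative causal connectives; a real system such as Forest learns this
# decision with a trained classifier rather than a fixed pattern list.
CAUSAL_PATTERNS = [
    re.compile(r"^(?P<cause>.+?)\s+(?:led to|caused|resulted in)\s+(?P<effect>.+)$", re.I),
    re.compile(r"^(?P<effect>.+?)\s+(?:because of|due to)\s+(?P<cause>.+)$", re.I),
]

def extract_causal(sentence):
    """Return (cause, effect) if the sentence matches a causal pattern, else None."""
    for pattern in CAUSAL_PATTERNS:
        m = pattern.match(sentence.strip().rstrip("."))
        if m:
            return m.group("cause"), m.group("effect")
    return None
```

The extracted cause and effect phrases would then be matched against known news topic references ("The World Cup", "The Financial Meltdown") to add an edge to the causal network.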
APA, Harvard, Vancouver, ISO, and other styles
24

Paques, Henrique Wiermann. "The Ginga Approach to Adaptive Query Processing in Large Distributed Systems." Diss., Georgia Institute of Technology, 2003. http://hdl.handle.net/1853/5277.

Full text
Abstract:
Processing and optimizing ad-hoc and continual queries in an open environment with distributed, autonomous, and heterogeneous data servers (e.g., the Internet) pose several technical challenges. First, it is well known that optimized query execution plans constructed at compile time make some assumptions about the environment (e.g., network speed, data sources' availability). When such assumptions no longer hold at runtime, how can I guarantee the optimized execution of the query? Second, it is widely recognized that runtime adaptation is a complex and difficult task in terms of cost and benefit. How to develop an adaptation methodology that makes the runtime adaptation beneficial at an affordable cost? Last, but not least, are there any viable performance metrics and performance evaluation techniques for measuring the cost and validating the benefits of runtime adaptation methods? To address the new challenges posed by Internet query and search systems, several areas of computer science (e.g., database and operating systems) are exploring the design of systems that are adaptive to their environment. However, despite the large number of adaptive systems proposed in the literature up to now, most of them present a solution for adapting the system to a specific change to the runtime environment. Typically, these solutions are not easily "extendable" to allow the system to adapt to other runtime changes not predicted in their approach. In this dissertation, I study the problem of how to construct a framework where I can catalog the known solutions to query processing adaptation and how to develop an application that makes use of this framework. I call the solution to these two problems the Ginga approach.
I provide in this dissertation three main contributions: The first contribution is the adoption of the Adaptation Space concept combined with feedback-based control mechanisms for coordinating and integrating different kinds of query adaptations to different runtime changes. The second contribution is the development of a systematic approach, called Ginga, to integrate the adaptation space with feedback control that allows me to combine the generation of predefined query plans (at compile-time) with reactive adaptive query processing (at runtime), including policies and mechanisms for determining when to adapt, what to adapt, and how to adapt. The third contribution is a detailed study on how to adapt to two important runtime changes, and their combination, encountered during the execution of distributed queries: memory constraints and end-to-end delays.
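The combination this abstract describes, predefined plans generated at compile time with a feedback loop choosing among them at runtime, can be sketched roughly as follows. The plan names, monitored metrics, and thresholds are illustrative assumptions, not Ginga's actual policies:

```python
# Predefined query plans (compile time), selected by a feedback loop that
# monitors runtime conditions (memory constraints and end-to-end delays).
PLANS = {
    "default": "hash join, fully pipelined execution",
    "low_memory": "sort-merge join, spill partitions to disk",
    "high_delay": "reorder joins, prioritise fast data sources",
}

def choose_plan(free_memory_mb, source_delay_ms,
                memory_floor_mb=256, delay_ceiling_ms=500):
    """When to adapt: a monitored metric crosses its threshold.
    What/how to adapt: switch to the predefined plan for that condition."""
    if free_memory_mb < memory_floor_mb:
        return "low_memory"
    if source_delay_ms > delay_ceiling_ms:
        return "high_delay"
    return "default"
```

Each (condition, plan) pair corresponds to one point in an adaptation space; the feedback controller simply moves the running query between points as the environment changes.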
APA, Harvard, Vancouver, ISO, and other styles
25

Wang, Xinyu. "Toward Scalable Hierarchical Clustering and Co-clustering Methods : application to the Cluster Hypothesis in Information Retrieval." Thesis, Lyon, 2017. http://www.theses.fr/2017LYSE2123/document.

Full text
Abstract:
Comme une méthode d’apprentissage automatique non supervisé, la classification automatique est largement appliquée dans des tâches diverses. Différentes méthodes de la classification ont leurs caractéristiques uniques. La classification hiérarchique, par exemple, est capable de produire une structure binaire en forme d’arbre, appelée dendrogramme, qui illustre explicitement les interconnexions entre les instances de données. Le co-clustering, d’autre part, génère des co-clusters, contenant chacun un sous-ensemble d’instances de données et un sous-ensemble d’attributs de données. L’application de la classification sur les données textuelles permet d’organiser les documents et de révéler les connexions parmi eux. Cette caractéristique est utile dans de nombreux cas, par exemple, dans les tâches de recherche d’informations basées sur la classification. À mesure que la taille des données disponibles augmente, la demande de puissance du calcul augmente. En réponse à cette demande, de nombreuses plates-formes du calcul distribué sont développées. Ces plates-formes utilisent les puissances du calcul collectives des machines, pour couper les données en morceaux, assigner des tâches du calcul et effectuer des calculs simultanément.Dans cette thèse, nous travaillons sur des données textuelles. Compte tenu d’un corpus de documents, nous adoptons l’hypothèse de «bag-of-words» et applique le modèle vectoriel. Tout d’abord, nous abordons les tâches de la classification en proposant deux méthodes, Sim_AHC et SHCoClust. Ils représentent respectivement un cadre des méthodes de la classification hiérarchique et une méthode du co-clustering hiérarchique, basé sur la proximité. Nous examinons leurs caractéristiques et performances du calcul, grâce de déductions mathématiques, de vérifications expérimentales et d’évaluations. 
Ensuite, nous appliquons ces méthodes pour tester l’hypothèse du cluster, qui est l’hypothèse fondamentale dans la recherche d’informations basée sur la classification. Dans de tels tests, nous utilisons la recherche du cluster optimale pour évaluer l’efficacité de recherche pour tout les méthodes hiérarchiques unifiées par Sim_AHC et par SHCoClust . Nous aussi examinons l’efficacité du calcul et comparons les résultats. Afin d’effectuer les méthodes proposées sur des ensembles de données plus vastes, nous sélectionnons la plate-forme d’Apache Spark et fournissons implémentations distribuées de Sim_AHC et de SHCoClust. Pour le Sim_AHC distribué, nous présentons la procédure du calcul, illustrons les difficultés rencontrées et fournissons des solutions possibles. Et pour SHCoClust, nous fournissons une implémentation distribuée de son noyau, l’intégration spectrale. Dans cette implémentation, nous utilisons plusieurs ensembles de données qui varient en taille pour examiner l’échelle du calcul sur un groupe de noeuds
As a major type of unsupervised machine learning method, clustering has been widely applied in various tasks. Different clustering methods have different characteristics. Hierarchical clustering, for example, is capable of outputting a binary tree-like structure, which explicitly illustrates the interconnections among data instances. Co-clustering, on the other hand, generates co-clusters, each containing a subset of data instances and a subset of data attributes. Applying clustering to textual data makes it possible to organize input documents and reveal connections among them. This characteristic is helpful in many cases, for example, in cluster-based Information Retrieval tasks. As the size of available data increases, the demand for computing power increases. In response to this demand, many distributed computing platforms have been developed. These platforms use the collective computing power of commodity machines to partition data, assign computing tasks and perform computation concurrently. In this thesis, we first address text clustering tasks by proposing two methods, Sim_AHC and SHCoClust. They respectively represent a similarity-based hierarchical clustering framework and a similarity-based hierarchical co-clustering method. We examine their properties and performance through mathematical deduction, experimental verification and evaluation. Then we apply these methods in testing the cluster hypothesis, which is the fundamental assumption in cluster-based Information Retrieval. In such tests, we apply the optimal cluster search to evaluate the retrieval effectiveness of different clustering methods. We examine the computing efficiency and compare the results of the proposed tests. In order to perform clustering on larger datasets, we select the Apache Spark platform and provide distributed implementations of Sim_AHC and of SHCoClust. For distributed Sim_AHC, we present the designed computing procedure, illustrate the difficulties confronted and provide possible solutions.
And for SHCoClust, we provide a distributed implementation of its core, spectral embedding. In this implementation, we use several datasets that vary in size to examine scalability on a cluster of nodes.
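Similarity-based agglomerative hierarchical clustering, the family of methods Sim_AHC unifies, can be sketched with single-linkage merging over cosine similarities. This is a generic textbook illustration, not the thesis's actual Sim_AHC formulation:

```python
def cosine(a, b):
    # Cosine similarity between two dense term-weight vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def single_link_hac(vectors, threshold):
    """Repeatedly merge the two most similar clusters until no pair's
    similarity exceeds the threshold. Cluster similarity = max pairwise
    cosine between members (single linkage)."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(cosine(vectors[a], vectors[b])
                          for a in clusters[i] for b in clusters[j])
                if sim > best:
                    best, pair = sim, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] += clusters.pop(j)
    return clusters
```

The naive pairwise loop above is what makes the distributed Spark implementation worthwhile: similarity computation is quadratic in the number of documents and dominates the cost on large corpora.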
APA, Harvard, Vancouver, ISO, and other styles
26

Ducrou, Amanda Joanne. "Complete interoperability in healthcare technical, semantic and process interoperability through ontology mapping and distributed enterprise integration techniques /." Access electronically, 2009. http://ro.uow.edu.au/theses/3048.

Full text
APA, Harvard, Vancouver, ISO, and other styles
27

Osborn, Viola. "Identifying At-Risk Students: An Assessment Instrument for Distributed Learning Courses in Higher Education." Thesis, University of North Texas, 2000. https://digital.library.unt.edu/ark:/67531/metadc2457/.

Full text
Abstract:
The current period of rapid technological change, particularly in the area of mediated communication, has combined with new philosophies of education and market forces to bring upheaval to the realm of higher education. Technical capabilities exceed our knowledge of whether expenditures on hardware and software lead to corresponding gains in student learning. Educators do not yet possess sophisticated assessments of what we may be gaining or losing as we widen the scope of distributed learning. The purpose of this study was not to draw sweeping conclusions with respect to the costs or benefits of technology in education. The researcher focused on a single issue involved in educational quality: assessing the ability of a student to complete a course. Previous research in this area indicates that attrition rates are often higher in distributed learning environments. Educators and students may benefit from a reliable instrument to identify those students who may encounter difficulty in these learning situations. This study is aligned with research focused on the individual engaged in seeking information, assisted or hindered by the capabilities of the computer information systems that create and provide access to information. Specifically, the study focused on the indicators of completion for students enrolled in video conferencing and Web-based courses. In the final version, the Distributed Learning Survey encompassed thirteen indicators of completion. The results of this study of 396 students indicated that the Distributed Learning Survey represented a reliable and valid instrument for identifying at-risk students in video conferencing and Web-based courses where the student population is similar to the study participants. 
Educational level, GPA, credit hours taken in the semester, study environment, motivation, computer confidence, and the number of previous distributed learning courses accounted for most of the predictive power in the discriminant function based on student scores from the survey.
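A discriminant function over the predictors the abstract names can be sketched as a weighted score with a classification cutoff. The weights and cutoff below are illustrative placeholders, not the study's estimated coefficients:

```python
# Hypothetical linear discriminant over the seven predictors named above;
# every weight here is an illustrative assumption, not an estimate from the study.
WEIGHTS = {
    "educational_level": 0.3,
    "gpa": 0.5,
    "credit_hours": -0.2,
    "study_environment": 0.2,
    "motivation": 0.4,
    "computer_confidence": 0.3,
    "prior_dl_courses": 0.25,
}

def at_risk(student, cutoff=2.0):
    """Score a student on the discriminant function; below the cutoff = at risk."""
    score = sum(WEIGHTS[k] * student.get(k, 0.0) for k in WEIGHTS)
    return score < cutoff
```

In the actual study the discriminant function would be fitted to survey responses; the sketch only shows how such a function turns survey scores into an at-risk flag.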
APA, Harvard, Vancouver, ISO, and other styles
28

Nguyen, The An Binh [Verfasser], Ralf [Akademischer Betreuer] Steinmetz, and Michael [Akademischer Betreuer] Zink. "Quality-aware Tasking in Mobile Opportunistic Networks - Distributed Information Retrieval and Processing utilizing Opportunistic Heterogeneous Resources. / The An Binh Nguyen ; Ralf Steinmetz, Michael Zink." Darmstadt : Universitäts- und Landesbibliothek Darmstadt, 2018. http://d-nb.info/1167926331/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
29

Brand, Jacobus Edwin. "An instrument analysis system based on a modern relational database and distributed software architecture." Thesis, Stellenbosch : Stellenbosch University, 2003. http://hdl.handle.net/10019.1/53269.

Full text
Abstract:
Thesis (MBA)--Stellenbosch University, 2003.
ENGLISH ABSTRACT: This document discusses the development of a personal computer based financial instrument analysis system, based on the information from a relatively old sequential-file-based data source. The aim is to modernise the system to use the latest software and data storage technology. Chapter 2 covers the principles used for the design of the system: first the principles for the development of relational databases, then the development of personal computer based software architecture, to explain the choices made in the design of the system. Chapter 3 discusses the design and implementation of the system in more detail, based on the principles discussed in Chapter 2. Recommendations include a possible shift in architectural layout as well as recommendations for expansion of both the data stored and the analysis performed on the information.
AFRIKAANSE OPSOMMING: Hierdie dokument bespreek die ontwikkeling van ‘n persoonlike rekenaar gebaseerde finansiële instrument analise stelsel, gebaseer op inligting uit ‘n relatiewe ou sekwensiële leêr gebaseerde databron. Die doel is om die stelsel te moderniseer om sodoende van die nuutste sagteware en hardeware tegnologie gebruik te maak. Die beginsels wat gebruik is vir die ontwerp van die stelsel word kortliks in Hoofstuk 2 bespreek. Die beginsels vir die ontwerp van ‘n relasionele databasis word bespreek. Hierna word die ontwikkeling van persoonlike rekenaar gebaseerde sagteware argitektuur bespreek om meer lig te werp op die keuses wat geneem is met ontwerp van die stelsel se argitektuur. Hoofstuk 3 bespreek die ontwerp en implementering van die stelsel in meer detail, gebaseer op die beginsels bespreek in Hoofstuk 2. Voorstelle vir verbetering van die stelsel sluit in detail veranderings aan die argitektuur van die stelsel, sowel as voorstelle vir die uitbreiding van die stelsel wat betref tipe data wat gestoor word en die analitiese vermoëns van die stelsel.
APA, Harvard, Vancouver, ISO, and other styles
30

Augusto, Luiz Daniel Creao. "Arquitetura e implementação de um sistema distribuído e recuperação de informação." Universidade de São Paulo, 2010. http://www.teses.usp.br/teses/disponiveis/45/45134/tde-12072010-110036/.

Full text
Abstract:
A busca por documentos relevantes ao usuário é um problema que se torna mais custoso conforme as bases de conhecimento crescem em seu ritmo acelerado. Este problema passou a ser resolvido por sistemas distribuídos, devido a sua escalabilidade e tolerância a falhas. O desenvolvimento de sistemas voltados a estas enormes bases de conhecimento -- e a maior de todas, a Internet -- é uma indústria que movimenta bilhões de dólares por ano no mundo inteiro e criou gigantes. Neste trabalho, são apresentadas e discutidas estruturas de dados e arquiteturas distribuídas que tratam o problema de indexar e buscar grandes coleções de documentos em sistemas distribuídos, alcançando grande desempenho e escalabilidade. Serão também discutidos alguns dos grandes sistemas de busca da atualidade, como o Google e o Apache Solr, além do planejamento de uma grande aplicação com protótipo em desenvolvimento. Um projeto próprio de sistema de busca distribuído foi implementado, baseado no Lucene, com idéias coletadas noutros trabalhos e outras novas. Em nossos experimentos, o sistema distribuído desenvolvido neste trabalho superou o Apache Solr com uma vazão 37,4\\% superior e mostrou números muito superiores a soluções não-distribuídas em hardware de custo muito superior ao nosso cluster.
The search for documents relevant to the end user is a problem that becomes more costly as knowledge bases grow at an accelerating pace. This problem came to be solved by distributed systems, thanks to their scalability and fault tolerance. The development of systems aimed at these enormous knowledge bases -- including the largest of all, the Internet -- is an industry that moves billions of dollars a year worldwide and has created giants. In this work, we present and discuss data structures and distributed architectures that address the problem of indexing and searching large document collections in distributed systems with high performance and scalability. We also discuss some of today's major search engines, such as Google and Apache Solr, as well as the planning of a large application with a prototype under development. Finally, our own distributed search system was implemented, based on Lucene, combining ideas collected from other works with new ones. In our experiments, the distributed system developed in this work surpassed Apache Solr with 37.4\\% higher throughput and showed far better numbers than non-distributed solutions running on hardware far more expensive than our cluster.
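The core of a distributed search system like the one described is a broker that fans a query out to index shards and merges their partial rankings. A generic sketch with a toy term-count score (illustrative only, not the thesis's Lucene-based implementation):

```python
import heapq

def search_shard(shard, query, k):
    """Each shard scores only its own documents; here a toy term-count score
    stands in for a real ranking function such as Lucene's."""
    scored = [(sum(doc.count(t) for t in query), doc_id)
              for doc_id, doc in shard.items()]
    return heapq.nlargest(k, scored)

def distributed_search(shards, query, k):
    """Broker: fan the query out to every shard, then merge the partial
    top-k lists into a single global ranking."""
    partials = [search_shard(s, query, k) for s in shards]
    merged = heapq.nlargest(k, (hit for p in partials for hit in p))
    return [doc_id for score, doc_id in merged if score > 0]
```

Because each shard only needs to return its local top k, the broker merges at most `k * num_shards` candidates regardless of collection size, which is what lets throughput scale with the number of machines.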
APA, Harvard, Vancouver, ISO, and other styles
31

Ives, Zachary G. "Efficient query processing for data integration /." Thesis, Connect to this title online; UW restricted, 2002. http://hdl.handle.net/1773/6864.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Vilsmaier, Christian. "Contextualized access to distributed and heterogeneous multimedia data sources." Thesis, Lyon, INSA, 2014. http://www.theses.fr/2014ISAL0094/document.

Full text
Abstract:
Rendre les données multimédias disponibles en ligne devient moins cher et plus pratique sur une base quotidienne, par exemple par les utilisateurs eux-mêmes. Des phénomènes du Web comme Facebook, Twitter et Flickr bénéficient de cette évolution. Ces phénomènes et leur acceptation accrue conduisent à une multiplication du nombre d’images disponibles en ligne. La taille cumulée de ces images souvent publiques et donc consultables, est de l’ordre de plusieurs zettaoctets. L’exécution d’une requête de similarité sur de tels volumes est un défi que la communauté scientifique commence à cibler. Une approche envisagée pour faire face à ce problème propose d’utiliser un système distribué et hétérogène de recherche d’images basé sur leur contenu (CBIRs). De nombreux problèmes émergent d’un tel scénario. Un exemple est l’utilisation de formats de métadonnées distincts pour décrire le contenu des images; un autre exemple est l’information technique et structurelle inégale. Les métriques individuelles qui sont utilisées par les CBIRs pour calculer la similarité entre les images constituent un autre exemple. Le calcul de bons résultats dans ce contexte s’avère ainsi une tâche très laborieuse qui n’est pas encore scientifiquement résolue. Le problème principalement abordé dans cette thèse est la recherche de photos de CBIRs similaires à une image donnée comme réponse à une requête multimédia distribuée. La contribution principale de cette thèse est la construction d’un réseau de CBIRs sensible à la sémantique des contenus (CBIRn). Ce CBIRn sémantique est capable de collecter et fusionner les résultats issus de sources externes spécialisées. Afin d’être en mesure d’intégrer de telles sources extérieures, prêtes à rejoindre le réseau, mais pas à divulguer leur configuration, un algorithme a été développé capable d’estimer la configuration d’un CBIRS.
En classant les CBIRs et en analysant les requêtes entrantes, les requêtes d’image sont exclusivement transmises aux CBIRs les plus appropriés. De cette façon, les images sans intérêt pour l’utilisateur peuvent être omises à l’avance. Les images ainsi retournées sont considérées comme similaires par rapport à l’image donnée pour la requête. La faisabilité de l’approche et l’amélioration obtenue par le processus de recherche sont démontrées par un développement prototypique et son évaluation utilisant des images d’ImageNet. Le nombre d’images pertinentes renvoyées par l’approche de cette thèse en réponse à une requête image est supérieur d’un facteur 4.75 par rapport au résultat obtenu par un réseau de CBIRs prédéfini
Making multimedia data available online becomes less expensive and more convenient on a daily basis. This development promotes web phenomena such as Facebook, Twitter, and Flickr. These phenomena and their increased acceptance in society in turn lead to a multiplication of the amount of images available online. This vast amount of frequently public, and therefore searchable, images already exceeds the zettabyte bound. Executing a similarity search over the magnitude of images that are publicly available and receiving a top quality result is a challenge that the scientific community has recently attempted to rise to. One approach to cope with this problem assumes the use of distributed heterogeneous Content Based Image Retrieval systems (CBIRs). Following from this, the problems that emerge from a distributed query scenario must be dealt with: for example, the involved CBIRs' usage of distinct metadata formats for describing their content, as well as their unequal technical and structural information. An additional issue is the individual metrics that are used by the CBIRs to calculate the similarity between pictures, as well as their specific ways of being combined. Overall, receiving good results in this environment is a very labor-intensive task which has been scientifically but not yet comprehensively explored. The problem primarily addressed in this work is the collection of pictures from CBIRs that are similar to a given picture, as a response to a distributed multimedia query. The main contribution of this thesis is the construction of a network of Content Based Image Retrieval systems that are able to extract and exploit the information about an input image's semantic concept. This so-called semantic CBIRn is mainly composed of CBIRs that are configured by the semantic CBIRn itself. Complementarily, there is a possibility that allows the integration of specialized external sources.
The semantic CBIRn is able to collect and merge results from all of these attached CBIRs. In order to be able to integrate external sources that are willing to join the network, but are not willing to disclose their configuration, an algorithm was developed that approximates these configurations. By categorizing existing as well as external CBIRs and analyzing incoming queries, image queries are exclusively forwarded to the most suitable CBIRs. In this way, images that are not of any use for the user can be omitted beforehand. The returned images are then rendered comparable in order to merge them into one single result list of images that are similar to the input image. The feasibility of the approach and the resulting improvement of the search process are demonstrated by a prototypical implementation. Using this prototypical implementation, an augmentation of the number of returned images that are of the same semantic concept as the input image is achieved by a factor of 4.75 with respect to a predefined non-semantic CBIRn.
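Forwarding image queries only to the most suitable retrieval systems can be sketched as concept-overlap routing. The registry, concept labels, and scoring below are illustrative assumptions, not the thesis's actual categorization algorithm:

```python
# Route a query to the retrieval systems whose indexed semantic concepts
# best match the query's concept; unrelated CBIRs are skipped entirely.
REGISTRY = {
    "birds-db": {"animal", "bird"},
    "cars-db": {"vehicle", "car"},
    "general-db": {"animal", "vehicle", "landscape"},
}

def route_query(query_concepts, registry, max_targets=2):
    """Score each CBIR by concept overlap and forward to the best ones only."""
    ranked = sorted(registry,
                    key=lambda name: len(registry[name] & query_concepts),
                    reverse=True)
    return [name for name in ranked[:max_targets]
            if registry[name] & query_concepts]
```

Skipping CBIRs with zero concept overlap is what lets images "not of any use for the user" be omitted before any similarity computation happens.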
APA, Harvard, Vancouver, ISO, and other styles
33

Domínguez, Sal David. "Analysis and optimization of question answering systems." Doctoral thesis, Universitat Politècnica de Catalunya, 2010. http://hdl.handle.net/10803/78011.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Gerbier, Emilie. "Effet du type d’agencement temporel des répétitions d’une information sur la récupération explicite." Thesis, Lyon 2, 2011. http://www.theses.fr/2011LYO20029/document.

Full text
Abstract:
La façon dont une information se répète au cours du temps a une influence sur la façon dont nous nous souviendrons de cette information. Les recherches en psychologie ont mis en évidence l’effet de pratique distribuée, selon lequel on retient mieux les informations qui se répètent avec des intervalles inter-répétitions longs que celles qui se répètent avec des intervalles courts. Nos travaux ont porté spécifiquement sur les situations où l’information se répète sur plusieurs jours, et nous avons comparé l’efficacité relative de différents types d’agencement temporel des répétitions. Un agencement uniforme consiste en des répétitions se produisant à intervalles réguliers, un agencement expansif en des répétitions se produisant selon des intervalles de plus en plus espacés, et un agencement contractant en des répétitions se produisant selon des intervalles de plus en plus rapprochés. Les Expériences 1 et 2 consistaient en une phase d’apprentissage d’une semaine et ont révélé la supériorité des agencements expansif et uniforme après un délai de rétention de deux jours. L’Expérience 3 consistait en une phase d’apprentissage de deux semaines, et les sujets étaient ensuite testés lors de trois délais de rétention différents (2, 6 ou 13 jours). La supériorité de l’agencement expansif sur les deux autres agencements est apparue progressivement, suggérant que les différents agencements induisaient des taux d’oubli différents. Nous avons également tenté de tester différentes théories explicatives des effets de l’agencement temporel des répétitions sur la mémorisation, en particulier les théories de la variabilité de l’encodage (Expérience 4) et de la récupération en phase d’étude (Expérience 2). Les résultats observés tendent à confirmer la théorie de la récupération en phase d’étude. Nous insistons sur l’importance de la prise en compte des apports des autres disciplines des sciences cognitives dans l’étude de l’effet de pratique distribuée
How information is repeated over time determines future recollection of this information. Studies in psychology revealed a distributed practice effect: one retains information better when its occurrences are separated by long lags rather than by short lags. Our studies focused specifically on cases in which items were repeated over several days. We compared the efficiency of three different temporal schedules of repetitions: a uniform schedule, in which repetitions occur at equal intervals; an expanding schedule, in which repetitions occur at longer and longer intervals; and a contracting schedule, in which repetitions occur at shorter and shorter intervals. In Experiments 1 and 2, the learning phase lasted one week and the retention interval lasted two days. The expanding and uniform schedules were shown to be more efficient than the contracting schedule. In Experiment 3, the learning phase lasted two weeks and the retention interval lasted 2, 6, or 13 days. The superiority of the expanding schedule over the other two schedules appeared gradually as the retention interval increased, suggesting that different schedules yielded different forgetting rates. We also tried to test major theories of the distributed practice effect, such as the encoding variability (Experiment 4) and the study-phase retrieval (Experiment 2) theories. Our results appeared to be consistent with the study-phase retrieval theory. We concluded our dissertation by emphasizing the importance of considering findings from other areas in cognitive science, especially neuroscience and computer science, in the study of the distributed practice effect.
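The three repetition schedules compared in this abstract can be generated from one another so that all span the same total learning period. A small sketch (the gap parameters and day offsets are illustrative, not the experiments' exact schedules):

```python
def schedule(kind, first_gap=1, step=2, repetitions=4):
    """Cumulative day offsets of repetitions; all three kinds span the
    same total period so only the spacing pattern differs."""
    expanding = [first_gap + i * step for i in range(repetitions)]
    if kind == "expanding":
        gaps = expanding                      # longer and longer intervals
    elif kind == "contracting":
        gaps = expanding[::-1]                # shorter and shorter intervals
    elif kind == "uniform":
        mean = sum(expanding) / repetitions   # equal intervals, same total span
        gaps = [mean] * repetitions
    else:
        raise ValueError(kind)
    days, total = [], 0
    for g in gaps:
        total += g
        days.append(total)
    return days
```

With the defaults, the expanding schedule yields repetitions on days 1, 4, 9, 16 and the contracting schedule on days 7, 12, 15, 16: the same four repetitions and the same endpoint, only the temporal arrangement differs, which is exactly the manipulation the experiments compare.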
APA, Harvard, Vancouver, ISO, and other styles
35

Conte, Simone Ivan. "The Sea of Stuff : a model to manage shared mutable data in a distributed environment." Thesis, University of St Andrews, 2019. http://hdl.handle.net/10023/16827.

Full text
Abstract:
Managing data is one of the main challenges in distributed systems and computer science in general. Data is created, shared, and managed across heterogeneous distributed systems of users, services, applications, and devices without a clear and comprehensive data model. This technological fragmentation and lack of a common data model result in a poor understanding of what data is, how it evolves over time, how it should be managed in a distributed system, and how it should be protected and shared. From a user perspective, for example, backing up data over multiple devices is a hard and error-prone process, or synchronising data with a cloud storage service can result in conflicts and unpredictable behaviours. This thesis identifies three challenges in data management: (1) how to extend the current data abstractions so that content, for example, is accessible irrespective of its location, versionable, and easy to distribute; (2) how to enable transparent data storage relative to locations, users, applications, and services; and (3) how to allow data owners to protect data against malicious users and automatically control content over a distributed system. These challenges are studied in detail in relation to the current state of the art and addressed throughout the rest of the thesis. The artefact of this work is the Sea of Stuff (SOS), a generic data model of immutable self-describing location-independent entities that allow the construction of a distributed system where data is accessible and organised irrespective of its location, easy to protect, and can be automatically managed according to a set of user-defined rules. The evaluation of this thesis demonstrates the viability of the SOS model for managing data in a distributed system and using user-defined rules to automatically manage data across multiple nodes.
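Immutable, self-describing, location-independent entities are naturally modelled with content addressing: an entity's identity is a hash of its content, so any copy on any node resolves to the same name, and "mutation" produces a new entity while old versions remain retrievable. The sketch below is a generic illustration of that idea, not the SOS model's actual API:

```python
import hashlib

class Store:
    """Content-addressed store: an entity's id is the SHA-256 of its content,
    so entities are immutable and location-independent (every replica of the
    same content resolves to the same id)."""
    def __init__(self):
        self._blobs = {}
        self._parent = {}            # version lineage: new id -> previous id

    def put(self, content: bytes) -> str:
        guid = hashlib.sha256(content).hexdigest()
        self._blobs[guid] = content  # idempotent: same content, same id
        return guid

    def get(self, guid: str) -> bytes:
        return self._blobs[guid]

    def new_version(self, old_guid: str, content: bytes) -> str:
        # "Mutation" creates a new entity; the old version stays retrievable.
        new_guid = self.put(content)
        self._parent[new_guid] = old_guid
        return new_guid
```

Because `put` is idempotent, synchronising such a store across devices cannot produce the conflicts described for conventional cloud sync: two nodes holding the same content necessarily agree on its identity.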
APA, Harvard, Vancouver, ISO, and other styles
36

Harvesf, Cyrus Mehrabaun. "The design and implementation of a robust, cost-conscious peer-to-peer lookup service." Diss., Atlanta, Ga. : Georgia Institute of Technology, 2008. http://hdl.handle.net/1853/26559.

Full text
Abstract:
Thesis (Ph.D)--Electrical and Computer Engineering, Georgia Institute of Technology, 2009.
Committee Chair: Blough, Douglas; Committee Member: Liu, Ling; Committee Member: Owen, Henry; Committee Member: Riley, George; Committee Member: Yalamanchili, Sudhakar. Part of the SMARTech Electronic Thesis and Dissertation Collection.
APA, Harvard, Vancouver, ISO, and other styles
37

Cerqueus, Thomas. "Contributions au problème d'hétérogénéité sémantique dans les systèmes pair-à-pair : application à la recherche d'information." Phd thesis, Université de Nantes, 2012. http://tel.archives-ouvertes.fr/tel-00763914.

Full text
Abstract:
We consider peer-to-peer (P2P) data-sharing systems in which each peer is free to choose the ontology that best fits its needs for representing its data. This situation is known as semantic heterogeneity. It is a major obstacle to interoperability, because queries issued by one peer may not be understood by others. We first focus on the notion of semantic heterogeneity itself and define a set of measures that finely characterise the heterogeneity of a system along several facets. We then define two protocols. The first, CorDis, reduces the semantic heterogeneity caused by disparities between peers: it disseminates correspondences through the system so that peers learn new correspondences. The second, GoOD-TA, reduces the semantic heterogeneity caused by the system's organisation: the goal is to organise the system so that semantically close peers are also close in the overlay, two peers becoming neighbours if they use the same ontology or if many correspondences exist between their respective ontologies. Finally, we propose the DiQuESH algorithm for routing and processing top-k queries in semantically heterogeneous P2P systems, which lets a peer retrieve the k most relevant documents from its neighbourhood. We show experimentally that CorDis and GoOD-TA improve the results obtained by DiQuESH.
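The abstract's idea of measuring a system's semantic heterogeneity can be illustrated with a deliberately simplified metric, the fraction of peer pairs that use different ontologies (a hypothetical sketch, not one of the thesis's actual measures):

```python
from itertools import combinations

def ontology_heterogeneity(peers):
    """Fraction of peer pairs using different ontologies: 0 means a fully
    homogeneous system, 1 means no two peers share an ontology."""
    pairs = list(combinations(peers.values(), 2))
    return sum(a != b for a, b in pairs) / len(pairs)

peers = {"p1": "ontoA", "p2": "ontoA", "p3": "ontoB", "p4": "ontoC"}
# 6 pairs, only (p1, p2) share an ontology -> 5/6
assert abs(ontology_heterogeneity(peers) - 5/6) < 1e-9
```

A protocol like CorDis would aim to lower the effective heterogeneity by spreading correspondences, while GoOD-TA would cluster peers so that neighbouring pairs are more likely to score as "same" under such a measure.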
APA, Harvard, Vancouver, ISO, and other styles
38

Villaça, Rodolfo da Silva 1974. "Hamming DHT e HCube : arquiteturas distribuídas para busca por similaridade." [s.n.], 2013. http://repositorio.unicamp.br/jspui/handle/REPOSIP/261007.

Full text
Abstract:
Advisor: Maurício Ferreira Magalhães
Doctoral thesis - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação
Nowadays, the amount of data available on the Internet exceeds Zettabytes (ZB), a scenario known in the literature as Big Data. Although traditional database solutions are very efficient at finding and retrieving a specific, exact content, they are inefficient in the Big Data scenario, since they were not designed for it and the great majority of such data is unstructured and scattered across the Internet. New database infrastructures are therefore required to support non-exact queries that find and retrieve similar data, i.e., groups of data that share a common meaning. To handle this challenging scenario, this thesis proposes to exploit the Hamming similarity that exists between content identifiers generated with the Random Hyperplane Hashing function. Such identifiers provide the basis for distributed storage infrastructures that efficiently support similarity search. The thesis presents two approaches: Hamming DHT, a P2P solution based on overlay networks, and HCube, a server-based solution for data centers. Evaluations of both solutions show that they reduce the distance between similar content in distributed environments, improving recall in similarity searches.
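A minimal sketch of the Random Hyperplane Hashing idea referred to above: each random hyperplane contributes one bit of the identifier (which side of the plane the vector falls on), so the Hamming distance between identifiers approximates the angular distance between the original vectors. Function names and dimensions are illustrative:

```python
import random

def rhh_signature(vec, hyperplanes):
    # one bit per hyperplane: 1 if the vector lies on its positive side
    return [1 if sum(v * h for v, h in zip(vec, hp)) >= 0 else 0
            for hp in hyperplanes]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

random.seed(42)
dim, bits = 8, 64
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]

v1 = [1.0] * 8
v2 = [1.0] * 7 + [0.9]   # nearly identical direction to v1
v3 = [-1.0] * 8          # opposite direction

s1, s2, s3 = (rhh_signature(v, planes) for v in (v1, v2, v3))
# similar vectors land close in Hamming space, dissimilar ones far apart
assert hamming(s1, s2) < hamming(s1, s3)
```

Storing objects keyed by such signatures is what lets a DHT or server grid place similar content on nearby nodes, so a similarity query only has to visit a small neighbourhood of the identifier space.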
Doutorado
Engenharia de Computação
Doutor em Engenharia Elétrica
APA, Harvard, Vancouver, ISO, and other styles
39

REIS, JUNIOR JOSE S. B. "Métodos e softwares para análise da produção científica e detecção de frentes emergentes de pesquisa." Repositório Institucional do IPEN, 2015. http://repositorio.ipen.br:8080/xmlui/handle/123456789/26929.

Full text
Abstract:
Progress on earlier projects highlighted the need to address the problem of software for detecting emerging research and development trends from databases of scientific publications. A lack of efficient computational applications dedicated to this purpose became evident; such tools are highly useful for better planning of research and development programmes within institutions. A review of currently available software was therefore carried out in order to clearly delineate the opportunity to develop new tools. As a result, an application called Citesnake was implemented, designed specifically to support the detection and study of emerging trends through the analysis of several types of networks extracted from scientific databases. Using this robust and effective computational tool, analyses of emerging research and development fronts were conducted in the area of Generation IV nuclear power systems, in order to identify, among the reactor types selected as most promising by the GIF - Generation IV International Forum, those that have developed most over the last ten years and that currently appear most capable of fulfilling the promises made about their innovative concepts.
Dissertação (Mestrado em Tecnologia Nuclear)
IPEN/D
Instituto de Pesquisas Energéticas e Nucleares - IPEN-CNEN/SP
APA, Harvard, Vancouver, ISO, and other styles
40

Duguépéroux, Joris. "Protection des travailleurs dans les plateformes de crowdsourcing : une perspective technique." Thesis, Rennes 1, 2020. http://www.theses.fr/2020REN1S023.

Full text
Abstract:
This work focuses on protecting workers in a crowdsourcing context. Workers are especially vulnerable in online work, and both surveillance by platforms and lack of regulation are frequently denounced for endangering them. The first contribution addresses workers' privacy on a single platform while still allowing uses of their anonymised data, e.g., for assigning workers to tasks or providing task-design help to requesters. The second contribution considers a multi-platform context and proposes a set of tools with which law-makers can regulate platforms, enforcing limits on interactions in various ways (for instance, limiting work time) while guaranteeing both transparency and privacy. Both approaches rely on technical tools such as cryptography, distribution, and anonymisation, and are accompanied by security proofs and experimental validations. A third, smaller contribution draws attention to a limitation and possible security issue in one of these tools, PIR (Private Information Retrieval), when it is used repeatedly, an issue ignored in state-of-the-art contributions so far.
APA, Harvard, Vancouver, ISO, and other styles
41

Novaes, Tiago Fernandes de Athayde. "Processamento distribuído da consulta espaço-textual top-k." Universidade Estadual de Feira de Santana, 2017. http://localhost:8080/tede/handle/tede/530.

Full text
Abstract:
With the popularization of databases containing objects with spatial and textual information (spatio-textual objects), interest in new queries and techniques for efficiently retrieving these objects has increased. The main such query is the top-k spatio-textual query, which retrieves the k best spatio-textual objects considering the distance from each object to the query location and the textual similarity between the query keywords and the objects' textual information. However, most studies of the top-k spatio-textual query assume centralized environments and do not address real-world problems such as scalability. This dissertation studies different strategies for partitioning the data, and the impact of these partitionings on processing the top-k spatio-textual query in a distributed environment. Every strategy is evaluated in a real distributed environment using real datasets.
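The score behind a top-k spatio-textual query is typically a weighted combination of normalised spatial proximity and textual similarity. The weighting scheme and the toy text similarity below are illustrative assumptions, not the dissertation's exact model:

```python
import heapq, math

def score(obj, q_loc, q_terms, alpha=0.5, max_dist=10.0):
    # spatial proximity, normalised to [0, 1]
    d = math.dist(obj["loc"], q_loc)
    spatial = max(0.0, 1.0 - d / max_dist)
    # toy textual similarity: fraction of query keywords in the object's text
    terms = set(obj["text"].split())
    textual = len(terms & q_terms) / len(q_terms)
    return alpha * spatial + (1 - alpha) * textual

objs = [
    {"id": 1, "loc": (0, 0), "text": "pizza italian restaurant"},
    {"id": 2, "loc": (9, 9), "text": "pizza place"},
    {"id": 3, "loc": (1, 1), "text": "sushi bar"},
]
q_loc, q_terms = (0, 0), {"pizza", "restaurant"}
top2 = heapq.nlargest(2, objs, key=lambda o: score(o, q_loc, q_terms))
assert [o["id"] for o in top2] == [1, 3]
```

In a distributed setting, the partitioning question is where `objs` lives: each partition (spatial, textual, or hybrid) computes its local top-k and a coordinator merges the partial lists, so the choice of partitioning determines how many nodes each query must touch.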
APA, Harvard, Vancouver, ISO, and other styles
42

Gaignard, Alban. "Partage et production de connaissances distribuées dans des plateformes scientifiques collaboratives." Phd thesis, Université de Nice Sophia-Antipolis, 2013. http://tel.archives-ouvertes.fr/tel-00827926.

Full text
Abstract:
This thesis addresses the coherent production and sharing of distributed knowledge in the life sciences. Despite the constant growth of the storage and computing capacities of computing infrastructures, centralised approaches to managing large masses of multi-source scientific data are becoming inadequate for several reasons: (i) they do not guarantee the autonomy of data providers, who must retain some control over the hosted data for ethical and/or legal reasons, and (ii) they do not scale to computational-science platforms, which massively produce scientific data. In the context of the NeuroLOG and VIP collaborative life-science platforms, we address, on the one hand, the distribution and heterogeneity problems underlying the sharing of potentially sensitive resources and, on the other hand, the automatic production of knowledge during the use of these platforms, to ease the exploitation of the mass of data produced. We adopt an ontological approach to knowledge modelling and, building on Semantic Web technologies, propose (i) to extend these platforms with efficient static and dynamic federated semantic querying strategies and (ii) to extend their data-processing environments to automatically annotate the results of in-silico experiments, based on provenance information captured at execution time and on domain-specific inference rules.
The results of this thesis, evaluated on the controlled distributed infrastructure Grid'5000, address three major challenges of collaborative platforms in computational science: (i) a model of secure collaborations and a distributed access-control strategy enabling multi-centric studies in a competitive environment, (ii) semantic experiment summaries that are meaningful to the user and ease navigation through the mass of data produced during experimental campaigns, and (iii) efficient federated querying and reasoning strategies, via Semantic Web standards, to share the knowledge capitalised in these platforms and potentially open it onto the Web of data.
APA, Harvard, Vancouver, ISO, and other styles
43

Moin, Afshin. "Les Techniques De Recommandation Et De Visualisation Pour Les Données A Une Grande Echelle." Phd thesis, Université Rennes 1, 2012. http://tel.archives-ouvertes.fr/tel-00724121.

Full text
Abstract:
The last decade has seen the rapid development of information technology. On the one hand, the processing and storage capacity of digital devices keeps increasing thanks to advances in manufacturing methods; on the other hand, networking technology has made interaction between these powerful devices possible. A natural consequence of this progress is that the volume of data generated by different applications has grown at an unprecedented rate, and we now face new challenges in efficiently processing and representing the enormous mass of data at our disposal. This thesis is centred on two axes: recommending relevant content and visualising it properly. The role of recommender systems is to assist users in the decision-making process, helping them find items with relevant content and satisfactory quality within the vast set of possibilities on the Web. In turn, the proper representation of the processed data is central both to increasing its utility to the end user and to designing efficient analysis tools. This thesis discusses the principal approaches to recommender systems as well as the most important techniques for visualising data as graphs, and shows how some of the techniques applied to recommender systems can be adapted to meet visualisation requirements.
APA, Harvard, Vancouver, ISO, and other styles
44

Elofson, Gregg Steven. "Facilitating knowledge sharing in organizations: Semiautonomous agents that learn to gather, classify, and distribute environmental scanning knowledge." Diss., The University of Arizona, 1989. http://hdl.handle.net/10150/184743.

Full text
Abstract:
Evaluating patterns of indicators is often the first step an organization takes in scanning the environment. Not surprisingly, the experts who evaluate these patterns are not equally adept across all disciplines. While one expert is particularly skilled at recognizing the potential for political turmoil in a foreign nation, another is best at recognizing how Japanese government de-regulation is meant to complement the development of some new product. Moreover, the experts often benefit from one another's skills and knowledge in assessing activity in the environment external to the organization. One problem in this process arises when an expert is unavailable and cannot share his knowledge. Addressing this problem of knowledge sharing, of distributing expertise, is the focus of this dissertation. A technical approach is adopted in this effort--an architecture and a prototype are described that provide the capability of capturing, organizing, and delivering the knowledge used by experts in classifying patterns of qualitative indicators about the business environment. Using a combination of artificial intelligence and machine learning techniques, a collection of objects termed "Apprentices" are employed to do the work of gathering, classifying, and distributing the expertise of knowledge workers in environmental scanning. Furthermore, an archival case study is provided to illustrate the operations of an Apprentice using "real world" data.
APA, Harvard, Vancouver, ISO, and other styles
45

Lee, Chin Siong. "NPS AUV workbench: collaborative environment for autonomous underwater vehicles (AUV) mission planning and 3D visualization." Thesis, Monterey, California. Naval Postgraduate School, 2004. http://hdl.handle.net/10945/1658.

Full text
Abstract:
Approved for public release, distribution is unlimited
... The Extensible Markup Language (XML) is used for data storage and message exchange, Extensible 3D (X3D) Graphics for visualization, and XML Schema-based Binary Compression (XSBC) for data compression. The AUV Workbench provides an intuitive cross-platform-capable tool with extensibility to provide for future enhancements such as agent-based control, asynchronous reporting and communication, loss-free message compression and built-in support for mission data archiving. This thesis also investigates the Jabber instant messaging protocol, showing its suitability for text and file messaging in a tactical environment. Exemplars show that the XML backbone of this open-source technology can be leveraged to enable both human and agent messaging with improvements over current systems. Integrated Jabber instant messaging support makes the NPS AUV Workbench the first custom application supporting XML Tactical Chat (XTC). Results demonstrate that the AUV Workbench provides a capable testbed for diverse AUV technologies, assisting in the development of traditional single-vehicle operations and agent-based multiple-vehicle methodologies. The flexible design of the Workbench further encourages integration of new extensions to serve operational needs. Exemplars demonstrate how in-mission and post-mission event monitoring by human operators can be achieved via a simple web page, standard clients, or a custom instant messaging client. Finally, the AUV Workbench's potential as a tool in the development of multiple-AUV tactics and doctrine is discussed.
Civilian, Singapore Defence Science and Technology Agency
APA, Harvard, Vancouver, ISO, and other styles
46

El, Mahdaouy Abdelkader. "Accès à l'information dans les grandes collections textuelles en langue arabe." Thesis, Université Grenoble Alpes (ComUE), 2017. http://www.theses.fr/2017GREAM091/document.

Full text
Abstract:
Given the amount of Arabic textual information available on the web, developing effective Information Retrieval Systems (IRS) has become essential to retrieve relevant information. Most current Arabic IRSs are based on the bag-of-words representation, where documents are indexed using surface words, roots, or stems. Two main drawbacks of this representation are the ambiguity of Single Word Terms (SWTs) and term mismatch. The aim of this work is to deal with SWT ambiguity and term mismatch. Accordingly, we propose four contributions to improve Arabic content representation, indexing, and retrieval. The first consists of representing Arabic documents using Multi-Word Terms (MWTs), motivated by the fact that MWTs are more precise representational units and less ambiguous than isolated SWTs. We propose a hybrid method to extract Arabic MWTs that combines linguistic and statistical filtering of MWT candidates: the linguistic filter uses POS tagging to identify candidates that fit a set of syntactic patterns and handles the problem of MWT variation, while the statistical filter ranks candidates using our proposed association measure, which combines contextual information with both termhood and unithood measures. In the second contribution, we explore and evaluate several IR models for ranking documents using both SWTs and MWTs. Additionally, we investigate a wide range of proximity-based IR models for Arabic IR, and introduce a formal condition that IR models should satisfy to deal adequately with term dependencies. The third contribution is a method based on distributed representations of words, namely Word Embedding (WE), for Arabic IR. It incorporates WE semantic similarities into existing probabilistic IR models in order to deal with term mismatch, the aim being to allow distinct but semantically similar terms to contribute to document scores.
The last contribution is a method to incorporate WE similarity into Pseudo-Relevance Feedback (PRF) for Arabic information retrieval. The main idea is to select expansion terms using their distribution in the set of top pseudo-relevant documents along with their similarity to the original query terms. The experimental validation of all the proposed contributions is performed using the standard Arabic TREC 2002/2001 collection.
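The core idea of the embedding-based contribution, letting semantically similar terms contribute to a document's score even without an exact match, can be sketched with toy word vectors. The embeddings, threshold, and scoring function below are illustrative assumptions, not the thesis's actual models:

```python
import math

# toy 2-d word vectors; a real system would use trained embeddings
emb = {
    "car":   [0.9, 0.1],
    "auto":  [0.85, 0.15],
    "fruit": [0.1, 0.9],
}

def cos(u, v):
    return sum(a * b for a, b in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))

def soft_match_score(query_terms, doc_terms, threshold=0.8):
    """Each query term contributes its best embedding similarity to any
    document term, so 'auto' can answer a query for 'car' despite the
    absence of an exact term match."""
    total = 0.0
    for q in query_terms:
        best = max(cos(emb[q], emb[d]) for d in doc_terms)
        total += best if best >= threshold else 0.0
    return total

# no exact term overlap in either document, but 'auto' is close to 'car'
assert soft_match_score(["car"], ["auto"]) > soft_match_score(["car"], ["fruit"])
```

The same similarity signal drives the PRF contribution: candidate expansion terms drawn from the top pseudo-relevant documents are weighted by their embedding similarity to the original query terms before being added to the query.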
APA, Harvard, Vancouver, ISO, and other styles
47

Craswell, Nicholas Eric. "Methods for Distributed Information Retrieval." Phd thesis, 2000. http://hdl.handle.net/1885/46255.

Full text
Abstract:
Published methods for distributed information retrieval generally rely on cooperation from search servers. But most real servers, particularly the tens of thousands available on the Web, are not engineered for such cooperation. This means that the majority of methods proposed, and evaluated in simulated environments of homogeneous cooperating servers, are never applied in practice. ¶ This thesis introduces new methods for server selection and results merging. The methods do not require search servers to cooperate, yet are as effective as the best methods which do. Two large experiments evaluate the new methods against many previously published methods. In contrast to previous experiments they simulate a Web-like environment, where servers employ varied retrieval algorithms and tend not to sub-partition documents from a single source. ...
APA, Harvard, Vancouver, ISO, and other styles
48

Viles, Charles L. "Maintaining retrieval effectiveness in distributed, dynamic information retrieval systems." 1996. http://books.google.com/books?id=g8rgAAAAMAAJ.

Full text
APA, Harvard, Vancouver, ISO, and other styles
49

Lu, Zhihong. "Scalable distributed architectures for information retrieval." 1999. https://scholarworks.umass.edu/dissertations/AAI9932326.

Full text
Abstract:
As information explodes across the Internet and intranets, information retrieval (IR) systems must cope with the challenge of scale. How to provide scalable performance for rapidly increasing data and workloads is critical in the design of next-generation information retrieval systems. This dissertation studies scalable distributed IR architectures that not only provide quick response but also maintain acceptable retrieval accuracy. Our distributed architectures exploit parallelism in information retrieval on a cluster of parallel IR servers using symmetric multiprocessors, and use partial collection replication and selection as well as collection selection to restrict the search to a small percentage of data while maintaining retrieval accuracy. We first investigate using partial collection replication for IR systems. We examine query locality in real systems, how to select a partial replica based on relevance, how to load-balance between replicas and the original collection, as well as updating overheads and strategies. Our results show that there exists sufficient query locality to justify partial replication for information retrieval. Our proposed replica selection algorithm effectively selects relevant partial replicas, and is inexpensive to implement. Our evidence also indicates that partial replication achieves better performance than caching queries, because the replica selection algorithm finds similarity between nonidentical queries, and thus increases observed locality. We use a validated simulator to perform a detailed performance evaluation of distributed IR architectures. We explore how best to build parallel IR servers using symmetric multiprocessors, evaluate the performance of partial collection replication and collection selection, and compare the performance of partial collection replication with collection partitioning as well as collection selection. We conclude with experiments on searching a terabyte of text.
We also examine performance changes when we use fewer large servers, faster servers, and longer queries. Our results show that because IR systems have heavy computational and I/O loads, the number of CPUs, disks, and threads must be carefully balanced to achieve scalable performance. Our results show that partial collection replication is much more effective at decreasing the query response time than collection partitioning for a loaded system, even with fewer resources, and it requires only modest query locality. Our results also show that partial collection replication performs better than collection selection when there exists enough query locality, and it performs worse when the collection access is fairly uniform after collection selection. Finally, our results show that replica and collection selection can be combined to provide quick response time for a terabyte of text. Changes to the system configuration do not significantly change the relative improvements due to partial collection replication and collection selection, although they affect the absolute response time.
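The replica-selection idea, routing a query to a partial replica only when it is sufficiently similar to the queries the replica serves well, can be sketched as follows. The Jaccard criterion and the threshold are illustrative assumptions, not the dissertation's algorithm:

```python
def select_replica(query, replicas, threshold=0.3):
    """Pick the partial replica whose logged queries best overlap the new
    query (Jaccard on term sets); fall back to the full collection when no
    replica is similar enough. Exploits locality between nonidentical queries."""
    def jaccard(a, b):
        return len(a & b) / len(a | b)

    best, best_sim = None, 0.0
    for name, logged_queries in replicas.items():
        sim = max((jaccard(set(query), set(q)) for q in logged_queries),
                  default=0.0)
        if sim > best_sim:
            best, best_sim = name, sim
    return best if best_sim >= threshold else "full-collection"

replicas = {
    "replica-sports": [["football", "scores"], ["tennis", "rankings"]],
    "replica-tech":   [["distributed", "retrieval"], ["parallel", "search"]],
}
assert select_replica(["distributed", "search"], replicas) == "replica-tech"
assert select_replica(["gardening"], replicas) == "full-collection"
```

Unlike an exact query cache, similarity-based selection matches a new query to a replica built from merely related queries, which is why the dissertation observes higher effective locality than caching provides.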
APA, Harvard, Vancouver, ISO, and other styles
50

Hawking, David Anthony. "Text retrieval over distributed collections." Phd thesis, 1998. http://hdl.handle.net/1885/147205.

Full text
APA, Harvard, Vancouver, ISO, and other styles