Journal articles on the topic 'Retrieved document sets'


Consult the top 50 journal articles for your research on the topic 'Retrieved document sets.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Fosci, Paolo, and Giuseppe Psaila. "Towards Flexible Retrieval, Integration and Analysis of JSON Data Sets through Fuzzy Sets: A Case Study." Information 12, no. 7 (June 22, 2021): 258. http://dx.doi.org/10.3390/info12070258.

Full text
Abstract:
How to exploit the incredible variety of JSON data sets currently available on the Internet, for example, on Open Data portals? The traditional approach would require getting them from the portals, then storing them into some JSON document store and integrating them within the document store. However, once data are integrated, the lack of a query language that provides flexible querying capabilities could prevent analysts from successfully completing their analysis. In this paper, we show how the J-CO Framework, a novel framework that we developed at the University of Bergamo (Italy) to manage large collections of JSON documents, is a unique and innovative tool that provides analysts with querying capabilities based on fuzzy sets over JSON data sets. Its query language, called J-CO-QL, is continuously evolving to increase potential applications; the most recent extensions give analysts the capability to retrieve data sets directly from web portals as well as constructs to apply fuzzy set theory to JSON documents and to provide analysts with the capability to perform imprecise queries on documents by means of flexible soft conditions. This paper presents a practical case study in which real data sets are retrieved, integrated and analyzed to effectively show the unique and innovative capabilities of the J-CO Framework.
2

VILLATORO, ESAÚ, ANTONIO JUÁREZ, MANUEL MONTES, LUIS VILLASEÑOR, and L. ENRIQUE SUCAR. "Document ranking refinement using a Markov random field model." Natural Language Engineering 18, no. 2 (March 14, 2012): 155–85. http://dx.doi.org/10.1017/s1351324912000010.

Full text
Abstract:
This paper introduces a novel ranking refinement approach based on relevance feedback for the task of document retrieval. We focus on the problem of ranking refinement since recent evaluation results from Information Retrieval (IR) systems indicate that current methods are effective at retrieving most of the relevant documents for different sets of queries, but they have severe difficulties in generating a pertinent ranking of them. Motivated by these results, we propose a novel method to re-rank the list of documents returned by an IR system. The proposed method is based on a Markov Random Field (MRF) model that classifies the retrieved documents as relevant or irrelevant. The proposed MRF combines: (i) information provided by the base IR system, (ii) similarities among documents in the retrieved list, and (iii) relevance feedback information. Thus, the problem of ranking refinement is reduced to that of minimising an energy function that represents a trade-off between document relevance and inter-document similarity. Experiments were conducted using resources from four different tasks of the Cross Language Evaluation Forum (CLEF) as well as from one task of the Text Retrieval Conference (TREC). The obtained results show the feasibility of the method for re-ranking documents in IR and also depict an improvement in mean average precision compared to a state-of-the-art retrieval machine.
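
The energy-minimisation idea in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the unary/pairwise terms, the lambda trade-off parameter, the score centring and the greedy ICM-style update below are illustrative assumptions.

```python
# Illustrative sketch of MRF-style ranking refinement (not the authors' implementation).
# Each document i has a base retrieval score s[i] and pairwise similarities sim[i][j];
# binary labels x[i] mark documents as irrelevant/relevant, and the energy trades off
# the base evidence against label disagreement between similar documents.
import numpy as np

def energy(x, s, sim, lam=0.5):
    """E(x) = -sum_i x_i * (s_i - mean(s)) + lam * sum_{i<j} sim_ij * [x_i != x_j]."""
    unary = -np.sum(x * (s - s.mean()))
    disagree = np.not_equal.outer(x, x).astype(float)
    return unary + lam * np.sum(np.triu(sim * disagree, k=1))

def refine_ranking(s, sim, lam=0.5, iters=10):
    """Greedy ICM-style minimisation, then re-rank: relevant labels first, base score second."""
    x = (s > np.median(s)).astype(int)          # crude initialisation from the base ranking
    for _ in range(iters):
        for i in range(len(s)):
            for v in (0, 1):
                trial = x.copy()
                trial[i] = v
                if energy(trial, s, sim, lam) < energy(x, s, sim, lam):
                    x = trial
    return np.lexsort((-s, -x)), x              # refined document order, final labels

# toy example: document 3 has a low base score but is very similar to the top document,
# so it is relabelled relevant and moves above document 2 in the refined ranking
s = np.array([0.9, 0.6, 0.2, 0.15])
sim = np.array([[0.00, 0.05, 0.05, 0.80],
                [0.05, 0.00, 0.05, 0.05],
                [0.05, 0.05, 0.00, 0.05],
                [0.80, 0.05, 0.05, 0.00]])
print(refine_ranking(s, sim))
```
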
3

Sunita, B., and T. John Peter. "Analysis of Various Multilingual Document Clustering." Journal of Computational and Theoretical Nanoscience 17, no. 9 (July 1, 2020): 3921–26. http://dx.doi.org/10.1166/jctn.2020.8989.

Full text
Abstract:
Today's world is heading towards the data science era. The volume of data is increasing at an exponential rate, and data are produced and circulated all over the world, not only in English but in every regional language too. Since the data are multilingual, it is extremely difficult to manage such a huge amount of variant data; hence there is scope for research work on multilingual document clustering (MDC). Document clustering lets us retrieve the information relevant to a user query by dividing a given set of documents into a certain number of clusters. The aim is to create multilingual document clusters that are related internally but substantially different from each other. The main challenge faced while creating an MDC system is the quality and stability of the clusters, which change rapidly with the document sets. Multilingual documents have to be represented in the form of a matrix, which is done by either the vector space model or the TF–IDF method, where each word is given a value representing a particular document. News articles are retrieved from relevant news sources using appropriate search engines, and the words in the news articles are represented as one-dimensional mathematical vectors. Keywords in every article are selected using term frequency (tf) and evaluated with inverse document frequency (idf), which uses the discriminative power of keywords over articles. This encourages users to opt for cluster-based browsing, which is well suited for processing the results. Big Data tools work efficiently in distributed environments, which enables significant analysis of the retrieved information. Many document clustering methods work well with small data sets but fail to deal with large document collections. In this paper we concentrate on bringing out the various problems that arise during multilingual document clustering and possible solutions to overcome those problems.
4

Jayasudha, R., S. Subramanian, and L. Sivakumar. "Genetic Algorithm and PSO Based Intelligent Software Reuse." Applied Mechanics and Materials 573 (June 2014): 612–17. http://dx.doi.org/10.4028/www.scientific.net/amm.573.612.

Full text
Abstract:
Software reuse can improve the development time, cost and quality of software artifacts. The storage of artifacts plays an important role in the easy retrieval of the needed components according to the requirements. In this paper, considerable attention is given to the retrieval of relevant components from an ontology-based repository. Two well-known evolutionary algorithms, the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), are used for the extraction of the needed components, each applied separately to component retrieval. The Genetic Algorithm is best suited when the repository contains a larger number of relevant components, while PSO-based component search is best suited when the query must be highly refined to obtain more relevant documents; PSO is mainly used for query expansion. The two methods are then combined: first the retrieved set of components is organized with the help of the GA, and then PSO is applied for the best query expansion. Thus the two methods are combined to achieve the best precision and retrieval time for different sets of requirement queries.
5

Kumaravel, Girthana, and Swamynathan Sankaranarayanan. "PQPS: Prior-Art Query-Based Patent Summarizer Using RBM and Bi-LSTM." Mobile Information Systems 2021 (December 28, 2021): 1–19. http://dx.doi.org/10.1155/2021/2497770.

Full text
Abstract:
A prior-art search on patents ascertains the patentability constraints of the invention through an organized review of prior-art document sources. This search technique poses challenges because of the inherent vocabulary mismatch problem. Manual processing of every retrieved relevant patent in its entirety is a tedious and time-consuming job that demands automated patent summarization for ease of access. This paper employs deep learning models for summarization as they take advantage of the massive dataset present in the patents to improve the summary coherence. This work presents a novel approach of patent summarization named PQPS: prior-art query-based patent summarizer using restricted Boltzmann machine (RBM) and bidirectional long short-term memory (Bi-LSTM) models. The PQPS also addresses the vocabulary mismatch problem through query expansion with knowledge bases such as domain ontology and WordNet. It further enhances the retrieval rate through topic modeling and bibliographic coupling of citations. The experiments analyze various interlinked smart device patent sample sets. The proposed PQPS demonstrates that retrievability increases both in extractive and abstractive summaries.
6

Yogish, Deepa, T. N. Manjunath, and Ravindra S. Hegadi. "Analysis of Vector Space Method in Information Retrieval for Smart Answering System." Journal of Computational and Theoretical Nanoscience 17, no. 9 (July 1, 2020): 4468–72. http://dx.doi.org/10.1166/jctn.2020.9099.

Full text
Abstract:
In the world of the internet, searching plays a vital role in retrieving relevant answers for user-specific queries. The most promising application of natural language processing and information retrieval is the question answering system, which directly provides the accurate answer instead of a set of documents. The main objective of information retrieval is to retrieve relevant documents from the huge volume of data sets underlying the internet using an appropriate model. Many models have been proposed for the retrieval process, such as the Boolean, vector space and probabilistic methods. The vector space model is the best method in information retrieval for document ranking, with an efficient document representation that combines simplicity and clarity. The VSM adopts a similarity function to measure the match between documents and user intent and assigns scores from largest to smallest. The documents and the query are assigned weights using the term frequency and inverse document frequency method. To retrieve the documents most relevant to the user query term, the cosine similarity ranking function is applied to every document and the user query. The documents with higher similarity scores are considered relevant to the query term and are ranked based on these scores. This paper discusses different techniques of information retrieval and shows that the vector space model offers a realistic compromise in IR processing. It allows the best weighting scheme, which ranks the set of documents in order of relevance based on the user query.
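
As a minimal sketch of the TF-IDF weighting and cosine-ranking pipeline described above (the corpus, query and use of scikit-learn are illustrative assumptions, not taken from the paper):

```python
# Minimal TF-IDF + cosine-similarity ranking sketch (illustrative, not from the paper).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "information retrieval ranks documents for a user query",
    "vector space model weights terms with tf idf",
    "question answering returns an answer instead of documents",
]
query = ["rank documents for a query with tf idf weights"]

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(docs)          # documents -> TF-IDF term-weight matrix
query_vec = vectorizer.transform(query)            # query represented in the same term space

scores = cosine_similarity(query_vec, doc_vecs).ravel()
ranking = scores.argsort()[::-1]                   # most similar document first
for rank, idx in enumerate(ranking, 1):
    print(rank, round(float(scores[idx]), 3), docs[idx])
```
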
7

Marijan, Robert, and Robert Leskovar. "A library’s information retrieval system (In)effectiveness: case study." Library Hi Tech 33, no. 3 (September 21, 2015): 369–86. http://dx.doi.org/10.1108/lht-07-2015-0071.

Full text
Abstract:
Purpose – The purpose of this paper is to evaluate the effectiveness of the information retrieval component of a daily newspaper publisher’s integrated library system (ILS) in comparison with the open source alternatives and observe the impact of the scale of metadata, generated daily by library administrators, on retrieved result sets. Design/methodology/approach – In Experiment 1, the authors compared the result sets of the information retrieval system (IRS) component of the publisher’s current ILS and the result sets of proposed ones with human-assessed relevance judgment set. In Experiment 2, the authors compared the performance of proposed IRS components with the publisher’s current production IRS, using result sets of current IRS classified as relevant. Both experiments were conducted using standard information retrieval (IR) evaluation methods: precision, recall, precision at k, F-measure, mean average precision and 11-point interpolated average precision. Findings – Results showed that: first, in Experiment 1, the publisher’s current production ILS ranked last of all participating IRSs when compared to a relevance document set classified by the senior library administrator; and second, in Experiment 2, the tested IR components’ request handlers that used only automatically generated metadata performed slightly better than request handlers that used all of the metadata fields. Therefore, regarding the effectiveness of IR, the daily human effort of generating the publisher’s current set of metadata attributes is unjustified. Research limitations/implications – The experiments’ collections contained Slovene language with large number of variations of the forms of nouns, verbs and adjectives. The results could be different if the experiments’ collections contained languages with different grammatical properties. Practical implications – The authors have confirmed, using standard IR methods, that the IR component used in the publisher’s current ILS, could be adequately replaced with an open source component. Based on the research, the publisher could incorporate the suggested open source IR components in practice. In the research, the authors have described the methods that can be used by libraries for evaluating the effectiveness of the IR of their ILSs. Originality/value – The paper provides a framework for the evaluation of an ILS’s IR effectiveness for libraries. Based on the evaluation results, the libraries could replace the IR components if their current information system setup allows it.
8

Wang, Yanshan, In-Chan Choi, and Hongfang Liu. "Generalized ensemble model for document ranking in information retrieval." Computer Science and Information Systems 14, no. 1 (2017): 123–51. http://dx.doi.org/10.2298/csis160229042w.

Full text
Abstract:
A generalized ensemble model (gEnM) for document ranking is proposed in this paper. The gEnM linearly combines the document retrieval models and tries to retrieve relevant documents at high positions. In order to obtain the optimal linear combination of multiple document retrieval models or rankers, an optimization program is formulated by directly maximizing the mean average precision. Both supervised and unsupervised learning algorithms are presented to solve this program. For the supervised scheme, two approaches are considered based on the data setting, namely batch and online setting. In the batch setting, we propose a revised Newton's algorithm, gEnM.BAT, by approximating the derivative and Hessian matrix. In the online setting, we advocate a stochastic gradient descent (SGD) based algorithm, gEnM.ON. As for the unsupervised scheme, an unsupervised ensemble model (UnsEnM) by iteratively co-learning from each constituent ranker is presented. Experimental study on benchmark data sets verifies the effectiveness of the proposed algorithms. Therefore, with appropriate algorithms, the gEnM is a viable option in diverse practical information retrieval applications.
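
The linear-combination idea can be sketched as follows; the ranker scores, weights and the use of average precision as the target metric are invented for illustration, and the gEnM optimisation algorithms themselves are not reproduced here.

```python
# Sketch of a linear ensemble of rankers scored by average precision (illustrative only).
import numpy as np

def average_precision(ranked_ids, relevant):
    hits, ap = 0, 0.0
    for rank, doc in enumerate(ranked_ids, 1):
        if doc in relevant:
            hits += 1
            ap += hits / rank
    return ap / max(len(relevant), 1)

def ensemble_rank(score_matrix, weights):
    """score_matrix[r][d] is the score of document d under ranker r."""
    combined = np.asarray(weights) @ np.asarray(score_matrix, dtype=float)
    return list(np.argsort(-combined))

# three rankers over five documents; documents 1 and 3 are the relevant ones
scores = [[0.9, 0.2, 0.1, 0.4, 0.3],
          [0.5, 0.6, 0.2, 0.7, 0.1],
          [0.3, 0.1, 0.8, 0.6, 0.2]]
relevant = {1, 3}
for w in ([1.0, 0.0, 0.0], [0.4, 0.4, 0.2]):       # single ranker vs. weighted ensemble
    ranking = ensemble_rank(scores, w)
    print(w, ranking, round(average_precision(ranking, relevant), 3))
```

In this toy example the weighted combination lifts the second relevant document above where the single best ranker places it, which is the kind of effect the optimisation targets.
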
9

Nita, Stefania Loredana. "Secure Document Search in Cloud Computing using MapReduce." Scientific Bulletin of Naval Academy XXIII, no. 1 (July 15, 2020): 231–35. http://dx.doi.org/10.21279/1454-864x-20-i1-031.

Full text
Abstract:
Nowadays, cloud computing is an important technology which is part of our daily lives. Moving to the cloud brings several benefits: creating new applications, storing large sets of data, and processing large amounts of data. Individual users or companies can store their own data in the cloud (e.g., maritime, environmental protection, or physics analysis data). An important requirement before storing data in the cloud is that the data need to be encrypted in order to keep them confidential. In particular, users can store encrypted documents in the cloud. However, when the owner needs a specific document, they would have to retrieve all documents from the cloud, decrypt them, choose the desired document, encrypt everything again and finally store the encrypted documents back in the cloud. To avoid all of these steps, a user can choose to work with searchable encryption. This is an encryption technique in which keywords (or indexes) are associated with encrypted documents; when the owner needs a document, they only need to search through the keywords and then retrieve the documents associated with the desired keywords. An important programming paradigm for cloud computing is MapReduce, which allows high scalability on a large number of servers in a cluster. Basically, MapReduce works with (key, value) pairs. In the current study, we describe a new technique through which a user can extract encrypted documents stored on cloud servers based on keywords, using searchable encryption and MapReduce.
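
A toy sketch of the idea, assuming a MapReduce-style map/reduce split and deterministic HMAC trapdoors in place of a full searchable-encryption scheme; this is not the paper's construction and omits document encryption and all security hardening.

```python
# Toy keyword index built MapReduce-style over "encrypted" documents.
# Keywords are replaced by deterministic HMAC trapdoors so the index never stores
# plaintext terms; the server answers searches without learning the keyword.
import hmac, hashlib
from collections import defaultdict

KEY = b"data-owner-secret"           # known only to the data owner (illustrative)

def trapdoor(word):
    return hmac.new(KEY, word.lower().encode(), hashlib.sha256).hexdigest()

def map_phase(doc_id, text):
    """Mapper: emit one (token, doc_id) pair per keyword in the document."""
    return [(trapdoor(w), doc_id) for w in set(text.split())]

def reduce_phase(pairs):
    """Reducer: group document ids by token to form the searchable index."""
    index = defaultdict(set)
    for token, doc_id in pairs:
        index[token].add(doc_id)
    return index

docs = {1: "maritime sensor data", 2: "environmental protection report", 3: "maritime physics analysis"}
pairs = [p for doc_id, text in docs.items() for p in map_phase(doc_id, text)]
index = reduce_phase(pairs)

# the owner searches by sending trapdoor("maritime"); the server returns matching ids
print(index[trapdoor("maritime")])
```
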
10

Bhari, Purushottam, Abhishek Dadhich, and Vikram Khandelwal. "An Approach for Improving Similarity Measure Using Fuzzy Logic." ECS Transactions 107, no. 1 (April 24, 2022): 20213–33. http://dx.doi.org/10.1149/10701.20213ecst.

Full text
Abstract:
An information retrieval system stores and indexes documents such that when users submit a query, the system retrieves relevant documents and assigns a score to each one. The higher the score, the more important the document is. IR systems typically yield vast result sets, and users must spend a significant amount of time sifting through them to identify the elements that are genuinely important. Different suggestions for applying evolutionary computing to the topic of information retrieval are reviewed from the specialist literature. To do so, researchers looked at a variety of IR issues that were addressed using evolutionary algorithms. Some of the current approaches are described in detail; for example, when dealing with specialized domain knowledge, the challenge can be addressed by embedding into existing information retrieval systems a knowledge base that illustrates the relationships between index words. Fuzzy set theory may be used to adapt the knowledge in such bases to cope with the ambiguity that is typical of human knowledge. In this work, a novel way of implementing a similarity measure utilizing fuzzy logic for IR is provided. The suggested similarity metric is based on many IR system attributes that boost IR system performance. This method's strength is that it can extract the majority of a document's characteristics. Fuzzy rules, which translate domain knowledge into fuzzy sets, were also designed to make this most effective. Our suggested similarity metric is validated using the CACM and CRAN benchmark datasets.
11

Al Sibahee, Mustafa A., Ayad I. Abdulsada, Zaid Ameen Abduljabbar, Junchao Ma, Vincent Omollo Nyangaresi, and Samir M. Umran. "Lightweight, Secure, Similar-Document Retrieval over Encrypted Data." Applied Sciences 11, no. 24 (December 17, 2021): 12040. http://dx.doi.org/10.3390/app112412040.

Full text
Abstract:
Applications for document similarity detection are widespread in diverse communities, including institutions and corporations. However, currently available detection systems fail to take into account the private nature of material or documents that have been outsourced to remote servers. None of the existing solutions can be described as lightweight techniques that are compatible with lightweight client implementation, and this deficiency can limit the effectiveness of these systems. For instance, the discovery of similarity between two conferences or journals must maintain the privacy of the submitted papers in a lightweight manner to ensure that the security and application requirements for limited-resource devices are fulfilled. This paper considers the problem of lightweight similarity detection between document sets while preserving the privacy of the material. The proposed solution permits documents to be compared without disclosing the content to untrusted servers. The fingerprint set for each document is determined in an efficient manner, also developing an inverted index that uses the whole set of fingerprints. Before being uploaded to the untrusted server, this index is secured by the Paillier cryptosystem. This study develops a secure, yet efficient method for scalable encrypted document comparison. To evaluate the computational performance of this method, this paper carries out several comparative assessments against other major approaches.
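
A rough sketch of the fingerprinting and inverted-index side of such a scheme, with the Paillier encryption layer deliberately left out; the shingle length, hash truncation and Jaccard comparison are illustrative choices, not the paper's.

```python
# Fingerprint-based document comparison: character k-gram shingles hashed into a
# fingerprint set, plus an inverted index over fingerprints (illustrative sketch only).
import hashlib

def fingerprints(text, k=5):
    """Hash every k-gram of the normalised text into a short fingerprint."""
    text = " ".join(text.lower().split())
    grams = {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}
    return {hashlib.sha1(g.encode()).hexdigest()[:8] for g in grams}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def build_inverted_index(docs):
    index = {}
    for doc_id, text in docs.items():
        for fp in fingerprints(text):
            index.setdefault(fp, set()).add(doc_id)
    return index

docs = {
    "d1": "lightweight similarity detection over encrypted documents",
    "d2": "similarity detection over encrypted documents in the cloud",
    "d3": "sea ice concentration retrieval from passive microwave data",
}
index = build_inverted_index(docs)
print(round(jaccard(fingerprints(docs["d1"]), fingerprints(docs["d2"])), 3))  # near-duplicates
print(round(jaccard(fingerprints(docs["d1"]), fingerprints(docs["d3"])), 3))  # unrelated
```
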
12

Kern, Stefan, Thomas Lavergne, Dirk Notz, Leif Toudal Pedersen, Rasmus Tage Tonboe, Roberto Saldo, and Atle MacDonald Sørensen. "Satellite passive microwave sea-ice concentration data set intercomparison: closed ice and ship-based observations." Cryosphere 13, no. 12 (December 10, 2019): 3261–307. http://dx.doi.org/10.5194/tc-13-3261-2019.

Full text
Abstract:
Abstract. We report on results of a systematic inter-comparison of 10 global sea-ice concentration (SIC) data products at 12.5 to 50.0 km grid resolution for both the Arctic and the Antarctic. The products are compared with each other with respect to differences in SIC, sea-ice area (SIA), and sea-ice extent (SIE), and they are compared against a global wintertime near-100 % reference SIC data set for closed pack ice conditions and against global year-round ship-based visual observations of the sea-ice cover. We can group the products based on the concept of their SIC retrieval algorithms. Group I consists of data sets using the self-optimizing EUMETSAT OSI SAF and ESA CCI algorithms. Group II includes data using the Comiso bootstrap algorithm and the NOAA NSIDC sea-ice concentration climate data record (CDR). The standard NASA Team and the ARTIST Sea Ice (ASI) algorithms are put into group III, and NASA Team 2 is the only element of group IV. The three CDRs of group I (SICCI-25km, SICCI-50km, and OSI-450) are biased low compared to a 100 % reference SIC data set with biases of −0.4 % to −1.0 % (Arctic) and −0.3 % to −1.1 % (Antarctic). Products of group II appear to be mostly biased high in the Arctic by between +1.0 % and +3.5 %, while their biases in the Antarctic range from −0.2 % to +0.9 %. Group III product biases are different for the Arctic, +0.9 % (NASA Team) and −3.7 % (ASI), but similar for the Antarctic, −5.4 % and −5.6 %, respectively. The standard deviation is smaller in the Arctic for the quoted group I products (1.9 % to 2.9 %) and Antarctic (2.5 % to 3.1 %) than for group II and III products: 3.6 % to 5.0 % for the Arctic and 4.0 % to 6.5 % for the Antarctic. We refer to the paper to understand why we could not give values for group IV here. We discuss the impact of truncating the SIC distribution, as naturally retrieved by the algorithms around the 100 % sea-ice concentration end. We show that evaluation studies of such truncated SIC products can result in misleading statistics and favour data sets that systematically overestimate SIC. We describe a method to reconstruct the non-truncated distribution of SIC before the evaluation is performed. On the basis of this evaluation, we open a discussion about the overestimation of SIC in data products, with far-reaching consequences for surface heat flux estimations in winter. We also document inconsistencies in the behaviour of the weather filters used in products of group II, and we suggest advancing studies about the influence of these weather filters on SIA and SIE time series and their trends.
13

Yan, Meichao, Yu Wen, Qingxuan Shi, and Xuedong Tian. "A Multimodal Retrieval and Ranking Method for Scientific Documents Based on HFS and XLNet." Scientific Programming 2022 (January 4, 2022): 1–11. http://dx.doi.org/10.1155/2022/5373531.

Full text
Abstract:
Aiming at the defects of traditional full-text retrieval models in dealing with mathematical expressions, which are special objects different from ordinary texts, a multimodal retrieval and ranking method for scientific documents based on hesitant fuzzy sets (HFS) and XLNet is proposed. This method integrates multimodal information, such as mathematical expression images and context text, as keywords to realize the retrieval of scientific documents. In the image modal, the images of mathematical expressions are recognized, and the hesitancy fuzzy set theory is introduced to calculate the hesitancy fuzzy similarity between mathematical query expressions and the mathematical expressions in candidate scientific documents. Meanwhile, in the text mode, XLNet is used to generate word vectors of the mathematical expression context to obtain the similarity between the query text and the mathematical expression context of the candidate scientific documents. Finally, the multimodal evaluation is integrated, and the hesitation fuzzy set is constructed at the document level to obtain the final scores of the scientific documents and corresponding ranked output. The experimental results show that the recall and precision of this method are 0.774 and 0.663 on the NTCIR dataset, respectively, and the average normalized discounted cumulative gain (NDCG) value of the top-10 ranking results is 0.880 on the Chinese scientific document (CSD) dataset.
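
The hesitant-fuzzy similarity step can be illustrated with a common normalised-distance formulation; the padding rule and the particular measure below are assumptions, not necessarily the paper's exact definitions.

```python
# Illustrative hesitant-fuzzy similarity between a query expression and candidate
# expressions: 1 minus the mean absolute difference of sorted membership values,
# padding the shorter hesitant element with its own minimum (a pessimistic choice).

def hfs_similarity(h1, h2):
    n = max(len(h1), len(h2))
    a = sorted(list(h1) + [min(h1)] * (n - len(h1)))
    b = sorted(list(h2) + [min(h2)] * (n - len(h2)))
    return 1.0 - sum(abs(x - y) for x, y in zip(a, b)) / n

query_expr = [0.8, 0.6]                      # hesitant memberships for the query expression
candidates = {"doc1": [0.7, 0.6, 0.5], "doc2": [0.2, 0.1]}
ranked = sorted(candidates, key=lambda d: hfs_similarity(query_expr, candidates[d]), reverse=True)
print([(d, round(hfs_similarity(query_expr, candidates[d]), 3)) for d in ranked])
```
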
14

Marković, Marko, and Stevan Gostojić. "Open Judicial Data: A Comparative Analysis." Social Science Computer Review 38, no. 3 (May 20, 2018): 295–314. http://dx.doi.org/10.1177/0894439318770744.

Full text
Abstract:
Open data gained considerable traction in government, nonprofit, and profit organizations in the last several years. Open judicial data increase transparency of the judiciary and are an integral part of open justice. This article identifies relevant judicial data set types, reviews widely used open government data evaluation methodologies, selects a methodology for evaluating judicial data sets, uses the methodology to evaluate openness of judicial data sets in chosen countries, and suggests actions to improve efficiency and effectiveness of open data initiatives. Our findings show that judicial data sets should at least include court decisions, case registers, filed document records, and statistical data. The Global Open Data Index methodology is the most suitable for the task. We suggest considering actions to enable more effective and efficient opening of judicial data sets, including publishing legal documents and legal data in standardized machine-readable formats, assigning standardized metadata to the published documents and data sets, providing both programmable and bulk access to documents and data, explicitly publishing licenses which apply to them in a machine-readable format, and introducing a centralized portal enabling retrieval and browsing of open data sets from a single source.
15

El, Barbary. "Document classification in information retrieval system based on neutrosophic sets." Filomat 34, no. 1 (2020): 89–97. http://dx.doi.org/10.2298/fil2001089b.

Full text
16

Rashaideh, Hasan, Ahmad Sawaie, Mohammed Azmi Al-Betar, Laith Mohammad Abualigah, Mohammed M. Al-laham, Ra’ed M. Al-Khatib, and Malik Braik. "A Grey Wolf Optimizer for Text Document Clustering." Journal of Intelligent Systems 29, no. 1 (July 21, 2018): 814–30. http://dx.doi.org/10.1515/jisys-2018-0194.

Full text
Abstract:
Abstract Text clustering problem (TCP) is a leading process in many key areas such as information retrieval, text mining, and natural language processing. This presents the need for a potent document clustering algorithm that can be used effectively to navigate, summarize, and arrange information to congregate large data sets. This paper encompasses an adaptation of the grey wolf optimizer (GWO) for TCP, referred to as TCP-GWO. The TCP demands a degree of accuracy beyond that which is possible with metaheuristic swarm-based algorithms. The main issue to be addressed is how to split text documents on the basis of GWO into homogeneous clusters that are sufficiently precise and functional. Specifically, TCP-GWO, or referred to as the document clustering algorithm, used the average distance of documents to the cluster centroid (ADDC) as an objective function to repeatedly optimize the distance between the clusters of the documents. The accuracy and efficiency of the proposed TCP-GWO was demonstrated on a sufficiently large number of documents of variable sizes, documents that were randomly selected from a set of six publicly available data sets. Documents of high complexity were also included in the evaluation process to assess the recall detection rate of the document clustering algorithm. The experimental results for a test set of over a part of 1300 documents showed that failure to correctly cluster a document occurred in less than 20% of cases with a recall rate of more than 65% for a highly complex data set. The high F-measure rate and ability to cluster documents in an effective manner are important advances resulting from this research. The proposed TCP-GWO method was compared to the other well-established text clustering methods using randomly selected data sets. Interestingly, TCP-GWO outperforms the comparative methods in terms of precision, recall, and F-measure rates. In a nutshell, the results illustrate that the proposed TCP-GWO is able to excel compared to the other comparative clustering methods in terms of measurement criteria, whereby more than 55% of the documents were correctly clustered with a high level of accuracy.
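
The ADDC objective that the wolves optimise can be sketched as follows; the toy vectors and candidate centroids are invented, and the grey wolf search itself is not reproduced here.

```python
# Sketch of the ADDC fitness (average distance of documents to their nearest cluster
# centroid); candidate centroid sets play the role of "wolves" in the search.
import numpy as np

def addc(doc_vectors, centroids):
    """Mean distance from each document to its nearest centroid (lower is better)."""
    return float(np.mean([min(np.linalg.norm(d - c) for c in centroids) for d in doc_vectors]))

rng = np.random.default_rng(0)
docs = rng.random((20, 50))                              # 20 toy document vectors over 50 terms
good_candidate = docs[rng.choice(20, 3, replace=False)]  # 3 centroids picked from the data
poor_candidate = np.zeros((3, 50))                       # a deliberately bad centroid set
print(round(addc(docs, good_candidate), 3), round(addc(docs, poor_candidate), 3))
```
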
17

Göksel, Gökhan, Ahmet Arslan, and Bekir Taner Dinçer. "A selective approach to stemming for minimizing the risk of failure in information retrieval systems." PeerJ Computer Science 9 (January 10, 2023): e1175. http://dx.doi.org/10.7717/peerj-cs.1175.

Full text
Abstract:
Stemming is supposed to improve the average performance of an information retrieval system, but in practice, past experimental results show that this is not always the case. In this article, we propose a selective approach to stemming that decides whether stemming should be applied or not on a query basis. Our method aims at minimizing the risk of failure caused by stemming in retrieving semantically-related documents. The proposed work mainly contributes to the IR literature by proposing an application of selective stemming and a set of new features that derived from the term frequency distributions of the systems in selection. The method based on the approach leverages both some of the query performance predictors and the derived features and a machine learning technique. It is comprehensively evaluated using three rule-based stemmers and eight query sets corresponding to four document collections from the standard TREC and NTCIR datasets. The document collections, except for one, include Web documents ranging from 25 million to 733 million. The results of the experiments show that the method is capable of making accurate selections that increase the robustness of the system and minimize the risk of failure (i.e., per query performance losses) across queries. The results also show that the method attains a systematically higher average retrieval performance than the single systems for most query sets.
18

Song, Min, Xiaohua Hu, Illhoi Yoo, and Eric Koppel. "A Dynamic and Semantically-Aware Technique for Document Clustering in Biomedical Literature." International Journal of Data Warehousing and Mining 5, no. 4 (October 2009): 44–57. http://dx.doi.org/10.4018/jdwm.2009080703.

Full text
Abstract:
As an unsupervised learning process, document clustering has been used to improve information retrieval performance by grouping similar documents and to help text mining approaches by providing a high-quality input for them. In this article, the authors propose a novel hybrid clustering technique that incorporates semantic smoothing of document models into a neural network framework. Recently, it has been reported that the semantic smoothing model enhances the retrieval quality in Information Retrieval (IR). Inspired by that, the authors developed and applied a context-sensitive semantic smoothing model to boost accuracy of clustering that is generated by a dynamic growing cell structure algorithm, a variation of the neural network technique. They evaluated the proposed technique on biomedical article sets from MEDLINE, the largest biomedical digital library in the world. Their experimental evaluations show that the proposed algorithm significantly improves the clustering quality over the traditional clustering techniques including k-means and self-organizing map (SOM).
19

Eswaraiah, Poluru, and Hussain Syed. "An efficient ontology model with query execution for accurate document content extraction." Indonesian Journal of Electrical Engineering and Computer Science 29, no. 2 (February 1, 2023): 981. http://dx.doi.org/10.11591/ijeecs.v29.i2.pp981-989.

Full text
Abstract:
The technique of extracting important documents from massive data collections is known as information retrieval (IR). The dataset provider, coupled with the increasing demand for high-quality retrieval results, has resulted in traditional information retrieval approaches being increasingly insufficient to meet the challenge of providing high-quality search results. Research has concentrated on information retrieval and interactive query formation through ontologies in order to overcome these challenges, with a specific emphasis on enhancing the functionality between information and search queries in order to bring the outcome sets closer to the research requirements of users. In the context of document retrieval technologies, it is a process that assists researchers in extracting documents from data collections. This research discusses how to use ontology-based information retrieval approaches and techniques, taking into account the issues of ontology modelling, processing, and the transformation of ontological knowledge into database search queries. In this research work, an efficient optimized ontology model with query execution for content extraction from documents (OOM-QE-CE) is proposed. The existing ontology-to-database transformation and mapping methodologies are also thoroughly examined in terms of data and semantic loss, structural mapping and domain knowledge applicability.
20

Haji, Chiai Mohammed. "Linguistic Analysis on Cursive Characters." Journal of duhok university 25, no. 2 (November 9, 2022): 33–40. http://dx.doi.org/10.26682/sjuod.2022.25.2.3.

Full text
Abstract:
Document analysis has major importance in information retrieval systems. With vaults of paper and material documents to protect, each document needs to be properly curated and processed so that very important information and its summaries are preserved without losing their meaning and importance. Ancient written documents contain many types of cursive character sets, which make it very tedious to discriminate the characters and, subsequently, the right meaning. To overcome the difficulties of reading cursive language characters and to prevent misunderstanding the meaning and importance of documents, an improvised CNN [6] model working with an OCR and Tesseract API is proposed in this work. The documents are scanned, curated and preprocessed in the form of images. CNNs are among the best algorithms in the existing AI and deep learning arena, and a CNN with an OCR API can contribute to the development of efficient strategies for character recognition even with complex cursive styles. A method adaptable to the classification and segmentation of text images with cursive styles is proposed in this article. Tesseract is a popular and effective OCR library with a rich API that can enrich the CNN-OCR model.
21

Mandal, Sayan, Samit Biswas, Amit Kumar Das, and Bhabatosh Chanda. "Land Map Image Dataset: Ground-Truth And Classification Using Visual And Textural Features." Image Processing & Communications 19, no. 4 (December 1, 2014): 37–55. http://dx.doi.org/10.1515/ipc-2015-0024.

Full text
Abstract:
Research on document image analysis has been actively pursued in the last few decades, and services like OCR, vectorization of drawings/graphics and various types of form processing are very common. Handwritten documents, old historical documents and documents captured through cameras are now the subjects of active research. However, research on another very important type of paper document, namely the map document, suffers due to the inherent complexities of map documents and also the non-availability of benchmark public data-sets. This paper presents a new data-set, namely the Land Map Image Database (LMIDb), that consists of a variety of land map images (446 images at present and growing; scanned at 200/300 dpi in TIF format) and the corresponding ground-truth. Using semi-automatic tools, the non-text part of the images is deleted and the text-only ground-truth is also kept in the database. This paper also presents a classification strategy for map images, using which the maps in the database are automatically classified into Political (Po), Physical (Ph), Resource (R) and Topographic (T) maps. The automatic classification of maps helps the indexing of the images in LMIDb for archival and easy retrieval of the right maps to get the appropriate geographical information. Classification accuracy is also tested on the proposed data-set and the result is encouraging.
22

Egghe, L., and C. Michel. "Strong similarity measures for ordered sets of documents in information retrieval." Information Processing & Management 38, no. 6 (November 2002): 823–48. http://dx.doi.org/10.1016/s0306-4573(01)00051-6.

Full text
23

Vött, Andreas, Timo Willershäuser, Björn R. Röbke, Lea Obrocki, Peter Fischer, Hanna Hadler, Kurt Emde, Birgitta Eder, Hans-Joachim Gehrke, and Franziska Lang. "Major flood events recorded in the Holocene sedimentary sequence of the uplifted Ladiko and Makrisia basins near ancient Olympia (western Peloponnese, Greece)." Zeitschrift für Geomorphologie, Supplementary Issues 62, no. 2 (October 1, 2019): 143–95. http://dx.doi.org/10.1127/zfg_suppl/2018/0499.

Full text
Abstract:
Detailed palaeoenvironmental studies were conducted in the Ladiko and Makrisia basins near the Alpheios River and ancient Olympia (western Peloponnese, Greece) to assess major landscape changes during the Holocene. Previous studies and literature data document that the area experienced crust uplift of minimum 13 m to 30 m since the mid-Holocene. Geological archives were sampled along a vibracore transect connecting the Ladiko and Makrisia basins. Sediment cores were analyzed using sedimento-logical, geochemical and micropalaeontological methods. Geochronological reconstruction of major landscape changes is based on a set of 24 radiocarbon dates. Geophysical studies were carried out using electrical resistivity tomography (ERT) and Direct Push-Electrical Conductivity (DP-EC) measurements to detect stratigraphic changes and subsurface bedrock structures. The stratigraphic record of the uplifted lake basins of Ladiko and Makrisia revealed two major lithostratigraphic units. Unit I, predominantly composed of clay, silt and silty fine sand, reflects prevailing low-energy sedimentary conditions typical of quiescent (fluvio-)limnic waterbodies. Unit II is made out of fine to coarse sand and documents repeated interferences of unit I associated with abrupt and temporary high-energy flood type (= heft) events. We found signals of four different heft events (H1 to H4) showing strong stratigraphic and geochronological consistencies along the vibracore transect. The following age ranges were determined: H1 – between 4360 – 4330 cal BC and 4320 – 4080 cal BC; H2 – be- tween 2830 – 2500 cal BC and 2270 – 2140 cal BC; H3 – between 1220 –1280 cal AD and 1290 –1390 cal AD; H4 – between 1640 –1800 cal AD and 1650 –1800 cal AD. Different hypotheses concerning the characteristics, potential trigger mechanisms and causes of the flood events were tested against the background of strong Holocene crust uplift and using a variety of different methodological approaches: Geomorphological and granulometric aspects, micropalaeontological contexts, geochronological data sets, numerical simulation of flooding events, local tectonic uplift, and the palaeoclimate background were taken into account. We hypothesize that, during the mid-Holocene, the study area was affected by tsunami events, namely between 4360 – 4330 cal BC and 4320 – 4080 cal BC (H1) and between 2830 – 2500 cal BC and 2270 – 2140 cal BC (H2). These ages are very well consistent with the supra-regional and regional tsunami event signal retrieved from many coastal archives in large parts of western Greece. The timing of flood events H1 and H2 is highly consistent with ages of (supra-)regional tectonic events known from literature and is not consistent with increased flood indices of palaeoclimate data available for western Greece. Tsunami inundation scenarios based on numerical simulation are highly consistent with vibracoring and geophysical (ERT, DP-EC) data. In contrast, heft events H3 and H4 are possibly related to phases of increased precipi- tation and flooding activity in the Mediterranean or to land-based geomorphological processes triggered by regional tectonic events (RTE). Neolithic, Chalcolithic as well as Early and Middle Helladic human activities documented at ancient Olympia were most probably affected by tsunami heft events H1 and H2. Sandy deposits of tsunami event H2, covering the prehistorical tumulus, seem to have been used as a higher and dry base to construct the apsidal houses in the center of the later sanctuary at Olympia. 
The site, already abandoned, must have again been subject to major flood events during the 13/14th cent. AD and the 17–19th cent. AD associated with heft events H3 and H4.
24

Masson, Patrick K. "Searching with combination sets in CPC: An efficient way to retrieve relevant documents." World Patent Information 54 (September 2018): S93–S98. http://dx.doi.org/10.1016/j.wpi.2017.03.007.

Full text
25

Jatwani, Poonam, Pradeep Tomar, and Vandana Dhingra. "Comparative Performance Evaluation of Keyword and Semantic Search Engines using Different Query Set Categories." Recent Advances in Computer Science and Communications 13, no. 5 (November 5, 2020): 1057–70. http://dx.doi.org/10.2174/2213275912666190328202153.

Full text
Abstract:
Background: Keyword search engines are unable to understand the intention of the user; as a result, they produce enormous result sets in which the user must distinguish between relevant and non-relevant answers to their queries. This has led to a growing need to study the search capabilities of different search engines. In this research work, an experimental evaluation based on different metrics is carried out to distinguish search engines by the type of query they can handle. Methods: To check semantics-handling performance, four types of query sets consisting of 20 agriculture-domain queries were chosen: single-term queries, two-term queries, three-term queries and NLP queries. Queries from each query set were submitted to the Google, DuckDuckGo and Bing search engines. The effectiveness of the different search engines for the different kinds of queries is experimentally evaluated using graded relevance measures such as Cumulative Gain, Discounted Cumulative Gain, Ideal Discounted Cumulative Gain, and Normalized Discounted Cumulative Gain, in addition to the precision metric. Results: Our experimental results demonstrate that for single-term queries Google retrieves more relevant documents and performs better, while DuckDuckGo retrieves more relevant documents for NLP queries. Conclusion: The analysis done in this research shows that DuckDuckGo understands human intention and retrieves more relevant results through NLP queries compared with the other search engines.
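
For reference, the graded-relevance measures mentioned above can be computed as in the short sketch below; the relevance grades are invented for illustration.

```python
# Worked sketch of CG, DCG, IDCG and NDCG for one ranked result list.
import math

def dcg(grades):
    # standard formulation: grade at rank i is discounted by log2(i + 1), ranks start at 1
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades))

def ndcg(grades):
    ideal = dcg(sorted(grades, reverse=True))   # IDCG: the same grades in the ideal order
    return dcg(grades) / ideal if ideal else 0.0

retrieved = [3, 2, 3, 0, 1, 2]                  # relevance grades in retrieved order
print("CG   =", sum(retrieved))
print("DCG  =", round(dcg(retrieved), 3))
print("IDCG =", round(dcg(sorted(retrieved, reverse=True)), 3))
print("NDCG =", round(ndcg(retrieved), 3))
```
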
26

Wang, Tong, Qi Han, and Bauke de Vries. "Sustainable Industrial Site Redevelopment Planning Support." International Journal of E-Planning Research 6, no. 2 (April 2017): 39–53. http://dx.doi.org/10.4018/ijepr.2017040103.

Full text
Abstract:
Abandoned industrial sites could be redeveloped in a sustainable way with the help of previous experience. This paper presents a case-based reasoning (CBR) approach to support sustainable industrial site redevelopment. For a target site that needs to be redeveloped, qualitative important key concerns are identified and quantitative attributes, which are important for sustainability, are calculated. The key concerns are generated from zoning documents and the attributes are calculated from spatial data sets. Machine learning techniques are used to find the most influential attributes to determine transition forms. Similar cases from the constructed case base are retrieved based on the algorithm the authors have proposed. The North Brabant region in the Netherlands is used as a case study. A web application is presented to illustrate the approach. The e-planning method provides a straightforward way to retrieve transition forms from similarly redeveloped cases for new regional planning tasks with a focus on sustainability.
27

Venkanna, Gugulothu, and Dr K. F. Bharati. "Optimal Text Document Clustering Enabled by Weighed Similarity Oriented Jaya With Grey Wolf Optimization Algorithm." Computer Journal 64, no. 6 (April 30, 2021): 960–72. http://dx.doi.org/10.1093/comjnl/bxab013.

Full text
Abstract:
Owing to scientific development, a variety of challenges are present in the field of information retrieval. These challenges arise because of the increased usage of large volumes of data. These huge amounts of data come from large-scale distributed networks, and centralizing them to carry out analysis is tricky. There is therefore a requirement for novel text document clustering algorithms that overcome these challenges. The two most important challenges in clustering are clustering accuracy and quality. For this reason, this paper presents an ideal clustering model for text documents using term frequency–inverse document frequency, which is considered as the feature set. Here, particular attention is paid to the initial centroid selection, which allows the text to be clustered automatically using a weighted similarity measure in the proposed clustering process. In fact, the weighted similarity function involves the inter-cluster and intra-cluster similarity of both ordered and unordered documents, which is used to minimize the weighted similarity among the documents. An advanced model for clustering is proposed using a hybrid optimization algorithm that combines the Jaya Algorithm (JA) and the Grey Wolf Optimization algorithm (GWO), and so the proposed algorithm is termed JA-based GWO. Finally, the performance of the proposed model is verified through a comparative analysis with the state-of-the-art models. The performance analysis exhibits that the proposed model is 96.56% better than the genetic algorithm, 99.46% better than particle swarm optimization, 97.09% superior to the Dragonfly algorithm, and 96.21% better than JA for the similarity index. Therefore, the proposed model has confirmed its efficiency through valuable analysis.
28

Devezas, José. "Graph-based entity-oriented search." ACM SIGIR Forum 55, no. 1 (June 2021): 1–2. http://dx.doi.org/10.1145/3476415.3476430.

Full text
Abstract:
Entity-oriented search has revolutionized search engines. In the era of Google Knowledge Graph and Microsoft Satori, users demand an effortless process of search. Whether they express an information need through a keyword query, expecting documents and entities, or through a clicked entity, expecting related entities, there is an inherent need for the combination of corpora and knowledge bases to obtain an answer. Such integration frequently relies on independent signals extracted from inverted indexes, and from quad indexes indirectly accessed through queries to a triplestore. However, relying on two separate representation models inhibits the effective cross-referencing of information, discarding otherwise available relations that could lead to a better ranking. Moreover, different retrieval tasks often demand separate implementations, although the problem is, at its core, the same. With the goal of harnessing all available information to optimize retrieval, we explore joint representation models of documents and entities, while taking a step towards the definition of a more general retrieval approach. Specifically, we propose that graphs should be used to incorporate explicit and implicit information derived from the relations between text found in corpora and entities found in knowledge bases. We also take advantage of this framework to elaborate a general model for entity-oriented search, proposing a universal ranking function for the tasks of ad hoc document retrieval (leveraging entities), ad hoc entity retrieval, and entity list completion. At a conceptual stage, we begin by proposing the graph-of-entity, based on the relations between combinations of term and entity nodes. We introduce the entity weight as the corresponding ranking function, relying on the idea of seed nodes for representing the query, either directly through term nodes, or based on the expansion to adjacent entity nodes. The score is computed based on a series of geodesic distances to the remaining nodes, providing a ranking for the documents (or entities) in the graph. In order to improve on the low scalability of the graph-of-entity, we then redesigned this model in a way that reduced the number of edges in relation to the number of nodes, by relying on the hypergraph data structure. The resulting model, which we called hypergraph-of-entity, is the main contribution of this thesis. The obtained reduction was achieved by replacing binary edges with n -ary relations based on sets of nodes and entities (undirected document hyperedges), sets of entities (undirected hyperedges, either based on cooccurrence or a grouping by semantic subject), and pairs of a set of terms and a set of one entity (directed hyperedges, mapping text to an object). We introduce the random walk score as the corresponding ranking function, relying on the same idea of seed nodes, similar to the entity weight in the graph-of-entity. Scoring based on this function is highly reliant on the structure of the hypergraph, which we call representation-driven retrieval. As such, we explore several extensions of the hypergraph-of-entity, including relations of synonymy, or contextual similarity, as well as different weighting functions per node and hyperedge type. We also propose TF-bins as a discretization for representing term frequency in the hypergraph-of-entity. 
For the random walk score, we propose and explore several parameters, including length and repeats, with or without seed node expansion, direction, or weights, and with or without a certain degree of node and/or hyperedge fatigue, a concept that we also propose. For evaluation, we took advantage of TREC 2017 OpenSearch track, which relied on an online evaluation process based on the Living Labs API, and we also participated in TREC 2018 Common Core track, which was based on the newly introduced TREC Washington Post Corpus. Our main experiments were supported on the INEX 2009 Wikipedia collection, which proved to be a fundamental test collection for assessing retrieval effectiveness across multiple tasks. At first, our experiments solely focused on ad hoc document retrieval, ensuring that the model performed adequately for a classical task. We then expanded the work to cover all three entity-oriented search tasks. Results supported the viability of a general retrieval model, opening novel challenges in information retrieval, and proposing a new path towards generality in this area.
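
A heavily simplified sketch of the random-walk scoring idea over seed nodes, using an ordinary graph rather than the thesis's hypergraph-of-entity; the term (t:), document (d) and entity (e:) nodes below are invented for illustration.

```python
# Random-walk score from seed nodes on a tiny term/document/entity graph: start walks at
# the query's seed nodes, count visits, and rank document nodes by visit frequency.
import random
from collections import Counter

graph = {
    "t:python":  ["d1", "d2", "e:Python"],
    "t:search":  ["d1", "d3"],
    "d1": ["t:python", "t:search", "e:Python"],
    "d2": ["t:python"],
    "d3": ["t:search", "e:Lucene"],
    "e:Python": ["d1", "d2", "t:python"],
    "e:Lucene": ["d3"],
}

def random_walk_score(seeds, walks=2000, length=3, seed=0):
    rng = random.Random(seed)
    visits = Counter()
    for _ in range(walks):
        node = rng.choice(seeds)                 # each walk starts at a seed (query) node
        for _ in range(length):
            node = rng.choice(graph[node])
            visits[node] += 1
    return visits

scores = random_walk_score(["t:python", "t:search"])
print([n for n, _ in scores.most_common() if n.startswith("d")])   # document ranking
```
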
29

Al Rababaa, Mamoun Suleiman, and Essam Said Hanandeh. "The Automated VSMs to Categorize Arabic Text Data Sets." INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY 13, no. 1 (March 31, 2014): 4074–81. http://dx.doi.org/10.24297/ijct.v13i1.2925.

Full text
Abstract:
Text categorization is one of the most important tasks in information retrieval and data mining. This paper investigates different variations of vector space models (VSMs) using the KNN algorithm. We used the 242 Arabic abstract documents that were used by (Hmeidi & Kanaan, 1997). The bases of our comparison are the most popular text evaluation measures: the Recall, Precision and F1 measures. The experimental results on the Saudi data sets reveal that the Cosine coefficient outperformed the Dice and Jaccard coefficients.
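
The three coefficients compared in the paper can be written, in their vector form over term weights, as in the sketch below; the toy vectors are invented and the paper's Arabic data are not used.

```python
# Cosine, Dice and (extended) Jaccard coefficients over simple term-weight vectors.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def dice(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 2 * dot / (sum(x * x for x in a) + sum(y * y for y in b))

def jaccard(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x * x for x in a) + sum(y * y for y in b) - dot)

doc = [2, 0, 1, 3, 0]       # toy term counts for a document
query = [1, 0, 0, 2, 1]     # toy term counts for a query
print(round(cosine(doc, query), 3), round(dice(doc, query), 3), round(jaccard(doc, query), 3))
```
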
30

Umamaheswari, E., and T. V. Geetha. "Event Mining Through Clustering." Journal of Intelligent Systems 23, no. 1 (January 1, 2014): 59–73. http://dx.doi.org/10.1515/jisys-2013-0025.

Full text
Abstract:
Traditional document clustering algorithms consider text-based features such as unique word count, concept count, etc. to cluster documents. Meanwhile, event mining is the extraction of specific events, their related sub-events, and the associated semantic relations from documents. This work discusses an approach to event mining through clustering. The Universal Networking Language (UNL)-based subgraph, a semantic representation of the document, is used as the input for clustering. Our research focuses on exploring the use of three different feature sets for event clustering and comparing the approaches used for specific event mining. In our previous work, the clustering algorithm used UNL-based event semantics to represent event context for clustering. However, this approach resulted in different events with similar semantics being clustered together. Hence, instead of considering only UNL event semantics, we considered assigning additional weights to similarity between event contexts with event-related attributes such as time, place, and persons. Although we get specific events in a single cluster, sub-events related to the specific events are not necessarily in a single cluster. Therefore, to improve our cluster efficiency, connective terms between two sentences and their representation as UNL subgraphs were also considered for similarity determination. By combining UNL semantics, event-specific arguments similarity, and connective term concepts between sentences, we were able to obtain clusters for specific events and their sub-events. We have used 112 000 Tamil documents from the Forum for Information Retrieval Evaluation data corpus and achieved good results. We have also compared our approach with the previous state-of-the-art approach for the Reuters-RCV1 corpus and achieved 30% improvements in precision.
31

Tipka, Anne, Leopold Haimberger, and Petra Seibert. "Flex_extract v7.1.2 – a software package to retrieve and prepare ECMWF data for use in FLEXPART." Geoscientific Model Development 13, no. 11 (November 5, 2020): 5277–310. http://dx.doi.org/10.5194/gmd-13-5277-2020.

Full text
Abstract:
Abstract. Flex_extract is an open-source software package to efficiently retrieve and prepare meteorological data from the European Centre for Medium-Range Weather Forecasts (ECMWF) as input for the widely used Lagrangian particle dispersion model FLEXPART and the related trajectory model FLEXTRA. ECMWF provides a variety of data sets which differ in a number of parameters (available fields, spatial and temporal resolution, forecast start times, level types etc.). Therefore, the selection of the right data for a specific application and the settings needed to obtain them are not trivial. Consequently, the data sets which can be retrieved through flex_extract by both member-state users and public users as well as their properties are explained. Flex_extract 7.1.2 is a substantially revised version with completely restructured code, mainly written in Python 3, which is introduced with all its input and output files and an explanation of the four application modes. Software dependencies and the methods for calculating the native vertical velocity η˙, the handling of flux data and the preparation of the final FLEXPART input files are documented. Considerations for applications give guidance with respect to the selection of data sets, caveats related to the land–sea mask and orography, etc. Formal software quality-assurance methods have been applied to flex_extract. A set of unit and regression tests as well as code metric data are also supplied. A short description of the installation and usage of flex_extract is provided in the Appendix. The paper points also to an online documentation which will be kept up to date with respect to future versions.
APA, Harvard, Vancouver, ISO, and other styles
32

Jiang, Yongbo, Juncheng Lu, and Tao Feng. "Fuzzy Keyword Searchable Encryption Scheme Based on Blockchain." Information 13, no. 11 (October 28, 2022): 517. http://dx.doi.org/10.3390/info13110517.

Full text
Abstract:
Searchable encryption is a keyword-based ciphertext retrieval technique that can selectively retrieve encrypted documents from encrypted cloud data. Most existing searchable encryption schemes focus only on exact keyword searches and cannot return data of interest in a fuzzy search. In addition, during searchable encryption, the cloud server may return invalid results to the data user to save computing costs or for other reasons. At the same time, the user may refuse to pay the service fee after receiving the correct result. To solve these problems, this paper proposes a fuzzy keyword searchable encryption scheme based on blockchain, which uses edit distance to generate fuzzy keyword sets and generates a secure index with verification tags for each fuzzy keyword set to verify the authenticity of the returned results. A penalty mechanism is introduced through the blockchain to realize fairness of service payment between users and cloud servers. Security analysis shows that the scheme achieves non-adaptive semantic security. Performance analysis and functional comparison show that the scheme is effective and can meet the requirements of search applications in the cloud environment.
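To make the fuzzy-keyword-set construction concrete, here is a minimal sketch that builds such a set by enumerating dictionary words within a chosen edit distance of a query keyword. It only illustrates the edit-distance step; the scheme's secure index, verification tags, and blockchain penalty mechanism are not reproduced, and the vocabulary below is made up.

    def edit_distance(a: str, b: str) -> int:
        # Classic dynamic-programming Levenshtein distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    def fuzzy_keyword_set(keyword: str, dictionary, d: int = 1):
        # All dictionary words within edit distance d of the keyword; in the scheme,
        # each such set would then be indexed together with a verification tag.
        return {w for w in dictionary if edit_distance(keyword, w) <= d}

    vocab = {"secure", "secret", "server", "search", "serch", "searchy"}   # hypothetical vocabulary
    print(fuzzy_keyword_set("search", vocab, d=1))   # {'search', 'serch', 'searchy'}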
APA, Harvard, Vancouver, ISO, and other styles
33

Yadav, Nidhika, and Niladri Chatterjee. "Fuzzy Rough Set Based Technique for User Specific Information Retrieval." International Journal of Rough Sets and Data Analysis 5, no. 4 (October 2018): 32–47. http://dx.doi.org/10.4018/ijrsda.2018100102.

Full text
Abstract:
Information retrieval is widely used due to the extremely large volume of text and image data available on the web; consequently, efficient retrieval is required. Text information retrieval is the branch of information retrieval that deals with text documents. Another key concern is a retrieval engine that works according to a specific user, often referred to as user-specific information retrieval. This article performs a preliminary investigation of the proposed fuzzy rough set-based model for user-specific text information retrieval. The model improves on the computational time required to compute the approximations compared to the classical fuzzy rough set model by using Wikipedia as the information source. The technique also improves the accuracy of clustering obtained for user-specified classes.
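For readers unfamiliar with the underlying machinery, the sketch below computes standard fuzzy-rough lower and upper approximations of a fuzzy set under a fuzzy similarity relation, using the common Kleene–Dienes implicator and min t-norm. It is a generic textbook formulation, not the paper's Wikipedia-accelerated model, and the toy matrices are invented.

    import numpy as np

    def fuzzy_rough_approximations(R, A):
        # Lower and upper fuzzy-rough approximations of a fuzzy set A under a
        # fuzzy similarity relation R (n x n):
        #   lower(A)(x) = min_y max(1 - R(x, y), A(y))
        #   upper(A)(x) = max_y min(R(x, y), A(y))
        R = np.asarray(R, dtype=float)
        A = np.asarray(A, dtype=float)
        lower = np.min(np.maximum(1.0 - R, A[None, :]), axis=1)
        upper = np.max(np.minimum(R, A[None, :]), axis=1)
        return lower, upper

    # Toy example: three documents, an invented similarity relation and fuzzy class.
    R = np.array([[1.0, 0.8, 0.2],
                  [0.8, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])
    A = np.array([1.0, 0.6, 0.1])        # membership of each document in the class
    low, up = fuzzy_rough_approximations(R, A)
    print(low, up)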
APA, Harvard, Vancouver, ISO, and other styles
34

YU, YI, KAZUKI JOE, VINCENT ORIA, FABIAN MOERCHEN, J. STEPHEN DOWNIE, and LEI CHEN. "MULTI-VERSION MUSIC SEARCH USING ACOUSTIC FEATURE UNION AND EXACT SOFT MAPPING." International Journal of Semantic Computing 03, no. 02 (June 2009): 209–34. http://dx.doi.org/10.1142/s1793351x09000732.

Full text
Abstract:
Research on audio-based music retrieval has primarily concentrated on refining audio features to improve search quality. However, much less work has been done on improving the time efficiency of music audio searches. Representing music audio documents in an indexable format provides a mechanism for achieving efficiency. To address this issue, in this work Exact Locality Sensitive Mapping (ELSM) is suggested to join the concatenated feature sets and soft hash values. On this basis we propose audio-based music indexing techniques, ELSM and Soft Locality Sensitive Hash (SoftLSH), using an optimized Feature Union (FU) set of extracted audio features. Two contributions are made here. First, the principle of similarity-invariance is applied in summarizing audio feature sequences and utilized in training semantic audio representations based on regression. Second, soft hash values are pre-calculated to help locate the search range more accurately and improve the collision probability among similar features. Our algorithms are implemented in a demonstration system to show how to retrieve and evaluate multi-version audio documents. Experimental evaluation over a real "multi-version" audio dataset confirms the practicality of ELSM and SoftLSH with FU and proves that our algorithms are effective for both multi-version detection (online query, one-query vs. multi-object) and same content detection (batch queries, multi-queries vs. one-object).
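As background for the locality-sensitive hashing idea, here is a minimal sign-random-projection LSH sketch in which similar feature vectors tend to fall into the same bucket. It is a generic construction over invented data, not the paper's ELSM or SoftLSH algorithms and not its Feature Union features.

    import numpy as np

    class RandomProjectionLSH:
        # Minimal sign-random-projection hashing: vectors with high cosine
        # similarity tend to receive the same bit string, hence the same bucket.
        def __init__(self, dim, n_bits=12, seed=0):
            rng = np.random.default_rng(seed)
            self.planes = rng.standard_normal((n_bits, dim))

        def hash(self, x):
            bits = (self.planes @ np.asarray(x, dtype=float)) >= 0
            return int("".join("1" if b else "0" for b in bits), 2)

    # Toy usage: index pretend audio descriptors, then look up a query's bucket.
    rng = np.random.default_rng(1)
    features = rng.standard_normal((100, 64))            # stand-in audio feature vectors
    lsh = RandomProjectionLSH(dim=64, n_bits=12)
    buckets = {}
    for i, f in enumerate(features):
        buckets.setdefault(lsh.hash(f), []).append(i)
    query = features[0] + 0.01 * rng.standard_normal(64)  # slightly perturbed copy of item 0
    candidates = buckets.get(lsh.hash(query), [])
    print(0 in candidates)                                 # typically True: near-duplicates collide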
APA, Harvard, Vancouver, ISO, and other styles
35

Shang, Qian, Ming Xu, Bin Qin, Pengbin Lei, and Junjian Huang. "Intelligent Question Answering System Based on Machine Reading Comprehension." Journal of Physics: Conference Series 2050, no. 1 (October 1, 2021): 012002. http://dx.doi.org/10.1088/1742-6596/2050/1/012002.

Full text
Abstract:
Abstract. A question answering (Q&A) system is important for accelerating the practical deployment of artificial intelligence. This paper improves a Q&A system that uses the retrieval plus machine reading comprehension (MRC) method. In the retrieval phase, we use BM25 to recall some documents and split these documents into paragraphs; we then reorder the paragraphs according to their correlation with the question, so as to reduce the number of recalled paragraphs and improve the speed of MRC. In the MRC stage, we design a multi-task MRC structure, which can judge whether a paragraph contains the answer and locate the answer accurately. Besides, we modify the loss function to fit the sparse labels during training. Experiments are carried out on multiple data sets to verify the effectiveness of the improved system.
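To illustrate the BM25 recall step described above, the following self-contained sketch scores tokenised paragraphs against a query with the Okapi BM25 formula, so that only the top-ranked paragraphs would be passed to the reader. Tokenisation, parameters, and the toy paragraphs are simplifying assumptions; the paper's multi-task MRC reader is not shown.

    import math
    from collections import Counter

    def bm25_scores(query_terms, paragraphs, k1=1.5, b=0.75):
        # Okapi BM25 score of each tokenised paragraph against the query terms.
        N = len(paragraphs)
        avgdl = sum(len(p) for p in paragraphs) / N
        df = Counter()
        for p in paragraphs:
            df.update(set(p))                      # document frequency of each term
        scores = []
        for p in paragraphs:
            tf = Counter(p)
            s = 0.0
            for t in query_terms:
                if t not in tf:
                    continue
                idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
                s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(p) / avgdl))
            scores.append(s)
        return scores

    # Hypothetical pre-tokenised paragraphs recalled for a question.
    paragraphs = [["bm25", "ranks", "paragraphs"],
                  ["the", "reader", "extracts", "answers"],
                  ["bm25", "recall", "then", "reader"]]
    query = ["bm25", "reader"]
    scores = bm25_scores(query, paragraphs)
    ranked = sorted(range(len(paragraphs)), key=lambda i: -scores[i])
    print(ranked)                                  # e.g. [2, 0, 1]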
APA, Harvard, Vancouver, ISO, and other styles
36

Heorton, Harold, Michel Tsamados, Thomas Armitage, Andy Ridout, and Jack Landy. "CryoSat-2 Significant Wave Height in Polar Oceans Derived Using a Semi-Analytical Model of Synthetic Aperture Radar 2011–2019." Remote Sensing 13, no. 20 (October 18, 2021): 4166. http://dx.doi.org/10.3390/rs13204166.

Full text
Abstract:
This paper documents the retrieval of significant ocean surface wave heights in the Arctic Ocean from CryoSat-2 data. We use a semi-analytical model for an idealised synthetic aperture or pulse-limited radar altimeter echo power. We develop a processing methodology that specifically considers both the synthetic aperture and pulse-limited modes of the radar, which change close to the sea ice edge within the Arctic Ocean. All CryoSat-2 echoes to date were matched by our idealised echo model, revealing wave heights over the period 2011–2019. Our retrieved data were contrasted with existing processing of CryoSat-2 data and wave model data, showing the improved fidelity and accuracy of the semi-analytical echo power model and the newly developed processing methods. We contrasted our data with in situ wave buoy measurements, showing improved data retrievals in seasonally sea-ice-covered seas. We have shown the importance of directly considering the correct satellite mode of operation in the Arctic Ocean, where SAR is the dominant operating mode. Our new data are of specific use for wave model validation close to the sea ice edge and are available at the link in the data availability statement.
APA, Harvard, Vancouver, ISO, and other styles
37

Ferretti, Stefano, Marco Roccetti, and Claudio E. Palazzi. "Web Content Search and Adaptation for IDTV: One Step Forward in the Mediamorphosis Process toward Personal-TV." Advances in Multimedia 2007 (2007): 1–13. http://dx.doi.org/10.1155/2007/16296.

Full text
Abstract:
We are on the threshold of a mediamorphosis that will revolutionize the way we interact with our TV sets. The combination between interactive digital TV (IDTV) and the Web fosters the development of new interactive multimedia services enjoyable even through a TV screen and a remote control. Yet, several design constraints complicate the deployment of this new pattern of services. Prominent unresolved issues involve macro-problems such as collecting information on the Web based on users' preferences and appropriately presenting retrieved Web contents on the TV screen. To this aim, we propose a system able to dynamically convey contents from the Web to IDTV systems. Our system presents solutions both for personalized Web content search and automatic TV-format adaptation of retrieved documents. As we demonstrate through two case study applications, our system merges the best of IDTV and Web domains spinning the TV mediamorphosis toward the creation of the personal-TV concept.
APA, Harvard, Vancouver, ISO, and other styles
38

Quan, Tran Lam, and Vu Tat Thang. "An Approach Using Concept Lattice Structure for Data Mining and Information Retrieval." Journal of Science and Technology: Issue on Information and Communications Technology 1 (August 31, 2015): 1. http://dx.doi.org/10.31130/jst.2015.4.

Full text
Abstract:
Since the 1980s, the concept lattice has been studied and applied to problems of text mining, frequent itemsets, classification, etc. Formal concept analysis (FCA) is one of the main techniques applied to the concept lattice. FCA is a mathematical theory applied to data mining by setting up a table whose rows describe objects and whose columns describe attributes, together with the relationships between them, and then building the concept lattice structure. In the area of information retrieval, FCA treats the correlation of objects and attributes the same as that of documents and terms. In the process of setting up the lattice, FCA defines each node in the lattice as a concept. The algorithm for constructing the concept lattice attaches a pair to each node: a set of documents with common terms and the set of terms that co-occur in those documents. On a larger scale, each concept in the lattice can be viewed as a question–answer pair. In the lattice, browsing up or down the nodes leads to more general or more detailed concepts, respectively.
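The following toy sketch makes the document–term formulation concrete: it enumerates all formal concepts (extent, intent pairs) of a tiny binary context by closing every subset of documents. The brute-force enumeration is only workable for toy contexts and is not the lattice-construction algorithm discussed in the paper; the example documents and terms are invented.

    from itertools import combinations

    # Hypothetical formal context: documents (objects) x terms (attributes).
    context = {
        "d1": {"retrieval", "lattice"},
        "d2": {"retrieval", "lattice", "fca"},
        "d3": {"fca", "mining"},
        "d4": {"retrieval", "mining"},
    }
    all_terms = set().union(*context.values())

    def common_terms(docs):
        # Intent: terms shared by every document in the set.
        return set.intersection(*(context[d] for d in docs)) if docs else set(all_terms)

    def docs_having(terms):
        # Extent: documents containing every term in the set.
        return {d for d, t in context.items() if terms <= t}

    # Every pair (S'', S') obtained from a subset S of documents is a formal concept,
    # and every concept arises this way; exponential, so only for small examples.
    concepts = set()
    objects = list(context)
    for r in range(len(objects) + 1):
        for subset in combinations(objects, r):
            intent = common_terms(set(subset))
            extent = docs_having(intent)
            concepts.add((frozenset(extent), frozenset(intent)))

    for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
        print(sorted(extent), sorted(intent))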
APA, Harvard, Vancouver, ISO, and other styles
39

Selvaraj, Suganya, and Eunmi Choi. "Dynamic Sub-Swarm Approach of PSO Algorithms for Text Document Clustering." Sensors 22, no. 24 (December 9, 2022): 9653. http://dx.doi.org/10.3390/s22249653.

Full text
Abstract:
Text document clustering is one of the data mining techniques used in many real-world applications such as information retrieval from IoT sensor data, duplicate content detection, and document organization. Swarm intelligence (SI) algorithms are suitable for solving complex text document clustering problems compared to traditional clustering algorithms. Previous studies show that, among SI algorithms, particle swarm optimization (PSO) provides an effective solution to text document clustering problems. PSO still needs to be improved to avoid problems such as premature convergence to local optima. In this paper, an approach called dynamic sub-swarm of PSO (subswarm-PSO) is proposed to improve the results of PSO for text document clustering problems and to avoid local optima by improving the global search capabilities of PSO. The results of this proposed approach were compared with the standard PSO algorithm and the K-means algorithm. For performance assessment, the evaluation metric purity is used with six benchmark data sets. The experimental results of this study show that our proposed subswarm-PSO algorithm performs best, with higher purity than the standard PSO and traditional K-means algorithms, while its execution time is slightly lower than that of the standard PSO algorithm.
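As a small aid to reading the evaluation, here is a sketch of the purity metric mentioned in the abstract: each cluster is credited with the count of its most frequent gold class, and the total is divided by the number of documents. The labels below are made up; the subswarm-PSO algorithm itself is not reproduced here.

    from collections import Counter

    def purity(cluster_labels, class_labels):
        # Cluster purity: sum over clusters of the size of the majority gold class,
        # divided by the total number of documents; 1.0 means every cluster is pure.
        clusters = {}
        for c, y in zip(cluster_labels, class_labels):
            clusters.setdefault(c, []).append(y)
        majority = sum(Counter(members).most_common(1)[0][1] for members in clusters.values())
        return majority / len(class_labels)

    # Toy example: 8 documents, 3 predicted clusters vs. gold classes.
    pred = [0, 0, 0, 1, 1, 1, 2, 2]
    gold = ["a", "a", "b", "b", "b", "b", "c", "a"]
    print(purity(pred, gold))   # (2 + 3 + 1) / 8 = 0.75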
APA, Harvard, Vancouver, ISO, and other styles
40

Gupta, Yogesh, and Ashish Saini. "A New Hybrid Document Clustering for PRF-Based Automatic Query Expansion Approach for Effective IR." International Journal of e-Collaboration 16, no. 3 (July 2020): 73–95. http://dx.doi.org/10.4018/ijec.2020070105.

Full text
Abstract:
Automatic query expansion (AQE) is an effective measure to improve information retrieval performance by including additional terms in a user query. The pseudo-relevance feedback (PRF) methods employed for AQE so far have suffered from a major problem of query drift. With this in view, a new hybrid document clustering for PRF-based AQE approach is proposed in the present article. In it, fuzzy logic and particle swarm optimization (PSO) are used to construct document clusters. Further, a new and effective hybrid PSO and fuzzy logic-based term weighting approach is followed to find more suitable additional query terms by maximizing a weighted score of four IR evidences. Moreover, a combined semantic filtering method, along with query term re-weighting algorithms, is used to remove noisy or irrelevant terms semantically. The performance of the presented approaches is tested and compared with other approaches on three benchmark data sets. The comparative analysis of all the tested approaches illustrates the superior performance of the proposed approach.
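For orientation, the sketch below shows plain pseudo-relevance feedback on tf-idf vectors: the top-ranked documents for a query are assumed relevant and their strongest terms are appended to the query. It illustrates only the baseline PRF idea, under invented documents and parameters, not the paper's hybrid fuzzy/PSO term weighting or its semantic filtering.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    def prf_expand(query, docs, n_feedback=3, n_terms=2):
        # Rank documents by cosine similarity on tf-idf, treat the top n_feedback
        # as relevant, and add their highest-weighted unseen terms to the query.
        vec = TfidfVectorizer()
        D = vec.fit_transform(docs)
        q = vec.transform([query])
        sims = (D @ q.T).toarray().ravel()
        top = np.argsort(-sims)[:n_feedback]
        centroid = np.asarray(D[top].mean(axis=0)).ravel()
        vocab = np.array(vec.get_feature_names_out())
        original = set(query.lower().split())
        expansion = [t for t in vocab[np.argsort(-centroid)] if t not in original][:n_terms]
        return query + " " + " ".join(expansion)

    # Hypothetical mini-collection and query.
    docs = ["fuzzy clustering of retrieved documents",
            "particle swarm optimization for clustering",
            "query expansion with relevance feedback",
            "cooking recipes for pasta"]
    print(prf_expand("query expansion", docs))   # expanded query string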
APA, Harvard, Vancouver, ISO, and other styles
41

Phonarin, Pilapan, Supot Nitsuwat, and Choochart Haruechaiyasak. "AGRIX: An Ontology Based Agricultural Expertise Retrieval Framework." Advanced Materials Research 403-408 (November 2011): 3714–18. http://dx.doi.org/10.4028/www.scientific.net/amr.403-408.3714.

Full text
Abstract:
Generally, information retrieval (IR) performs keyword search based on the user query to find a set of relevant documents. In the domain of agricultural expertise retrieval, the goal is to find a group of experts who have knowledge in agriculture (using publications as the evidence) specified by the input query. Typical publication IR systems can return result sets consisting of a huge number of publications, some of which are not relevant to the individual user's information need. In this paper, an ontology-based agricultural expertise retrieval framework called AGRIX is proposed, with a focus on ontology creation covering the three following aspects: (1) expert profiles and publications, (2) types of plants and (3) problem solving. To build the ontology model, we used a set of publications (1,249 records) collected from the Thai national AGRIS center, Bureau of Library, Kasetsart University. In addition, a set of inference rules is created to support the expertise retrieval task. By using AGRIX to implement agricultural expertise retrieval, users can search for experts from two perspectives: plant (e.g., rice, sugar cane) and problem solving (e.g., plant diseases, fertilizers).
APA, Harvard, Vancouver, ISO, and other styles
42

Snowden, Helen, and Sarah Marriott. "Developing a local shared care protocol for managing people with psychotic illness in primary care." Psychiatric Bulletin 27, no. 07 (July 2003): 261–66. http://dx.doi.org/10.1017/s0955603600002555.

Full text
Abstract:
Aims and Method: The National Service Framework sets standards to improve the treatment of mental health on a national level, and requires the development of localised shared care protocols. We aimed to develop a shared care protocol for use in local National Health Service (NHS) services, based on best practice guidelines and local consensus. A systematic literature search used three databases and the advice of a clinical expert. Articles satisfying the search inclusion criteria were retrieved and appraised. Clinical recommendations from well-designed regional and national documents relevant to all aspects of the management of psychotic illness in primary care were compared and contrasted by a facilitated group involving primary and secondary care clinicians who drafted the final recommendations. A multi-agency steering group guided the work. Results: Twenty-two articles were retrieved, of which nine reached the criteria for inclusion. The protocol provided a comprehensive range of recommendations regarding detection, assessment, management, referral and shared working with local mental health services. Clinical Implications: Using local clinical consensus to resolve uncertainty about conflicting clinical recommendations from a series of well-designed guidelines was an effective method for adapting clinical guidelines to local circumstances.
APA, Harvard, Vancouver, ISO, and other styles
43

Snowden, Helen, and Sarah Marriott. "Developing a local shared care protocol for managing people with psychotic illness in primary care." Psychiatric Bulletin 27, no. 7 (June 2003): 261–66. http://dx.doi.org/10.1192/pb.27.7.261.

Full text
Abstract:
Aims and Method: The National Service Framework sets standards to improve the treatment of mental health on a national level, and requires the development of localised shared care protocols. We aimed to develop a shared care protocol for use in local National Health Service (NHS) services, based on best practice guidelines and local consensus. A systematic literature search used three databases and the advice of a clinical expert. Articles satisfying the search inclusion criteria were retrieved and appraised. Clinical recommendations from well-designed regional and national documents relevant to all aspects of the management of psychotic illness in primary care were compared and contrasted by a facilitated group involving primary and secondary care clinicians who drafted the final recommendations. A multi-agency steering group guided the work. Results: Twenty-two articles were retrieved, of which nine reached the criteria for inclusion. The protocol provided a comprehensive range of recommendations regarding detection, assessment, management, referral and shared working with local mental health services. Clinical Implications: Using local clinical consensus to resolve uncertainty about conflicting clinical recommendations from a series of well-designed guidelines was an effective method for adapting clinical guidelines to local circumstances.
APA, Harvard, Vancouver, ISO, and other styles
44

Werner, Frank, Galina Wind, Zhibo Zhang, Steven Platnick, Larry Di Girolamo, Guangyu Zhao, Nandana Amarasinghe, and Kerry Meyer. "Marine boundary layer cloud property retrievals from high-resolution ASTER observations: case studies and comparison with Terra MODIS." Atmospheric Measurement Techniques 9, no. 12 (December 8, 2016): 5869–94. http://dx.doi.org/10.5194/amt-9-5869-2016.

Full text
Abstract:
Abstract. A research-level retrieval algorithm for cloud optical and microphysical properties is developed for the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) aboard the Terra satellite. It is based on the operational MODIS algorithm. This paper documents the technical details of this algorithm and evaluates the retrievals for selected marine boundary layer cloud scenes through comparisons with the operational MODIS Data Collection 6 (C6) cloud product. The newly developed, ASTER-specific cloud masking algorithm is evaluated through comparison with an independent algorithm reported in Zhao and Di Girolamo (2006). To validate and evaluate the cloud optical thickness (τ) and cloud effective radius (r_eff) from ASTER, the high-spatial-resolution ASTER observations are first aggregated to the same 1000 m resolution as MODIS. Subsequently, τ_aA and r_eff,aA retrieved from the aggregated ASTER radiances are compared with the collocated MODIS retrievals. For overcast pixels, the two data sets agree very well, with Pearson's product-moment correlation coefficients of R > 0.970. However, for partially cloudy pixels there are significant differences between r_eff,aA and the MODIS results, which can exceed 10 µm. Moreover, it is shown that the numerous delicate cloud structures in the example marine boundary layer scenes, resolved by the high-resolution ASTER retrievals, are smoothed by the MODIS observations. The overall good agreement between the research-level ASTER results and the operational MODIS C6 products proves the feasibility of MODIS-like retrievals from ASTER reflectance measurements and provides the basis for future studies concerning the scale dependency of satellite observations and three-dimensional radiative effects.
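Purely to illustrate the aggregation-and-comparison step described in this abstract, the sketch below block-averages a fine-resolution field onto a coarser grid and computes the Pearson correlation between two collocated coarse fields. The block size, synthetic fields, and noise level are invented; this is not the ASTER or MODIS retrieval code.

    import numpy as np

    def aggregate(field, block):
        # Average a fine-resolution 2-D field over non-overlapping block x block windows
        # (conceptually, fine ASTER pixels averaged up to ~1000 m cells).
        h, w = field.shape
        h, w = h - h % block, w - w % block
        return field[:h, :w].reshape(h // block, block, w // block, block).mean(axis=(1, 3))

    rng = np.random.default_rng(0)
    fine = rng.gamma(shape=2.0, scale=5.0, size=(600, 600))    # stand-in high-resolution retrieval
    coarse_a = aggregate(fine, 60)                              # aggregated "ASTER-like" field
    coarse_b = coarse_a + rng.normal(0, 0.05, coarse_a.shape)   # stand-in collocated "MODIS-like" field
    r = np.corrcoef(coarse_a.ravel(), coarse_b.ravel())[0, 1]   # Pearson correlation between the two
    print(round(r, 3))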
APA, Harvard, Vancouver, ISO, and other styles
45

Kathuria, Mamta, Chander Kumar Nagpal, and Neelam Duhan. "A Fuzzy Logic Based Synonym Resolution Approach for Automated Information Retrieval." International Journal on Semantic Web and Information Systems 14, no. 4 (October 2018): 92–109. http://dx.doi.org/10.4018/ijswis.2018100105.

Full text
Abstract:
Precise semantic similarity measurement between words is vital from the viewpoint of many automated applications in the areas of word sense disambiguation, machine translation, information retrieval, data clustering, etc. Rapid growth of automated resources and their diversified novel applications has further reinforced this requirement. However, accurate measurement of semantic similarity is a daunting task due to the inherent ambiguities of natural language and the spread of web documents across various domains, localities and dialects. All these issues render manually maintained semantic similarity resources (i.e., dictionaries) inadequate. This article uses context sets of the words under consideration in multiple corpora to compute semantic similarity, and provides credible and verifiable semantic similarity results directly usable by automated applications in an intelligent manner using a fuzzy inference mechanism. It can also be used to strengthen existing lexical resources by augmenting the context set and a properly defined extent of semantic similarity.
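As a rough illustration of context-set-based similarity, the sketch below collects each word's co-occurrence context from a toy corpus, scores the overlap with a Jaccard coefficient, and maps the score to a linguistic label. The threshold-based labelling is only a crude stand-in for the paper's fuzzy inference mechanism, and the corpus, window size, and thresholds are arbitrary assumptions.

    def context_set(word, corpus, window=2):
        # Words co-occurring with `word` within +/- window tokens.
        ctx = set()
        for sentence in corpus:
            tokens = sentence.lower().split()
            for i, t in enumerate(tokens):
                if t == word:
                    ctx.update(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
        return ctx

    def similarity(w1, w2, corpus):
        # Jaccard overlap of the two words' context sets.
        c1, c2 = context_set(w1, corpus), context_set(w2, corpus)
        return len(c1 & c2) / len(c1 | c2) if c1 | c2 else 0.0

    def fuzzy_label(score):
        # Crude fuzzification of the similarity score into linguistic labels.
        if score >= 0.6:
            return "highly similar"
        if score >= 0.3:
            return "moderately similar"
        return "weakly similar"

    corpus = ["the car sped down the road", "the automobile sped down the highway",
              "a car parked near the road", "an automobile parked near the highway"]
    s = similarity("car", "automobile", corpus)
    print(round(s, 2), fuzzy_label(s))   # e.g. 0.71 highly similar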
APA, Harvard, Vancouver, ISO, and other styles
46

Al-Msie'deen, R., M. Huchard, A. D. Seriai, C. Urtado, and S. Vauttier. "Automatic Documentation of [Mined] Feature Implementations from Source Code Elements and Use-Case Diagrams with the REVPLINE Approach." International Journal of Software Engineering and Knowledge Engineering 24, no. 10 (December 2014): 1413–38. http://dx.doi.org/10.1142/s0218194014400142.

Full text
Abstract:
Companies often develop a set of software variants that share some features and differ in others to meet specific requirements. To exploit the existing software variants as a Software Product Line (SPL), a Feature Model of this SPL must be built as a first step. To do so, it is necessary to define and document the optional and mandatory features that compose the variants. In our previous work, we mined a set of feature implementations as identified sets of source code elements. In this paper, we propose a complementary approach, which aims to document the mined feature implementations by giving them names and descriptions, based on the source code elements that form feature implementations and the use-case diagrams that specify software variants. The novelty of our approach is its use of commonality and variability across software variants, at feature implementation and use-case levels, to run Information Retrieval methods in an efficient way. Experiments on several real case studies (Mobile media and ArgoUML-SPL) validate our approach and show promising results.
APA, Harvard, Vancouver, ISO, and other styles
47

Psaila, Giuseppe, and Paolo Fosci. "J-CO: A Platform-Independent Framework for Managing Geo-Referenced JSON Data Sets." Electronics 10, no. 5 (March 7, 2021): 621. http://dx.doi.org/10.3390/electronics10050621.

Full text
Abstract:
Internet technology and mobile technology have enabled producing and diffusing massive data sets concerning almost every aspect of day-by-day life. Remarkable examples are social media and apps for volunteered information production, as well as Open Data portals on which public administrations publish authoritative and (often) geo-referenced data sets. In this context, JSON has become the most popular standard for representing and exchanging possibly geo-referenced data sets over the Internet. Analysts wishing to manage, integrate and cross-analyze such data sets need a framework that allows them to access possibly remote storage systems for JSON data sets, and to retrieve and query data sets by means of a unique query language (independent of the specific storage technology) while exploiting possibly remote computational resources (such as cloud servers), comfortably working on their PC in their office, more or less unaware of the real location of resources. In this paper, we present the current state of the J-CO Framework, a platform-independent and analyst-oriented software framework to manipulate and cross-analyze possibly geo-tagged JSON data sets. The paper presents the general approach behind the J-CO Framework by illustrating the query language through a simple, yet non-trivial, example of geographical cross-analysis. The paper also presents the novel features introduced by the re-engineered version of the execution engine and the most recent components, i.e., the storage service for large single JSON documents and the user interface that allows analysts to comfortably share data sets and computational resources with other analysts possibly working in different places around the globe. Finally, the paper reports the results of an experimental campaign, which show that the execution engine performs in a more than satisfactory way, proving that our framework can actually be used by analysts to process JSON data sets.
APA, Harvard, Vancouver, ISO, and other styles
48

LEE, CHANG WOO, HYUN KANG, HANG JOON KIM, and KEECHUL JUNG. "FONT CLASSIFICATION USING NMF WITH HIERARCHICAL CLUSTERING." International Journal of Pattern Recognition and Artificial Intelligence 19, no. 06 (September 2005): 755–73. http://dx.doi.org/10.1142/s0218001405004307.

Full text
Abstract:
The current paper proposes a font classification method for document images that uses non-negative matrix factorization (NMF), which is able to learn part-based representations of objects. The basic idea of the proposed method is that the characteristics of each font are derived from parts of individual characters in that font rather than from holistic textures. Spatial localities, i.e., the parts composing font images, are automatically extracted using NMF and then used as features representing each font. Using a hierarchical clustering algorithm, these feature sets are generalized for font classification, resulting in the construction of prototype templates. In both the prototype construction and font classification, earth mover's distance (EMD) is used as the distance metric, which is more suitable for the NMF feature space than cosine or Euclidean distance. In the experimental results, the distribution of features and the appropriateness of the features specifying each font are investigated, and the results are compared with a related algorithm: principal component analysis (PCA). The proposed method is expected to improve the performance of optical character recognition (OCR), document indexing and retrieval systems when such systems adopt a font classifier as a preprocessor.
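To give a feel for the NMF-plus-clustering pipeline, here is a minimal sketch that factorises synthetic glyph patches into additive parts and clusters the resulting activations hierarchically. The patches are invented stand-ins for scanned characters, plain Euclidean distance is used instead of the paper's EMD, and none of the parameters come from the paper.

    import numpy as np
    from sklearn.decomposition import NMF
    from sklearn.cluster import AgglomerativeClustering

    # Pretend data: 40 flattened 8x8 glyph patches from two synthetic "fonts"
    # with different stroke intensities (real input would be character images).
    rng = np.random.default_rng(0)
    thin = np.clip(rng.normal(0.2, 0.05, (20, 64)), 0, 1)
    bold = np.clip(rng.normal(0.8, 0.05, (20, 64)), 0, 1)
    patches = np.vstack([thin, bold])

    # NMF factorises the non-negative patch matrix into additive parts
    # (nmf.components_) and per-patch activations (W), used here as font features.
    nmf = NMF(n_components=6, init="nndsvda", max_iter=500, random_state=0)
    W = nmf.fit_transform(patches)

    # Hierarchical clustering on the NMF activations (Euclidean, not EMD).
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(W)
    print(labels)   # the two synthetic "fonts" should fall into separate clusters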
APA, Harvard, Vancouver, ISO, and other styles
49

Jaton, Florian. "We get the algorithms of our ground truths: Designing referential databases in digital image processing." Social Studies of Science 47, no. 6 (September 26, 2017): 811–40. http://dx.doi.org/10.1177/0306312717730428.

Full text
Abstract:
This article documents the practical efforts of a group of scientists designing an image-processing algorithm for saliency detection. By following the actors of this computer science project, the article shows that the problems often considered to be the starting points of computational models are in fact provisional results of time-consuming, collective and highly material processes that engage habits, desires, skills and values. In the project being studied, problematization processes lead to the constitution of referential databases called ‘ground truths’ that enable both the effective shaping of algorithms and the evaluation of their performances. Working as important common touchstones for research communities in image processing, the ground truths are inherited from prior problematization processes and may be imparted to subsequent ones. The ethnographic results of this study suggest two complementary analytical perspectives on algorithms: (1) an ‘axiomatic’ perspective that understands algorithms as sets of instructions designed to solve given problems computationally in the best possible way, and (2) a ‘problem-oriented’ perspective that understands algorithms as sets of instructions designed to computationally retrieve outputs designed and designated during specific problematization processes. If the axiomatic perspective on algorithms puts the emphasis on the numerical transformations of inputs into outputs, the problem-oriented perspective puts the emphasis on the definition of both inputs and outputs.
APA, Harvard, Vancouver, ISO, and other styles
50

Nordquist, B., J. Fischer, S. Y. Kim, S. M. Stover, T. Garcia-Nolen, K. Hayashi, J. Liu, and A. S. Kapatkin. "Effects of trial repetition, limb side, intraday and inter-week variation on vertical and craniocaudal ground reaction forces in clinically normal Labrador Retrievers." Veterinary and Comparative Orthopaedics and Traumatology 24, no. 06 (2011): 435–44. http://dx.doi.org/10.3415/vcot-11-01-0015.

Full text
Abstract:
Summary. Objectives: To document the contributions of trial repetition, limb side, and intraday and inter-week measurements to variation in vertical and craniocaudal ground reaction force data. Methods: Following habituation, force and time data were collected for all four limbs of seven Labrador Retrievers during sets of five valid trot trials. Each set was performed twice daily (morning and afternoon), every seven days for three consecutive weeks. A repeated measures analysis of variance was used to determine the effects of limb, trial, intraday, and inter-week factors on ground reaction force data for the thoracic and pelvic limbs. Results: Of the four factors evaluated, variation due to trial repetition had the largest magnitude of effect on ground reaction forces. Trial within a set of data had an effect on all craniocaudal, but not vertical, ground reaction force variables studied for the thoracic limbs. The first of five trials was often different from later trials. Some thoracic limb and pelvic limb variables were different between weeks. A limb side difference was only apparent for pelvic limb vertical ground reaction force data. Only pelvic limb craniocaudal braking variables were different between sets within a day. Discussion and clinical significance: When controlling for speed, handler, gait, weight and dog breed, variation in ground reaction forces arises mainly from trial repetition and inter-week data collection. When using vertical peak force and impulse to evaluate treatment, trial repetition and inter-week data collection should have minimal effect on the data.
APA, Harvard, Vancouver, ISO, and other styles