Journal articles: 'Web document clustering (WDC)'

1

Im, Yeong-Hui. "A Post Web Document Clustering Algorithm." KIPS Transactions:PartB 9B, no. 1 (February 1, 2002): 7–16. http://dx.doi.org/10.3745/kipstb.2002.9b.1.007.


2

He, Xiaofeng, Hongyuan Zha, Chris H.Q. Ding, and Horst D. Simon. "Web document clustering using hyperlink structures." Computational Statistics & Data Analysis 41, no. 1 (November 2002): 19–45. http://dx.doi.org/10.1016/s0167-9473(02)00070-1.


3

Hammouda, K. M., and M. S. Kamel. "Efficient phrase-based document indexing for Web document clustering." IEEE Transactions on Knowledge and Data Engineering 16, no. 10 (October 2004): 1279–96. http://dx.doi.org/10.1109/tkde.2004.58.


4

Manukonda, Sumathi Rani, and Nomula Divya. "Efficient Document Clustering for Web Search Result." International Journal of Engineering & Technology 7, no. 3.3 (June 21, 2018): 90. http://dx.doi.org/10.14419/ijet.v7i3.3.14494.

Abstract:

Document clustering is a traditional data mining approach in which closely related documents are grouped together. It contributes to the accuracy of information retrieval systems that identify the nearest neighbors of a document. A massive quantity of data is generated every day and must be clustered. Although different clustering methods have been introduced to improve cluster quality, many challenges remain for the improvement of document clustering. For web search, documents are arranged in groups for efficient result retrieval, so that users can explore the results of a search query in an organized way. Hierarchical clustering is one way to achieve document clustering. Most clustering algorithms do not take a semantic approach, which results in unsatisfactory clusters. The automatic organization of web documents by services such as Google and Yahoo is often considered a reference. A distinct method identifies the existing groups of similar items in previously organized documents and derives an effective document classifier for new documents. This paper concentrates on hierarchical clustering and k-means algorithms, argues that k-means and its variants are more efficient than hierarchical clustering, and considers an implementation of the greedy fast k-means algorithm (GFA) for clustering documents efficiently.

5

Creţulescu, Radu G., Daniel I. Morariu, Macarie Breazu, and Daniel Volovici. "DBSCAN Algorithm for Document Clustering." International Journal of Advanced Statistics and IT&C for Economics and Life Sciences 9, no. 1 (June 1, 2019): 58–66. http://dx.doi.org/10.2478/ijasitels-2019-0007.

Abstract:

Document clustering is the problem of automatically grouping similar documents into categories based on some similarity metric. Almost all available data, usually on the web, are unclassified, so we need powerful clustering algorithms that work with these types of data. All common search engines return a list of pages relevant to the user query. This list needs to be generated quickly and as correctly as possible. For this type of problem, because the web pages are unclassified, we need powerful clustering algorithms. In this paper we present a clustering algorithm called DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and its limitations on document (or web page) clustering. Documents are represented using the “bag-of-words” representation (word occurrence frequency). For this type of representation, many algorithms usually fail. In this paper we use Information Gain as the feature selection method and evaluate the DBSCAN algorithm by its capacity to integrate into the clusters all the samples from the dataset.
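
Illustrative note (not from the cited paper): the sketch below shows one minimal way DBSCAN could be run over bag-of-words count vectors. The cosine distance, the toy six-term vocabulary, and the values `eps=0.3` and `min_pts=2` are all assumptions chosen for the example; the paper's Information Gain feature selection is not reproduced. Documents labelled `-1` are noise, that is, samples DBSCAN failed to integrate into any cluster.

```python
from math import sqrt

NOISE = -1


def cosine_distance(a, b):
    """1 - cosine similarity of two term-count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def dbscan(points, eps, min_pts, dist=cosine_distance):
    """Return one cluster label per point; NOISE marks unclustered points.

    min_pts counts the point itself, so min_pts=2 means "at least one
    other point within eps".
    """
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_nbrs)
        cluster += 1
    return labels


# Toy counts over a hypothetical vocabulary:
# [cluster, web, search, football, goal, recipe]
docs = [
    [2, 1, 0, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [2, 2, 1, 0, 0, 0],
    [0, 0, 0, 3, 1, 0],
    [0, 0, 0, 2, 2, 0],
    [0, 0, 0, 0, 0, 4],
]
labels = dbscan(docs, eps=0.3, min_pts=2)
noise_count = labels.count(NOISE)
print(labels, noise_count)
```

Counting the `-1` labels gives a simple proxy for the evaluation criterion mentioned in the abstract: the fewer noise documents, the more of the dataset was absorbed into clusters.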

6

Huang, Shen, Zheng Chen, Yong Yu, and Wei-Ying Ma. "Multitype features coselection for Web document clustering." IEEE Transactions on Knowledge and Data Engineering 18, no. 4 (April 2006): 448–59. http://dx.doi.org/10.1109/tkde.2006.1599384.


7

Chan, Samuel W. K., and Mickey W. C. Chong. "Unsupervised clustering for nontextual web document classification." Decision Support Systems 37, no. 3 (June 2004): 377–96. http://dx.doi.org/10.1016/s0167-9236(03)00035-6.


8

Boley, Daniel, Maria Gini, Robert Gross, Eui-Hong (Sam) Han, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher, and Jerome Moore. "Partitioning-based clustering for Web document categorization." Decision Support Systems 27, no. 3 (December 1999): 329–41. http://dx.doi.org/10.1016/s0167-9236(99)00055-x.


9

Su, Zhong, Qiang Yang, Hongjiang Zhang, Xiaowei Xu, Yu-Hen Hu, and Shaoping Ma. "Correlation-Based Web Document Clustering for Adaptive Web Interface Design." Knowledge and Information Systems 4, no. 2 (April 2002): 151–67. http://dx.doi.org/10.1007/s101150200002.


10

Chawla, Suruchi. "Application of Convolution Neural Networks in Web Search Log Mining for Effective Web Document Clustering." International Journal of Information Retrieval Research 12, no. 1 (January 2022): 1–14. http://dx.doi.org/10.4018/ijirr.300367.

Abstract:

The volume of web search data stored in search engine logs is increasing and has become big search log data. The web search log has been a source of data for mining based on web document clustering techniques to improve the efficiency and effectiveness of information retrieval. In this paper, the deep learning model Convolution Neural Network (CNN) is used in big web search log data mining to learn the semantic representation of a document. These semantic document vectors are clustered using K-means to group relevant documents for effective web document clustering. An experiment was performed on a data set of web search queries and associated clicked URLs to measure the quality of clusters based on the document semantic representation learned by the CNN. The cluster analysis was based on WCSS (the sum of squared distances of document samples to their closest cluster center), and the decrease in WCSS in comparison with TF.IDF keyword-based clusters confirms the effectiveness of the CNN in web search log mining for effective web document clustering.
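
Illustrative note (not from the cited paper): WCSS as defined in the abstract can be computed directly from document vectors and cluster centers. The sketch below is a minimal version under that definition; the function names and the toy two-dimensional vectors are assumptions made for the example, and no CNN or TF.IDF representation is involved.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def wcss(vectors, centers):
    """Sum, over all vectors, of the squared distance to the closest center."""
    return sum(min(squared_distance(v, c) for c in centers) for v in vectors)


# Four toy "document vectors" forming two tight groups.
vectors = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

two_centers = [(0.0, 1.0), (10.0, 1.0)]   # one center per group
one_center = [(5.0, 1.0)]                 # a single global center

wcss_two = wcss(vectors, two_centers)
wcss_one = wcss(vectors, one_center)
print(wcss_two, wcss_one)
```

A lower WCSS for the same number of clusters indicates tighter clusters. Note that WCSS values are only directly comparable when the vectors live on comparable scales, which is worth keeping in mind when comparing two different document representations.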

11

Sung, Ki-Youn, and Bo-Hyun Yun. "Topic based Web Document Clustering using Named Entities." Journal of the Korea Contents Association 10, no. 5 (May 28, 2010): 29–36. http://dx.doi.org/10.5392/jkca.2010.10.5.029.


12

He, Y., S. C. Hui, and A. C. M. Fong. "Mining a web citation database for document clustering." Applied Artificial Intelligence 16, no. 4 (April 2002): 283–302. http://dx.doi.org/10.1080/08839510252906462.


13

Khan, M. Shamim, and Sebastian W. Khor. "Web document clustering using a hybrid neural network." Applied Soft Computing 4, no. 4 (September 2004): 423–32. http://dx.doi.org/10.1016/j.asoc.2004.02.003.


14

Fersini, E., E. Messina, and F. Archetti. "A probabilistic relational approach for web document clustering." Information Processing & Management 46, no. 2 (March 2010): 117–30. http://dx.doi.org/10.1016/j.ipm.2009.08.003.


15

Kaneko, Masaya, Shusuke Okamoto, Masaki Kohana, and You Inayoshi. "Document clustering based on web search hit counts." International Journal of Business Intelligence and Data Mining 8, no. 1 (2013): 61. http://dx.doi.org/10.1504/ijbidm.2013.055787.


16

Takale, Sheetal A., Prakash J. Kulkarni, and Sahil K. Shah. "An Intelligent Web Search Using Multi-Document Summarization." International Journal of Information Retrieval Research 6, no. 2 (April 2016): 41–65. http://dx.doi.org/10.4018/ijirr.2016040103.

Abstract:

Information available on the internet is huge, diverse and dynamic. Current search engines act as intelligent assistants to internet users. For a query, a search engine provides a listing of the best matching or relevant web pages. However, information for the query is often spread across multiple pages returned by the search engine. This degrades the quality of search results. So, search engines are drowning in information but starving for knowledge. Here, we present a query-focused extractive summarization of search engine results. We propose a two-level summarization process: identification of relevant theme clusters, and selection of top-ranking sentences to form a summarized result for the user query. A new approach to semantic similarity computation using semantic roles and semantic meaning is proposed. Document clustering is effectively achieved by application of the MDL principle, and sentence clustering and ranking are done using SNMF. The experiments conducted demonstrate the effectiveness of the system in semantic text understanding, document clustering and summarization.

17

Li, Zhao, and Xindong Wu. "A Phrase-Based Method for Hierarchical Clustering of Web Snippets." Proceedings of the AAAI Conference on Artificial Intelligence 24, no. 1 (July 5, 2010): 1947–48. http://dx.doi.org/10.1609/aaai.v24i1.7773.

Abstract:

Document clustering has been applied in web information retrieval, where it facilitates users’ quick browsing by organizing retrieved results into different groups. Meanwhile, a tree-like hierarchical structure is well suited for organizing the retrieved results for web users. In this regard, we introduce a new method for hierarchical clustering of web snippets by exploiting a phrase-based document index. In our method, a hierarchy of web snippets is built based on phrases instead of all snippets, and the snippets are then assigned to the corresponding clusters consisting of phrases. We show that, as opposed to traditional hierarchical clustering, our method not only presents meaningful cluster labels but also improves clustering performance.

18

Jinarat, Supakpong, Choochart Haruechaiyasak, and Arnon Rungsawang. "Graph-Based Concept Clustering for Web Search Results." International Journal of Electrical and Computer Engineering (IJECE) 5, no. 6 (December 1, 2015): 1536. http://dx.doi.org/10.11591/ijece.v5i6.pp1536-1544.

Abstract:

A search engine usually returns a long list of web search results corresponding to a query from the user. Users must spend a lot of time browsing and navigating the search results to find the relevant ones. Many research works have applied text clustering techniques, called web search results clustering, to handle the problem. Unfortunately, a search result document returned from a search engine is a very short text. It is difficult to cluster related documents into the same group because a short document has low informative content. In this paper, we propose a method to cluster web search results with high clustering quality using graph-based clustering with concepts extracted from an external knowledge source. The main idea is to expand the original search results with some related concept terms. We used Wikipedia as the external knowledge source for concept extraction. We compared the clustering results of our proposed method with two well-known search results clustering techniques, Suffix Tree Clustering and Lingo. The experimental results showed that our proposed method significantly outperforms these well-known clustering techniques.

19

Avanija, J., and K. Ramar. "Semantic Clustering of Web Documents." International Journal of Information Technology and Web Engineering 7, no. 4 (October 2012): 20–33. http://dx.doi.org/10.4018/jitwe.2012100102.

Abstract:

With the massive growth and large volume of the web, it is very difficult to retrieve results based on user preferences. The next-generation web architecture, the semantic web, reduces the burden on the user by performing search based on semantics instead of keywords. Even in the context of semantic technologies, the optimization problem occurs but is rarely considered. In this paper document clustering is applied to retrieve relevant documents. The authors propose an ontology-based clustering algorithm using a semantic similarity measure and Particle Swarm Optimization (PSO), which is applied to the annotated documents to optimize the result. The proposed method uses the Jena API and the GATE tool API, and the documents can be retrieved based on their annotation features and relations. A preliminary experiment comparing the proposed method with K-Means shows that the proposed method is feasible and performs better than K-Means.

20

Subhashini, R., and V. Jawahar Senthil Kumar. "A Roadmap to Integrate Document Clustering in Information Retrieval." International Journal of Information Retrieval Research 1, no. 1 (January 2011): 31–44. http://dx.doi.org/10.4018/ijirr.2011010103.

Abstract:

The World Wide Web is a large distributed digital information space. The ability to search and retrieve information from the Web efficiently and effectively is an enabling technology for realizing its full potential. Information Retrieval (IR) plays an important role in search engines. Today’s most advanced engines use the keyword-based (“bag of words”) paradigm, which has inherent disadvantages. Organizing web search results into clusters facilitates the user’s quick browsing of search results. Traditional clustering techniques are inadequate because they do not generate clusters with highly readable names. This paper proposes an approach to web search results clustering based on a phrase-based clustering algorithm. It is an alternative to the single ordered result list of search engines. This approach presents a list of clusters to the user. Experimental results verify the method’s feasibility and effectiveness.

21

Sawalkar, Abhishek, Mohit Mandlecha, Dnyanesh Kulkarni, and Dr Ratnamala S. Paswan. "Comparing the Performance of SOM with Traditional Methods for Document Clustering Using Wordnet Ontologies." International Journal for Research in Applied Science and Engineering Technology 10, no. 4 (April 30, 2022): 1512–18. http://dx.doi.org/10.22214/ijraset.2022.41554.

Abstract:

Retrieving useful information has become challenging due to the rapid expansion of web material. To improve the retrieval outcomes, efficient clustering methods are required. Document clustering is the process of identifying similarities and differences among given objects and grouping them into clusters with comparable features. In this article, we use the WordNet lexical database as an addition when comparing several document clustering techniques. The suggested method employs WordNet to determine the relevance of the concepts in the text, and then clusters the content using several document clustering algorithms (K-means, Agglomerative Clustering, and self-organizing maps). We wish to compare alternative ways of making document clustering algorithms more successful. Keywords: Document clustering, Clustering technique, Self-organizing maps, WordNet, K-means, Hierarchical Clustering.

22

Subhashini, R., and V. Jawahar Senthil Kumar. "A Novel Document Clustering for Organizing the Web Pages." International Journal on Information Sciences and Computing 4, no. 2 (2010): 49–54. http://dx.doi.org/10.18000/ijisac.50079.


23

Carullo, Moreno, Elisabetta Binaghi, and Ignazio Gallo. "An online document clustering technique for short web contents." Pattern Recognition Letters 30, no. 10 (July 2009): 870–76. http://dx.doi.org/10.1016/j.patrec.2009.04.001.


24

Hammouda, Khaled, and Mohamed Kamel. "Distributed collaborative Web document clustering using cluster keyphrase summaries." Information Fusion 9, no. 4 (October 2008): 465–80. http://dx.doi.org/10.1016/j.inffus.2006.12.001.


25

Tarczynski, Tomasz. "Document Clustering - Concepts, Metrics and Algorithms." International Journal of Electronics and Telecommunications 57, no. 3 (September 1, 2011): 271–77. http://dx.doi.org/10.2478/v10177-011-0036-5.

Abstract:

Document clustering, which is also referred to as text clustering, is a technique of unsupervised document organisation. Text clustering is used to group documents into subsets that consist of texts that are similar to each other. These subsets are called clusters. Document clustering algorithms are widely used in web search engines to produce results relevant to a query. An example of the practical use of those techniques is the Yahoo! hierarchy of documents [1]. Another application of document clustering is browsing, which is defined as a search session without a well-specified goal. Browsing techniques rely heavily on document clustering. In this article we examine the most important concepts related to document clustering. Besides the algorithms, we present a comprehensive discussion of the representation of documents, the calculation of similarity between documents, and the evaluation of cluster quality.

26

Obidallah, Waeal J., Bijan Raahemi, and Waleed Rashideh. "Multi-Layer Web Services Discovery Using Word Embedding and Clustering Techniques." Data 7, no. 5 (May 4, 2022): 57. http://dx.doi.org/10.3390/data7050057.

Abstract:

We propose a multi-layer data mining architecture for web services discovery using word embedding and clustering techniques to improve the web service discovery process. The proposed architecture consists of five layers: web services description and data preprocessing; word embedding and representation; syntactic similarity; semantic similarity; and clustering. In the first layer, we identify the steps to parse and preprocess the web services documents. In the second layer, Bag of Words with Term Frequency–Inverse Document Frequency and three word-embedding models are employed for web services representation. In the third layer, four distance measures, namely, Cosine, Euclidean, Minkowski, and Word Mover, are considered to find the similarities between Web services documents. In layer four, WordNet and Normalized Google Distance are employed to represent and find the similarity between web services documents. Finally, in the fifth layer, three clustering algorithms, namely, affinity propagation, K-means, and hierarchical agglomerative clustering, are investigated for clustering of web services based on observed similarities in documents. We demonstrate how each component of the five layers is employed in web services clustering using randomly selected web services documents. We conduct experimental analysis to cluster web services using a collected dataset consisting of web services documents and evaluate their clustering performances. Using a ground truth for evaluation purposes, we observe that clusters built based on the word embedding models performed better than those built using the Bag of Words with Term Frequency–Inverse Document Frequency model. Among the three word embedding models, the pre-trained Word2Vec’s skip-gram model reported higher performance in clustering web services. Among the three semantic similarity measures, path-based WordNet similarity reported higher clustering performance. 
By considering the different word representations models and syntactic and semantic similarity measures, we found that the affinity propagation clustering technique performed better in discovering similarities among Web services.
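
Illustrative note (not from the cited paper): three of the four distance measures named in the abstract have short closed forms, sketched below for plain numeric vectors. The function names and sample vectors are assumptions made for the example. Word Mover's Distance is omitted because it needs word embeddings and an optimal-transport solver, which do not fit in a short self-contained sketch.

```python
from math import sqrt


def minkowski(a, b, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def euclidean(a, b):
    """Euclidean distance, the p=2 special case of Minkowski."""
    return minkowski(a, b, 2)


def cosine_distance(a, b):
    """1 - cosine similarity; depends on direction, not vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


u, v = (0.0, 0.0), (3.0, 4.0)
d_euclid = euclidean(u, v)          # straight-line distance
d_manhattan = minkowski(u, v, 1)    # sum of absolute differences

# Same direction, different length: cosine distance is (near) zero
# while Euclidean distance is not.
short_doc, long_doc = (1.0, 2.0), (10.0, 20.0)
d_cos = cosine_distance(short_doc, long_doc)
print(d_euclid, d_manhattan, d_cos)
```

The last comparison hints at why cosine distance is a common choice for term-frequency vectors: two documents with the same term proportions but different lengths are treated as identical.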

27

Chawla, Suruchi. "Application of Fuzzy C-Means Clustering and Semantic Ontology in Web Query Session Mining for Intelligent Information Retrieval." International Journal of Fuzzy System Applications 10, no. 1 (January 2021): 1–19. http://dx.doi.org/10.4018/ijfsa.2021010101.

Abstract:

Information retrieval based on keyword search retrieves irrelevant documents because of the vocabulary gap between document content and search queries. The keyword vector representation of web documents is very high dimensional, and keyword terms are unable to capture the semantics of document content. Ontologies have been built in various domains for representing the semantics of documents based on concepts relevant to the document subject. Web documents often contain multiple topics; therefore, fuzzy c-means document clustering has been used for discovering clusters with overlapping boundaries. In this paper, a method is proposed for intelligent information retrieval using a hybrid of fuzzy c-means clustering and ontology in query session mining. The use of fuzzy clusters of web query session concept vectors improves the quality of clusters for effective web search. The proposed method was evaluated experimentally, and the results show an improvement in the precision of search results.

28

Nishina, Tomoya, and Akira Utsumi. "Web Document Clustering Based on the Clusters of Topic Words." Journal of Natural Language Processing 17, no. 4 (2010): 23–41. http://dx.doi.org/10.5715/jnlp.17.4_23.


29

Krishnaraj, Dr N., Dr P. Kumar, and Sri K. Bhagavan. "Conceptual Semantic Model for Web Document Clustering Using Term Frequency." EAI Endorsed Transactions on Energy Web 5, no. 20 (September 12, 2018): 155744. http://dx.doi.org/10.4108/eai.12-9-2018.155744.


30

Srikanth, D., and S. Sakthivel. "Time and Space Efficient Web Document Clustering Using Rayleigh Distribution." Wireless Personal Communications 102, no. 4 (January 31, 2018): 3255–68. http://dx.doi.org/10.1007/s11277-018-5366-5.


31

Zhao, Ying, Ya Jun Du, and Qiang Qiang Peng. "Clustering Chinese Web Search Results Based on Association Calculation." Applied Mechanics and Materials 55-57 (May 2011): 1418–23. http://dx.doi.org/10.4028/www.scientific.net/amm.55-57.1418.

Abstract:

Clustering web search results is a solution that helps users find topics of interest by grouping the search results. This paper presents an improved method for clustering search results focused on Chinese web pages. The main contributions of this paper are the following. First, a method is presented that identifies phrases carrying complete semantic information by comparing the attributes of base clusters in the suffix tree document model and the overlap of their document sets. Second, by analyzing the content and structure of the titles and snippets of Chinese web search results, a sentence segmentation method is designed and implemented for constructing the suffix tree. Third, to better reflect the degree of association between terms, a novel method is proposed that computes the distance between terms' co-occurrences at sentence granularity. Finally, the experiment illustrates that the new clustering method provides an efficient and effective way for users to browse and locate the information they seek.

32

Fadllullah, Arif, Dasrit Debora Kamudi, Muhamad Nasir, Agus Zainal Arifin, and Diana Purwitasari. "WEB NEWS DOCUMENTS CLUSTERING IN INDONESIAN LANGUAGE USING SINGULAR VALUE DECOMPOSITION-PRINCIPAL COMPONENT ANALYSIS (SVDPCA) AND ANT ALGORITHMS." Jurnal Ilmu Komputer dan Informasi 9, no. 1 (February 15, 2016): 17. http://dx.doi.org/10.21609/jiki.v9i1.362.

Abstract:

Ant-based document clustering is a clustering method that measures text document similarity based on the shortest path between nodes (trial phase) and determines the optimal clusters from the sequence of document similarities (dividing phase). The processing time of the trial phase of Ant algorithms to make document vectors is very long because of the high-dimensional Document-Term Matrix (DTM). In this paper, we propose a document clustering method that optimizes dimension reduction using Singular Value Decomposition-Principal Component Analysis (SVDPCA) and Ant algorithms. SVDPCA reduces the size of the DTM dimensions by converting the freq-term of the conventional DTM to the score-pc of a Document-PC Matrix (DPCM). Ant algorithms create document clusters using the vector space model based on the dimension reduction result of the DPCM. The experimental results on 506 news documents in the Indonesian language demonstrated that the proposed method worked well, optimizing dimension reduction up to 99.7%. We could efficiently speed up the execution time of the trial phase while maintaining the best F-measure achieved in the experiments, 0.88 (88%).

33

TSEKOURAS, GEORGE E., and DAMIANOS GAVALAS. "AN EFFECTIVE FUZZY CLUSTERING ALGORITHM FOR WEB DOCUMENT CLASSIFICATION: A CASE STUDY IN CULTURAL CONTENT MINING." International Journal of Software Engineering and Knowledge Engineering 23, no. 06 (August 2013): 869–86. http://dx.doi.org/10.1142/s021819401350023x.

Abstract:

This article presents a novel crawling and clustering method for extracting and processing cultural data from the web in a fully automated fashion. Our architecture relies upon a focused web crawler to download web documents relevant to culture. The focused crawler is a web crawler that searches and processes only those web pages that are relevant to a particular topic. After downloading the pages, we extract from each document a number of words for each thematic cultural area, filtering the documents with non-cultural content; we then create multidimensional document vectors comprising the most frequent cultural term occurrences. We calculate the dissimilarity between the cultural-related document vectors and for each cultural theme, we use cluster analysis to partition the documents into a number of clusters. Our approach is validated via a proof-of-concept application which analyzes hundreds of web pages spanning different cultural thematic areas.

34

Al-Mofareji, Hanan, Mahmoud Kamel, and Mohamed Y. Dahab. "WeDoCWT: A New Method for Web Document Clustering Using Discrete Wavelet Transforms." Journal of Information & Knowledge Management 16, no. 01 (March 2017): 1750004. http://dx.doi.org/10.1142/s0219649217500046.

Abstract:

Organizing web information is an important aspect of finding information in the easiest and most efficient way. We present a new method for web document clustering called WeDoCWT, which exploits the discrete wavelet transform and term signal, to improve the document representation. We studied different methods for document segmentation to construct the term signals. We used two datasets, UW-CAN and WebKB, to evaluate the proposed method. The experimental results indicated that dividing the documents into fixed segments is preferable to dividing them into logical segments based on HTML features because the web pages do not have the same structure. Mean TF–IDF reduction technique gives the best results in most cases. WeDoCWT gives [Formula: see text]-measure better than most of the previous approaches described in the literature. We used Munkres assignment algorithm to assign each produced cluster to the original class in order to evaluate the clustering results.

35

Li, Gui, Cheng Chen, Zheng Yu Li, Zi Yang Han, and Ping Sun. "Web Data Extraction Based on Tag Path Clustering." Advanced Materials Research 756-759 (September 2013): 1590–94. http://dx.doi.org/10.4028/www.scientific.net/amr.756-759.1590.

Abstract:

Fully automatic methods that extract structured data from the Web have been studied extensively. The existing methods suffice for simple extraction, but they often fail to handle more complicated Web pages. This paper introduces a method based on tag path clustering to extract structured data. The method obtains the complete tag path collection by parsing the DOM tree of the Web document. Clustering of tag paths is performed based on an introduced similarity measure so that the data area can be targeted; then, taking advantage of tag position features, records are separated and filtered, which completes the data extraction. Experiments show this method achieves higher accuracy than previous methods.

36

Reka, M., and N. Shanthi. "An Efficient Multi-Dimensional Level based Semantic Relational Depthness Clustering for Enhancing Web Document Clustering." Asian Journal of Research in Social Sciences and Humanities 6, cs1 (2016): 343. http://dx.doi.org/10.5958/2249-7315.2016.00968.0.


37

VarmaPamba, Raja, and Elizabeth Sherly. "EEWDCO: The Efficient way of Enhancing Web Document Clustering using Ontologies." International Journal of Computer Applications 86, no. 3 (January 16, 2014): 23–25. http://dx.doi.org/10.5120/14966-3144.


38

Ko, Suc-Bum, and Sung-Dae Youn. "A performance improvement methodology of web document clustering using FDC-TCT." KIPS Transactions:PartD 12D, no. 4 (August 1, 2005): 637–46. http://dx.doi.org/10.3745/kipstd.2005.12d.4.637.


39

Sinka, Mark P., and David W. Corne. "The BankSearch web document dataset: investigating unsupervised clustering and category similarity." Journal of Network and Computer Applications 28, no. 2 (April 2005): 129–46. http://dx.doi.org/10.1016/j.jnca.2004.01.002.


40

Lee, Ingyu, and Byung-Won On. "An effective web document clustering algorithm based on bisection and merge." Artificial Intelligence Review 36, no. 1 (January 18, 2011): 69–85. http://dx.doi.org/10.1007/s10462-011-9203-4.


41

Xu, Shuting, and Jun Zhang. "A Parallel Hybrid Web Document Clustering Algorithm and its Performance Study." Journal of Supercomputing 30, no. 2 (November 2004): 117–31. http://dx.doi.org/10.1023/b:supe.0000040611.25862.d9.


42

Li, Peng, Bin Wang, and Wei Jin. "Improving Web Document Clustering through Employing User-Related Tag Expansion Techniques." Journal of Computer Science and Technology 27, no. 3 (January 2012): 554–66. http://dx.doi.org/10.1007/s11390-012-1243-y.


43

Luzon, Christine, Luisito Lacatan, Harold Bangalisan, and Jayvee Osapdin. "Web-Based File Clustering and Indexing for Mindoro State University." International Journal of Computing Sciences Research 6 (January 31, 2022): 951–61. http://dx.doi.org/10.25147/ijcsr.2017.001.1.82.

Abstract:

Purpose – The Web-Based File Clustering and Indexing for Mindoro State University aims to organize data circulated over the Web into groups/collections to facilitate data availability and access, and at the same time meet user preferences. The main benefits include increasing Web information accessibility, understanding users’ navigation behavior, and improving information retrieval and content delivery on the Web. Web-based file clustering could help users reach the documents they are searching for. Method – In this paper, a novel approach is introduced for search results clustering that is based on the semantics of the retrieved documents rather than the syntax of the terms in those documents. Data clustering was used to improve information retrieval from the collection of documents. Data were processed and analyzed using SPSS (version 18), where the instrument was evaluated to test the reliability and validity of the measures used. Evaluation was based on a Likert scale of Excellent, Good, Fair, and Poor as described for the selected quality characteristics. Results – A total of 200 questionnaires were distributed, with a return rate of 100%. The questionnaire scored 0.735 on Cronbach’s alpha coefficient and was considered a reliable instrument. Four quality characteristics were evaluated in this study: Usability, Performance Efficiency, Reliability, and Functionality Suitability. Conclusion – Web-based file clustering could help users reach the documents they are searching for. The need for an information retrieval mechanism can only be supported if the document collection is organized into a meaningful structure, which allows part or all of the document collection to be browsed at each stage of a search. Recommendations – It is recommended that, when a file is uploaded, the system show the use of the file and where it originated (department). It is also recommended to create an index that clusters files not only by type but also by content and use, and to extend the clustering to a wider scope. Practical Implications – Document clustering provides a structure for organizing large bodies of text for efficient browsing and searching, and it greatly helps Mindoro State University with records/document processing. Indexing is the best tool to maintain uniqueness of records in a database. Whenever new files or records are created, they can easily be added to the index. This makes it easy to keep documents up to date at all times. Grouping documents into two or more categories improves search time and makes life easier for everyone.
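
The reliability figure in the abstract above is a Cronbach's alpha coefficient, which for k questionnaire items is k/(k-1) multiplied by (1 minus the sum of the item variances divided by the variance of the respondents' total scores). The following is a minimal sketch of that standard formula, assuming Likert answers are coded numerically (for example Poor = 1 through Excellent = 4, a coding the abstract does not state). The function name `cronbach_alpha` and the toy responses are illustrative and are not taken from the study.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    Uses sample variances (n - 1 denominator) for both the items and the totals.
    """
    if len(responses) < 2:
        raise ValueError("need at least two respondents")
    k = len(responses[0])
    if k < 2:
        raise ValueError("need at least two items")

    # Transpose rows (respondents) into columns (items).
    items = list(zip(*responses))
    item_variance_sum = sum(variance(column) for column in items)

    # Variance of each respondent's total score across all items.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")

    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical coded answers: 4 respondents, 3 items (illustrative data only).
toy_responses = [
    [4, 3, 4],
    [2, 2, 3],
    [3, 3, 3],
    [1, 2, 1],
]
alpha = cronbach_alpha(toy_responses)
```

Values closer to 1 indicate that the items vary together; a commonly cited rule of thumb treats values of about 0.7 and above as acceptable, which matches the abstract's reading of 0.735.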

44

Ping, Deng Li, Guo Bing, and Zheng Wen. "Web Service Clustering Approach Based on Network and Fused Document-Based and Tag-Based Topics Similarity." International Journal of Web Services Research 18, no. 3 (July 2021): 63–81. http://dx.doi.org/10.4018/ijwsr.2021070104.

Abstract:

Producing a web service clustering that satisfies many requirements is challenging. In this article, the authors propose a new approach with two models for the service clustering problem. First, a document-tag LDA model (DTag-LDA) is proposed that takes into account the tag information of web services, since tags can accurately describe the useful information in documents. Building on the first model, the article further proposes an efficient document-weight and tag-weight LDA model (DTw-LDA), which fuses a multi-modal data network. To further improve clustering accuracy, the model constructs separate networks describing text and tags and then merges the two networks to generate the web service network that is clustered. In addition, the article designs experiments, based on service classification, to verify that the auxiliary information helps extract more accurate semantics. The proposed method shows clear advantages in precision, recall, purity, and other performance measures.

45

Feng, Jian, Ying Zhang, and Yuqiang Qiao. "A Detection Method for Phishing Web Page Using DOM-Based Doc2Vec Model." Journal of Computing and Information Technology 28, no. 1 (July 10, 2020): 19–31. http://dx.doi.org/10.20532/cit.2020.1004899.

Abstract:

Detecting phishing web pages is a challenging task. Existing detection methods based on the DOM (Document Object Model) mainly aim to obtain structural characteristics but ignore the overall representation of web pages and the semantic information that HTML tags may carry. This paper treats DOMs as a natural language using the Doc2Vec model and learns the structural semantics automatically to detect phishing web pages. First, the DOM structure of the obtained web page is parsed to construct the DOM tree; then the Doc2Vec model is used to vectorize the DOM tree and to measure the semantic similarity between web pages by the distance between their DOM vectors. Finally, hierarchical clustering is used to cluster the web pages. Experiments show that the proposed method achieves higher recall and precision for phishing classification than a DOM-based structural clustering method and a TF-IDF-based semantic clustering method. The results show that applying Paragraph Vector to the DOM in a linguistic approach is effective.
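
The pipeline described in this abstract (parse the DOM, vectorize it, measure distances, cluster hierarchically) can be illustrated with a small standard-library sketch. This is not the authors' implementation: a plain bag-of-tags count vector stands in for the Doc2Vec embedding, average linkage with a fixed distance cut-off stands in for whatever hierarchical clustering settings the paper uses, and the names `TagCollector`, `tag_vector`, `cosine_distance`, `agglomerative`, the threshold, and the sample pages are all assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser
from math import sqrt


class TagCollector(HTMLParser):
    """Records the name of every start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def tag_vector(html):
    """Bag-of-tags vector for one page (a stand-in for a Doc2Vec embedding)."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return Counter(parser.tags)


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(count * b[tag] for tag, count in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative(vectors, threshold):
    """Average-linkage agglomerative clustering, stopped at a distance cut-off."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if dist > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical pages: two login-style forms and one text-only page.
pages = [
    "<html><body><form><input><input></form></body></html>",
    "<html><body><form><input><input><input></form></body></html>",
    "<html><body><p>a</p><p>b</p><p>c</p></body></html>",
]
clusters = agglomerative([tag_vector(p) for p in pages], threshold=0.3)
```

With these sample pages, the two form-based pages fall into one cluster and the text-only page stays on its own. A count vector ignores tag order and nesting, which is the kind of structural context an embedding model is meant to capture.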

46

Chahal, Poonam, and Manjeet Singh. "An Efficient Approach for Ranking of Semantic Web Documents by Computing Semantic Similarity and Using HCS Clustering." International Journal of Semiotics and Visual Rhetoric 5, no. 1 (January 2021): 45–56. http://dx.doi.org/10.4018/ijsvr.2021010104.

Abstract:

In today's era, with the huge amount of dynamic information available on the World Wide Web (WWW), it is difficult for users to search for and retrieve relevant information. One technique used in information retrieval is clustering, after which the web documents are ranked to provide users with information matching their query. In this paper, the semantic similarity score of Semantic Web documents is computed using a semantic-based similarity feature that combines latent semantic analysis (LSA) and latent relational analysis (LRA). LSA and LRA help determine the relevant concepts and the relationships between them, which in turn correspond to the words and the relationships between those words. The extracted interrelated concepts are represented by a graph that captures the semantic content of the web document. From this graph representation of each document, the HCS clustering algorithm is used to extract the most connected subgraph, constructing a varying number of clusters according to the information-theoretic approach. The web documents in the clusters, in graph form, are ranked using the text-rank method in combination with the proposed method. The experimental analysis uses the benchmark dataset OpinRank. The approach to ranking web documents using semantic-based clustering has shown promising results.

47

Bagban, T. I., and P. J. Kulkarni. "On Applying Document Similarity Measures for Template based Clustering of Web Documents." International Journal of Computer Sciences and Engineering 06, no. 01 (February 28, 2018): 37–42. http://dx.doi.org/10.26438/ijcse/v6si1.3742.

48

Lei, Jingsheng. "A Fuzzy Clustering Technology Based on Hierarchical Neural Networks for Web Document." Journal of Computer Research and Development 43, no. 10 (2006): 1695. http://dx.doi.org/10.1360/crad20061003.

49

Taheri Khameneh, Behnam, and Hamid Shokrzadeh. "Hierarchical Fuzzy Clustering Semantics (HFCS) in Web Document for Discovering Latent Semantics." Signal and Data Processing 17, no. 1 (June 1, 2020): 29–46. http://dx.doi.org/10.29252/jsdp.17.1.29.

50

Ponomarev, I. "Development of an automated system for clustering text documents." System technologies 1, no. 138 (March 30, 2022): 115–19. http://dx.doi.org/10.34185/1562-9945-1-138-2022-10.

Abstract:

Grouping texts into groups similar in content is a common task in various fields of human activity. Text document clustering is used to automatically categorize text documents, filter emails, group web pages in search engines, and so on. Automation of this process can significantly reduce the time spent on this task.
