Journal articles on the topic 'Web Crawler'

Consult the top 50 journal articles for your research on the topic 'Web Crawler.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press on it, and we will generate automatically the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as pdf and read online its abstract whenever available in the metadata.

Browse journal articles on a wide variety of disciplines and organise your bibliography correctly.

1

Feng, Guilian. "Implementation of Web Data Mining Technology Based on Python." Journal of Physics: Conference Series 2066, no. 1 (November 1, 2021): 012033. http://dx.doi.org/10.1088/1742-6596/2066/1/012033.

Abstract:
With the arrival of the big data era, people have gradually realized the importance of data: data is not just a resource, it is an asset. This paper studies the implementation of web data mining technology based on Python. It analyzes the overall architecture of a distributed web crawler system and then examines in detail the principles of its URL management, page fetching, page parsing, and data storage modules. Each module of the crawler system was tested on an experimental machine, and the collected data were summarized for comparative analysis. The main contribution lies in the design and implementation of a distributed web crawler system that, to a certain extent, addresses the slow speed, low efficiency, and poor scalability of a traditional single-machine crawler and improves the speed and efficiency with which web page data are gathered.
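To make the module split concrete, the following is a minimal single-process sketch of the structure the abstract describes (URL frontier, fetch, parse, store); the function names, the regex-based link extraction, and the in-memory repository are illustrative assumptions rather than the paper's distributed implementation.

```python
# Minimal sketch of the module split described above: URL frontier, page
# fetcher, link parser, and storage. Names and structure are illustrative.
from collections import deque
from urllib.parse import urljoin
import re

import requests  # any HTTP client works; requests is assumed here

LINK_RE = re.compile(r'href="(http[^"]+)"')

def fetch(url: str) -> str:
    """Crawl function module: download one page."""
    return requests.get(url, timeout=10).text

def parse_links(base_url: str, html: str) -> list:
    """Parsing function module: extract absolute links from the page."""
    return [urljoin(base_url, href) for href in LINK_RE.findall(html)]

def store(url: str, html: str, repository: dict) -> None:
    """Storage function module: keep the raw page keyed by URL."""
    repository[url] = html

def crawl(seeds: list, max_pages: int = 100) -> dict:
    """URL function module: frontier queue with de-duplication."""
    frontier, seen, repository = deque(seeds), set(seeds), {}
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        try:
            html = fetch(url)
        except requests.RequestException:
            continue
        store(url, html, repository)
        for link in parse_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return repository
```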
2

Liu, Dong Fei, and Xian Shuang Fan. "Study and Application of Web Crawler Algorithm Based on Heritrix." Advanced Materials Research 219-220 (March 2011): 1069–72. http://dx.doi.org/10.4028/www.scientific.net/amr.219-220.1069.

Abstract:
This paper first introduces the web crawler's role in a search engine. Based on a detailed analysis of the architecture of the open-source web crawler Heritrix, it designs a dedicated parser that parses a particular web site to achieve site-specific crawling. It then removes the impact of the robots.txt file on individual processors and introduces the ELFHash algorithm to achieve efficient, multi-threaded access to crawling resources. Finally, by comparing page-crawl speed before and after the improvement and analyzing the number of pages crawled over the same period, the paper verifies that the performance of the improved web crawler increases noticeably.
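The ELFHash named above is the classic 32-bit ELF string hash. Below is a hedged sketch of how it could be used to spread URLs over crawl threads; the worker-assignment function is an assumption for illustration, not Heritrix's actual code.

```python
def elf_hash(text: str) -> int:
    """Classic ELF hash over the bytes of a string (32-bit)."""
    h = 0
    for byte in text.encode("utf-8"):
        h = ((h << 4) + byte) & 0xFFFFFFFF
        high = h & 0xF0000000
        if high:
            h ^= high >> 24
        h &= ~high & 0xFFFFFFFF
    return h

def assign_worker(url: str, num_threads: int) -> int:
    """Map a URL to one of the crawl threads by its ELF hash, so the same
    URL always lands on the same thread and threads do not overlap."""
    return elf_hash(url) % num_threads

print(assign_worker("http://example.com/page/1", 8))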
3

Yang, Juan. "Analysis on the Judicial Interpretation of the Crawler Technology Infringing on the Intellectual Property Rights of Enterprise Data." E3S Web of Conferences 251 (2021): 01038. http://dx.doi.org/10.1051/e3sconf/202125101038.

Abstract:
In the practical identification of web crawler infringement and related crimes, there is a tendency to "weaken the infringement typology and strengthen the presumption of legal interest". This is also the basic method for identifying subsequent infringement when crawler technology is applied commercially to enterprise data. Drawing on previous work, this article summarizes the criminal risks of abusing crawler technology. The author discusses the judicial interpretation of crawler technology that infringes the intellectual property interests of enterprise data from three sides: the categories of data crawled by various crawlers, how the type of data crawled determines which laws apply, and clauses that extend the concept of trade secrets and determine the relevant standards.
4

Boppana, Venugopal, and Sandhya P. "Focused crawling from the basic approach to context aware notification architecture." Indonesian Journal of Electrical Engineering and Computer Science 13, no. 2 (February 1, 2019): 492. http://dx.doi.org/10.11591/ijeecs.v13.i2.pp492-498.

Abstract:
The large and wide range of information available has made it hard for crawlers and search engines to extract related information. This paper discusses focused crawlers, also called topic-specific crawlers, and variations of focused crawlers leading to a distributed architecture, i.e., context aware notification architecture. To get the relevant pages from the huge amount of information available on the internet we use the focused crawler, which can retrieve the relevant pages for a given topic with fewer searches in a short time. Here the input to the focused crawler is a topic specified using exemplary documents rather than keywords. Focused crawlers avoid searching all web documents; instead they search over the links that are relevant to the crawler boundary. The focused crawling mechanism helps to save CPU time to a large extent and to keep the crawl up to date.
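A hedged sketch of the crawl-boundary idea described above: links are only expanded from pages judged relevant, so the crawler stays near the topic. The `relevance` scoring below is a placeholder (the paper derives the topic from exemplary documents, not keywords), and `fetch`/`extract_links` are assumed helper callables.

```python
import heapq

def relevance(page_text: str, topic_profile: set) -> float:
    """Placeholder relevance score: fraction of profile terms on the page.
    Any classifier returning a score in [0, 1] can be substituted here."""
    words = set(page_text.lower().split())
    return len(words & topic_profile) / max(len(topic_profile), 1)

def focused_crawl(seeds, fetch, extract_links, topic_profile,
                  threshold=0.2, max_pages=200):
    # Max-heap ordered by parent-page relevance (negated for heapq).
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen, relevant_pages = set(seeds), []
    while frontier and len(relevant_pages) < max_pages:
        _, url = heapq.heappop(frontier)
        text = fetch(url)
        score = relevance(text, topic_profile)
        if score < threshold:
            continue          # stay inside the crawl boundary
        relevant_pages.append((url, score))
        for link in extract_links(url, text):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return relevant_pages
```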
5

Mani Sekhar, S. R., G. M. Siddesh, Sunilkumar S. Manvi, and K. G. Srinivasa. "Optimized Focused Web Crawler with Natural Language Processing Based Relevance Measure in Bioinformatics Web Sources." Cybernetics and Information Technologies 19, no. 2 (June 1, 2019): 146–58. http://dx.doi.org/10.2478/cait-2019-0021.

Abstract:
With the fast growth of digital technologies, crawlers and search engines face unpredictable challenges. Focused web crawlers are essential for mining the boundless data available on the internet, but they face an indeterminate latency problem due to differences in response time. The proposed work attempts to optimize the design and implementation of focused web crawlers using a master-slave architecture for bioinformatics web sources. Focused crawlers ideally should crawl only relevant pages, but the relevance of a page can only be estimated after the genomics pages have been crawled. A solution for predicting page relevance, based on natural language processing, is proposed in the paper: the frequency of the keywords in the top-ranked sentences of a page determines its relevance within genomics sources. The proposed solution uses the TextRank algorithm to rank the sentences and to ensure the correct classification of bioinformatics web pages. Finally, the model is validated by comparison with a breadth-first search web crawler; the comparison shows a significant reduction in run time for the same harvest rate.
6

Lu, Houqing, Donghui Zhan, Lei Zhou, and Dengchao He. "An Improved Focused Crawler: Using Web Page Classification and Link Priority Evaluation." Mathematical Problems in Engineering 2016 (2016): 1–10. http://dx.doi.org/10.1155/2016/6406901.

Abstract:
A focused crawler is topic-specific and aims to selectively collect web pages relevant to a given topic from the Internet. However, the performance of current focused crawling easily suffers from the page environment and from multi-topic web pages: a highly relevant region may be ignored owing to the low overall relevance of its page, and anchor text or link context may misguide the crawler. To solve these problems, this paper proposes a new focused crawler. First, a web page classifier is built on an improved term weighting approach (ITFIDF) in order to obtain highly relevant web pages. In addition, the paper introduces a link evaluation approach, link priority evaluation (LPE), which combines a web page content block partition algorithm with a joint feature evaluation (JFE) strategy to better judge the relevance between URLs on a page and the given topic. The experimental results demonstrate that the classifier using ITFIDF outperforms TFIDF, and that the proposed focused crawler is superior to focused crawlers based on breadth-first, best-first, anchor text only, link-context only, and content block partition in terms of harvest rate and target recall. In conclusion, the proposed methods are significant and effective for focused crawling.
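For reference, the plain TF-IDF baseline that the proposed ITFIDF weighting improves on can be written as below; the ITFIDF modification itself is not reproduced here.

```python
import math
from collections import Counter

def tfidf(documents):
    """Plain TF-IDF term weights per tokenized document
    (the paper's ITFIDF modifies this baseline weighting)."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights
```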
7

Qiu, Zhao, Ceng Jun Dai, and Tao Liu. "Design of Theme Crawler for Web Forum." Applied Mechanics and Materials 548-549 (April 2014): 1330–33. http://dx.doi.org/10.4028/www.scientific.net/amm.548-549.1330.

Abstract:
A web crawler is a web information extraction tool that downloads web pages from the internet for the search engine. The implementation strategy and operating efficiency of the crawling program directly influence the results of subsequent work. Aiming at the shortcomings of ordinary crawlers, the paper puts forward a practical and efficient theme-precise crawling method for web forums (BBS). Tailored to BBS characteristics, the method addresses web page parsing, theme correlation analysis, and the crawling strategy, and uses template configuration to analyze and crawl forum articles. The method outperforms a general crawler in performance, accuracy, and coverage.
8

Subatra Devi, S. "A Novel Approach on Focused Crawling With Anchor Text." Asian Journal of Computer Science and Technology 7, no. 1 (May 5, 2018): 7–15. http://dx.doi.org/10.51983/ajcst-2018.7.1.1849.

Abstract:
A novel approach to focused crawling for various anchor texts is discussed in this paper. Most search engines search the web with the anchor text to retrieve the relevant pages and answer the queries given by the users. The crawler usually searches the web pages and filters out the unnecessary pages, which can be done through focused crawling. A focused crawler generates its boundary to crawl the relevant pages based on the links and ignores the irrelevant pages on the web. In this paper, an effective focused crawling method is implemented to improve the quality of the search. Three learning phases are considered, namely content-based, link-based, and sibling-based learning, to improve the navigation of the search. In this approach, the crawler crawls through the relevant pages efficiently, and more relevant pages are retrieved in an effective way. It is shown experimentally that a larger number of relevant pages are retrieved for different anchor texts with the three learning phases using focused crawling.
9

Ro, Inwoo, Joong Soo Han, and Eul Gyu Im. "Detection Method for Distributed Web-Crawlers: A Long-Tail Threshold Model." Security and Communication Networks 2018 (December 4, 2018): 1–7. http://dx.doi.org/10.1155/2018/9065424.

Abstract:
This paper proposes an advanced countermeasure against distributed web crawlers. We investigated other crawler detection methods and analyzed how distributed crawlers can bypass them. Our method detects distributed crawlers by focusing on the property that web traffic follows a power-law distribution: when web pages are sorted by the number of requests, most requests are concentrated on the most frequently requested pages. In addition, there will be some web pages that normal users do not generally request, but crawlers will request them because their algorithms iterate over every item encountered while parsing pages. Therefore, if some IP addresses frequently request web pages located in the long-tail area of the power-law distribution graph, those IP addresses can be classified as crawler nodes. Experimental results with NASA web traffic data showed that our method identified distributed crawlers with 0.0275% false positives, whereas a conventional frequency-based detection method shows 2.882% false positives at an equal access threshold.
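A simplified sketch of the long-tail idea: rank pages by request count, mark the rarely requested tail, and flag IP addresses that hit the tail often. The tail fraction and hit threshold below are illustrative values, not the thresholds used in the paper.

```python
from collections import Counter

def detect_crawlers(log_entries, tail_fraction=0.5, hit_threshold=10):
    """log_entries: list of (ip, page) pairs from an access log.
    Pages are sorted by popularity; the least-requested `tail_fraction`
    of pages form the long tail. IPs with many long-tail hits are flagged."""
    page_counts = Counter(page for _, page in log_entries)
    ranked = [page for page, _ in page_counts.most_common()]
    tail = set(ranked[int(len(ranked) * (1 - tail_fraction)):])
    tail_hits = Counter(ip for ip, page in log_entries if page in tail)
    return {ip for ip, hits in tail_hits.items() if hits >= hit_threshold}
```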
10

Sakunthala Prabha, K. S., C. Mahesh, and S. P. Raja. "An Enhanced Semantic Focused Web Crawler Based on Hybrid String Matching Algorithm." Cybernetics and Information Technologies 21, no. 2 (June 1, 2021): 105–20. http://dx.doi.org/10.2478/cait-2021-0022.

Abstract:
A topic-precise crawler is a special-purpose web crawler that downloads web pages relevant to a particular topic by measuring a cosine similarity or semantic similarity score. The cosine-based similarity measure yields an inaccurate relevance score if the topic term does not occur directly in the web page. The semantic-based similarity measure provides a precise relevance score even if only synonyms of the given topic occur in the page, but when the topic is unavailable in the ontology, semantic focused crawlers also produce inaccurate relevance scores. This paper overcomes these problems with a hybrid string-matching algorithm that combines the semantic similarity-based measure with a probabilistic similarity-based measure. The experimental results reveal that the algorithm increases the efficiency of focused web crawlers and achieves a better Harvest Rate (HR), Precision (P), and Irrelevance Ratio (IR) than existing focused web crawlers.
11

ALQARALEH, Saed, Omar RAMADAN, and Muhammed SALAMAH. "Efficient watcher based web crawler design." Aslib Journal of Information Management 67, no. 6 (November 16, 2015): 663–86. http://dx.doi.org/10.1108/ajim-02-2015-0019.

Abstract:
Purpose – The purpose of this paper is to design a watcher-based crawler (WBC) that can crawl static and dynamic web sites and download only the updated and newly added web pages. Design/methodology/approach – In the proposed WBC, a watcher file, which can be uploaded to a web site's server, prepares a report that contains the addresses of the updated and newly added web pages. In addition, the WBC is split into five units, each responsible for a specific crawling process. Findings – Several experiments have been conducted, and it has been observed that the proposed WBC increases the number of uniquely visited static and dynamic web sites compared with existing crawling techniques. In addition, the proposed watcher file not only allows crawlers to visit only the updated and newly added pages but also solves the crawler overlapping and communication problems. Originality/value – The proposed WBC performs all crawling processes in the sense that it detects all updated and newly added pages automatically, without explicit human intervention and without downloading entire web sites.
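As a hedged, server-side sketch of what a watcher-file report could look like: hash every served file and list the paths that are new or changed since the previous run. The snapshot file name and JSON format are assumptions; the paper does not specify how its watcher file is produced.

```python
import hashlib
import json
import os

def build_report(doc_root: str, snapshot_path: str = "watcher_snapshot.json"):
    """Report paths under doc_root that are new or changed since the last run.
    The returned list is what a crawler would then download."""
    previous = {}
    if os.path.exists(snapshot_path):
        with open(snapshot_path) as f:
            previous = json.load(f)
    current, changed = {}, []
    for root, _, files in os.walk(doc_root):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            current[path] = digest
            if previous.get(path) != digest:   # new or updated page
                changed.append(path)
    with open(snapshot_path, "w") as f:
        json.dump(current, f)
    return changed
```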
12

Chen, Xing, Wei Jiang Li, Tie Jun Zhao, and Xing Hai Piao. "Design of the Distributed Web Crawler." Advanced Materials Research 204-210 (February 2011): 1454–58. http://dx.doi.org/10.4028/www.scientific.net/amr.204-210.1454.

Abstract:
At the current scale of the Internet, a single web crawler is unable to visit the entire web in an acceptable time frame, so we developed a distributed web crawler system to deal with this. In our design we consider two facets of parallelism: multi-threading within individual nodes and distributed parallelism among the nodes, with the focus on distribution and parallelism between nodes. We address two issues of the distributed web crawler: the crawl strategy and dynamic configuration. The experimental results show that a hash function based on the web site achieves the goal of the distributed web crawler. At the same time, while pursuing load balance across the system, we also reduce communication and management overhead as much as possible.
13

Xie, Dong Xiang, and Wen Feng Xia. "Design and Implementation of the Topic-Focused Crawler Based on Scrapy." Advanced Materials Research 850-851 (December 2013): 487–90. http://dx.doi.org/10.4028/www.scientific.net/amr.850-851.487.

Abstract:
E-commerce websites contain abundant commercial data, and information that is very beneficial to market analysis and prediction can be discovered from these data by applying data mining techniques. A topic-focused web crawler can crawl and gather subject-related web pages as quickly as possible. This paper designs and implements a topic-focused crawler based on Scrapy. It first introduces the design idea of the crawler and highlights the function of each part of Scrapy. Then, the topic-focused crawler is used to capture information from a C2C e-commerce platform such as TaoBao. Finally, it reports the running results and compares the crawling performance of the Scrapy-based crawler with a general crawler.
14

Sharma, Dilip Kumar, and A. K. Sharma. "A Novel Architecture for Deep Web Crawler." International Journal of Information Technology and Web Engineering 6, no. 1 (January 2011): 25–48. http://dx.doi.org/10.4018/jitwe.2011010103.

Abstract:
A traditional crawler picks up a URL, retrieves the corresponding page and extracts various links, adding them to the queue. A deep Web crawler, after adding links to the queue, checks for forms; if forms are present, it processes them and retrieves the required information. Various techniques have been proposed for crawling deep Web information, but much remains undiscovered. In this paper, the authors analyze and compare important deep Web information crawling techniques to find their relative limitations and advantages. To minimize the limitations of existing deep Web crawlers, a novel architecture is proposed based on QIIIEP specifications (Sharma & Sharma, 2009). The proposed architecture is cost effective and offers both privatized and general search for deep Web data hidden behind HTML forms.
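The form check that distinguishes a deep Web crawler from a traditional one can be illustrated with a small parser that records each form's action URL and field names; this is a standard-library sketch, not the authors' architecture.

```python
from html.parser import HTMLParser

class FormFinder(HTMLParser):
    """Collect the action URL and field names of every HTML form on a page,
    the signal a deep Web crawler uses to decide whether to probe it."""
    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.forms.append({"action": attrs.get("action", ""), "fields": []})
        elif tag in ("input", "select", "textarea") and self.forms:
            name = attrs.get("name")
            if name:
                self.forms[-1]["fields"].append(name)

def find_forms(html: str):
    parser = FormFinder()
    parser.feed(html)
    return parser.forms
```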
15

Basaligheh, Prof Parvaneh. "Mining Of Deep Web Interfaces Using Multi Stage Web Crawler." International Journal of New Practices in Management and Engineering 9, no. 04 (December 31, 2020): 11–16. http://dx.doi.org/10.17762/ijnpme.v9i04.91.

Abstract:
As the deep web grows at a very high rate, there has been increased interest in techniques that help locate deep-web interfaces efficiently. However, due to the large volume of web resources and the dynamic nature of the deep web, achieving wide coverage and high efficiency is a challenging problem. This project proposes a three-stage framework for efficiently harvesting deep web interfaces. In the first stage, the web crawler performs site-based searching for center pages with the help of search engines, avoiding visits to a large number of pages. To achieve more accurate results for a focused crawl, the crawler ranks websites to prioritize highly relevant ones for a given topic. In the second stage, the proposed framework opens the web pages within the application with the help of the Jsoup API, preprocesses them, and counts the occurrences of the query in the pages. In the third stage, it performs frequency analysis based on TF and IDF and uses a combined TF*IDF score to rank web pages. To eliminate bias toward visiting some highly relevant links in hidden web directories, a link tree data structure is proposed to achieve wider coverage of a site. Experimental results on a set of representative domains show the agility and accuracy of the proposed crawler framework, which efficiently retrieves deep-web interfaces from large-scale sites and achieves higher harvest rates than other crawlers that use a naive Bayes algorithm.
16

Kumar, Ashwani, Anuj Kumar, and Rahul Mishra. "Effective Concentrated Web Crawling Approach Path for Google." International Journal of Advanced Research in Computer Science and Software Engineering 7, no. 11 (December 8, 2017): 1. http://dx.doi.org/10.23956/ijarcsse.v7i11.459.

Abstract:
A focused crawler traverses the World Wide Web, choosing pages relevant to a predefined topic and ignoring those outside it. Collecting domain-specific documents with focused crawlers is considered one of the most important ways to find relevant data. While browsing the Internet, it is difficult to deal with irrelevant pages and to predict which links lead to quality pages. Most focused crawlers use a local search algorithm to traverse the web space, but they can easily become trapped within a bounded subgraph of the web around the starting URLs, and relevant pages are missed when no links from the starting URLs reach them. To address this problem we design a focused crawler that calculates the absolute frequency of the topic keyword and also computes the keyword's synonyms and sub-synonyms. A weight table is constructed according to the user query, the similarity of web pages with respect to the topic keywords is checked, and the priority of each extracted link is computed.
17

Gunawan, Dani, Amalia Amalia, and Atras Najwan. "Improving Data Collection on Article Clustering by Using Distributed Focused Crawler." Data Science: Journal of Computing and Applied Informatics 1, no. 1 (July 18, 2017): 1–12. http://dx.doi.org/10.32734/jocai.v1.i1-82.

Abstract:
Collecting or harvesting data from the Internet is often done using a web crawler. A general web crawler can be developed to focus on a certain topic; this type of crawler is called a focused crawler. To improve data collection performance, creating a focused crawler alone is not enough, even though a focused crawler makes efficient use of network bandwidth and storage capacity. This research proposes a distributed focused crawler to improve crawling performance while remaining efficient in network bandwidth and storage capacity. The distributed focused crawler implements crawl scheduling, site ordering to determine the URL queue, and focused crawling using Naïve Bayes. The research also tests crawling performance under multithreading and observes CPU and memory utilization. The conclusion is that crawling performance decreases when too many threads are used: CPU and memory utilization become very high, while the performance of the distributed focused crawler drops.
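A minimal sketch of the Naïve Bayes relevance gate described above, using scikit-learn as one possible implementation; the two training snippets are placeholders standing in for genuinely labelled articles.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder labelled pages (1 = on-topic, 0 = off-topic); real training
# data would come from previously judged articles.
train_texts = ["election coverage and political analysis ...",
               "cinema schedule and movie reviews ..."]
train_labels = [1, 0]

classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(train_texts, train_labels)

def is_relevant(page_text: str) -> bool:
    """Gate used by the focused crawler before enqueuing a page's links."""
    return bool(classifier.predict([page_text])[0])
```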
18

Mayal, Deepak. "Analysis on Web Crawling Algorithms." International Journal on Recent and Innovation Trends in Computing and Communication 6, no. 12 (December 31, 2018): 33–36. http://dx.doi.org/10.17762/ijritcc.v6i12.5216.

Abstract:
The World Wide Web (WWW), also referred to as the web, acts as a vital source of information, and searching over the web has become very easy nowadays thanks to search engines such as Google and Yahoo. A search engine is a complex multi-program system that allows users to search information available on the web, and for that purpose it uses web crawlers. A web crawler systematically browses the World Wide Web. Effective search helps avoid downloading and visiting irrelevant web pages; to achieve this, web crawlers use different search algorithms. This paper reviews different web crawling algorithms that determine the fate of the search system.
19

Singh Ahuja, Mini, Dr Jatinder Singh Bal, and Varnica. "Web Crawler: Extracting the Web Data." International Journal of Computer Trends and Technology 13, no. 3 (July 25, 2014): 132–37. http://dx.doi.org/10.14445/22312803/ijctt-v13p128.

20

Khine, Su Mon, and Yadana Thein. "Myanmar Web Pages Crawler." International Journal on Web Service Computing 6, no. 1 (March 31, 2015): 01–11. http://dx.doi.org/10.5121/ijwsc.2015.6101.

21

AbuKausar, Md, V. S. Dhaka, and Sanjeev Kumar Singh. "Web Crawler: A Review." International Journal of Computer Applications 63, no. 2 (February 15, 2013): 31–36. http://dx.doi.org/10.5120/10440-5125.

22

Brandman, Onn, Junghoo Cho, Hector Garcia-Molina, and Narayanan Shivakumar. "Crawler-Friendly Web Servers." ACM SIGMETRICS Performance Evaluation Review 28, no. 2 (September 2000): 9–14. http://dx.doi.org/10.1145/362883.362894.

23

Taubes, Gary. "The Web-Crawler Wars." Science 269, no. 5229 (September 8, 1995): 1355. http://dx.doi.org/10.1126/science.269.5229.1355.

24

Liang, Guo Chao, and Cai Feng Cao. "Research and Implementation of LED Optical Design Focused Web Crawler." Applied Mechanics and Materials 543-547 (March 2014): 2941–44. http://dx.doi.org/10.4028/www.scientific.net/amm.543-547.2941.

Abstract:
An LED optical design focused web crawler is proposed based on Shark-Search and a topical dictionary. The crawling strategy is implemented by extending the web crawler Heritrix. The experimental results show that the design scheme (Topic-First) and its web crawler, LED-Crawler, can effectively capture web pages relevant to LED optical design, and that LED-Crawler improves accuracy compared to general search engines.
25

Naik, Deepak Ranoji, and Dr Satish R. Todmal. "Intelligent Web Crawler by Supervised Learning." Journal of Advances and Scholarly Researches in Allied Education 15, no. 4 (June 1, 2018): 99–109. http://dx.doi.org/10.29070/15/57336.

26

Nadkarni, Tushar. "International Journal for Research and Development in Engineering." International Journal of Software Engineering and Technologies (IJSET) 1, no. 2 (August 1, 2016): 83. http://dx.doi.org/10.11591/ijset.v1i2.4571.

Abstract:
Search engines are tremendous force multipliers for end hosts trying to discover content on the Web. As the amount of content online grows, so does dependence on web crawlers to discover relevant content. The motive is to develop an efficient web crawler that returns results more relevant to the search keyword, and faster, while supporting semantics extraction, multithreading, and distributed computing.
27

Pech-May, Fernando, Alicia Martínez-Rebollar, Hugo Estrada-Esquivel, and Eduardo Pedroza-Landa. "CrawNet: Multimedia Crawler Resources for Both Surface and Hidden Web." Lámpsakos, no. 13 (January 1, 2015): 39. http://dx.doi.org/10.21501/21454086.1365.

Abstract:
The web is the most widely used information source in academic, scientific, and industry forums. Its explosive growth has generated billions of pages whose information may be categorized as the surface web, composed of static pages that are indexed, and the hidden web, accessible through search templates. This paper presents the development of a crawler that allows searching, querying, and analyzing information both on the surface web and hidden in specific domains of the web.
28

Hernandez, Julio, Heidy M. Marin-Castro, and Miguel Morales-Sandoval. "A Semantic Focused Web Crawler Based on a Knowledge Representation Schema." Applied Sciences 10, no. 11 (May 31, 2020): 3837. http://dx.doi.org/10.3390/app10113837.

Abstract:
The Web has become the main source of information in the digital world, expanding to heterogeneous domains and continuously growing. By means of a search engine, users can systematically search the web for particular information based on a text query, using a domain-unaware web search tool that maintains real-time information. One type of web search tool is the semantic focused web crawler (SFWC); it exploits the semantics of the Web based on some ontology heuristics to determine which web pages belong to the domain defined by the query. An SFWC is highly dependent on the ontological resource, which is created by human domain experts. This work presents a novel SFWC based on a generic knowledge representation schema to model the crawler's domain, thus reducing the complexity and cost of constructing a more formal representation as in the case of ontologies. Furthermore, a similarity measure based on the combination of the inverse document frequency (IDF) metric, the standard deviation, and the arithmetic mean is proposed for the SFWC. This measure filters web page contents in accordance with the domain of interest during the crawling task. A set of experiments was run over the domains of computer science, politics, and diabetes to validate and evaluate the proposed crawler. The quantitative (harvest ratio) and qualitative (Fleiss' kappa) evaluations demonstrate the suitability of the proposed SFWC for crawling the Web using a knowledge representation schema instead of a domain ontology.
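The abstract names the ingredients of the similarity measure (IDF, standard deviation, arithmetic mean) but not their exact combination, so the following is only one plausible reading under stated assumptions: score a page by the mean IDF of its in-domain terms and accept it if the score falls within k standard deviations of the scores of known in-domain pages.

```python
import math
from statistics import mean, stdev

def idf_table(domain_docs):
    """IDF over a small corpus of in-domain documents (each a set of terms)."""
    n = len(domain_docs)
    return {t: math.log(n / sum(t in d for d in domain_docs))
            for d in domain_docs for t in d}

def page_score(page_terms, idf):
    hits = [idf[t] for t in page_terms if t in idf]
    return mean(hits) if hits else 0.0

def in_domain(page_terms, idf, reference_scores, k=1.0):
    """reference_scores: page_score values of several known in-domain pages.
    Accept a page whose score lies within k standard deviations of their mean.
    This is an assumed combination, not the paper's exact formula."""
    mu, sigma = mean(reference_scores), stdev(reference_scores)
    return abs(page_score(page_terms, idf) - mu) <= k * sigma
```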
29

LŐRINCZ, ANDRÁS, ISTVÁN KÓKAI, and ATTILA MERETEI. "INTELLIGENT HIGH-PERFORMANCE CRAWLERS USED TO REVEAL TOPIC-SPECIFIC STRUCTURE OF THE WWW." International Journal of Foundations of Computer Science 13, no. 04 (August 2002): 477–95. http://dx.doi.org/10.1142/s0129054102001230.

Abstract:
The slogan that "information is power" has undergone a slight change: today, "information updating" is the focus of interest. The largest source of information today is the World Wide Web, and fast search methods are needed to utilize this enormous source. In this paper our novel crawler, which uses support vector classification and on-line reinforcement learning, is described. We launched crawler searches from different sites, including sites that offer, at best, very limited information about the search subject; this case may correspond to typical searches by non-experts. Results indicate that the considerable performance improvement of our crawler over other known crawlers is due to its on-line adaptation property. We used our crawler to characterize basic topic-specific properties of WWW environments. It was found that topic-specific regions have a broad distribution of valuable documents. Expert sites are excellent starting points, whereas mailing lists can form traps for the crawler. These properties of the WWW, together with the emergence of intelligent "high-performance" crawlers that monitor and search for novel information, predict a significant increase in communication load on the WWW in the near future.
30

Pardede, Jasman, Uung Ungkawa, and Muhammad Akbar Bernovaldy. "Implementasi Ontology Pada Web Crawler." MIND Journal 1, no. 1 (November 26, 2018): 76–84. http://dx.doi.org/10.26760/mindjournal.v1i2.76-84.

Abstract:
A web crawler is an automated program or script that works by prioritizing specific rules to browse and retrieve information from web pages on the internet. Indexing is the crawler process that makes it easy for anyone to search for information; in this work, the indexing process is built using an ontology method. The ontology method is a theory about the meaning of an object and its relationships with other objects. In this research, the ontology method is applied to the data retrieval and data grouping processes. The ontology method proceeds by splitting objects according to relation rules to obtain an ontology object; crawling is then performed on that ontology object to obtain crawl results with the ontology. Data grouping is processed based on the objects obtained through the ontology relations. From the results of the research it can be concluded that the percentage of relation objects matching their relations is 100% and that the web crawler with ontology is 56.67% faster than an ordinary web crawler.
31

Mahi, Gurjot Singh, and Amandeep Verma. "Development of Focused Crawlers for Building Large Punjabi News Corpus." Journal of ICT Research and Applications 15, no. 3 (December 28, 2021): 205–15. http://dx.doi.org/10.5614/itbj.ict.res.appl.2021.15.3.1.

Abstract:
Web crawlers are as old as the Internet and are most commonly used by search engines to visit websites and index them into repositories. They are not limited to search engines but are also widely used to build corpora in different domains and languages. This study developed a set of focused web crawlers for three Punjabi news websites. The web crawlers were developed to extract quality text articles and add them to a local repository to be used in further research. The crawlers were implemented in the Python programming language and were used to construct a corpus of more than 134,000 news articles in nine different news genres. The crawler code and the extracted corpora were made publicly available to the scientific community for research purposes.
32

Gupta, Sonali, and Komal Kumar Bhatia. "Optimal Query Generation for Hidden Web Extraction through Response Analysis." International Journal of Information Retrieval Research 4, no. 2 (April 2014): 1–18. http://dx.doi.org/10.4018/ijirr.2014040101.

Abstract:
A huge number of Hidden Web databases exist over the WWW, forming a massive source of high-quality information. Retrieving this information to enrich the repository of the search engine is the prime target of a hidden web crawler; besides this, the crawler should perform the task at an affordable cost and level of resource utilization. This paper proposes a random ranking mechanism whereby the queries to be raised by the hidden web crawler are ranked. By ranking the queries according to the proposed mechanism, the hidden web crawler is able to make an optimal choice among the candidate queries and efficiently retrieve the hidden web databases. The hidden web crawler proposed here also possesses an extensible and scalable framework to improve the efficiency of crawling. The proposed approach is compared with other methods of hidden web crawling existing in the literature.
33

Choudhary, Jaytrilok, and Devshri Roy. "Priority based Semantic Web Crawler." International Journal of Computer Applications 81, no. 15 (November 22, 2013): 10–13. http://dx.doi.org/10.5120/14197-2372.

34

Kumar, K. Praveen. "Crawler for Efficiently Harvesting Web." International Journal of Communication Technology for Social Networking Services 5, no. 1 (March 30, 2017): 7–14. http://dx.doi.org/10.21742/ijctsns.2017.5.1.02.

35

Rungsawang, A., and N. Angkawattanawit. "Learnable topic-specific web crawler." Journal of Network and Computer Applications 28, no. 2 (April 2005): 97–114. http://dx.doi.org/10.1016/j.jnca.2004.01.001.

36

Oh, Hyo-Jung, Dong-Hyun Won, Chonghyuck Kim, Sung-Hee Park, and Yong Kim. "Design and implementation of crawling algorithm to collect deep web information for web archiving." Data Technologies and Applications 52, no. 2 (April 3, 2018): 266–77. http://dx.doi.org/10.1108/dta-07-2017-0053.

Abstract:
Purpose – The purpose of this paper is to describe the development of an algorithm for realizing web crawlers that automatically collect dynamically generated webpages from the deep web. Design/methodology/approach – This study proposes and develops an algorithm that collects web information as if the crawler were gathering static webpages, by managing script commands as links. The proposed web crawler is used to experiment with the algorithm by actually collecting deep webpages. Findings – Among the findings of this study is that when the crawling process receives search results as script pages, a conventional crawler collects only the first page, whereas the proposed algorithm can collect the subsequent deep webpages. Research limitations/implications – To use a script as a link, a human must first analyze the web document. This study uses the web browser object provided by Microsoft Visual Studio as a script launcher, so it cannot collect deep webpages if the web browser object cannot launch the script or if the web document contains script errors. Practical implications – The deep web is estimated to hold 450 to 550 times more information than surface webpages, and its documents are difficult to collect; this algorithm helps enable deep web collection through script runs. Originality/value – This study presents a new method that uses script links instead of the previously adopted keywords, treating a script as an ordinary URL. The conducted experiment shows that the scripts on individual websites need to be analyzed before they can be employed as links.
37

Hien, Ngo Le Huy, Thai Quang Tien, and Nguyen Van Hieu. "Web Crawler: Design And Implementation For Extracting Article-Like Contents." Cybernetics and Physics 9, no. 3 (November 30, 2020): 144–51. http://dx.doi.org/10.35470/2226-4116-2020-9-3-144-151.

Abstract:
The World Wide Web is a large, wealthy, and accessible information system whose number of users is increasing rapidly. To retrieve information from the web according to users' requests, search engines are built to access web pages. As search engine systems play a significant role in cybernetics, telecommunication, and physics, many efforts have been made to enhance their capacity. However, most of the data contained on the web are unmanaged, making it impossible for current search engine mechanisms to access the entire network at once. The web crawler is therefore a critical part of a search engine, navigating and downloading the full text of web pages. Web crawlers may also be applied to detect missing links and for community detection in complex networks and cybernetic systems. However, template-based crawling techniques cannot handle the layout diversity of objects on web pages. In this paper, a web crawler module was designed and implemented and used to extract article-like contents from 495 websites. It uses a machine learning approach with visual cues, trivial HTML, and text-based features to filter out clutter. The outcomes are promising for extracting article-like contents from websites, contributing to the development of search engine systems and to future research towards higher-performance systems.
38

Hao, Zhi Feng, Ze Bin Zhang, Zhao Quan Cai, and Han Huang. "An Improved Crawler Algorithm Based on Hierarchical Structure Preservation." Key Engineering Materials 474-476 (April 2011): 2120–24. http://dx.doi.org/10.4028/www.scientific.net/kem.474-476.2120.

Abstract:
This paper proposes an improved web crawler algorithm that gathers more useful information, since the basic web crawler algorithm is inefficient and tends to crawl useless, repeated information. In the proposed algorithm, website URLs are saved hierarchically to store the website's overall topology, which turns the crisscrossed, complex web URL system from a graph structure into a tree structure. Experiments on an actual BBS website show that the algorithm is much better than the basic web crawler algorithm in crawling speed and in the usefulness of the downloaded information. Furthermore, it provides a structural model on which an incremental crawler algorithm can be built.
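The graph-to-tree idea can be sketched by nesting crawled URLs under their host and path segments; this is an illustrative reconstruction, not the paper's data structure.

```python
from urllib.parse import urlparse

def build_url_tree(urls):
    """Nest URLs by host and path segments, turning the crawled link graph
    into a tree keyed by the site's own hierarchy."""
    tree = {}
    for url in urls:
        parts = urlparse(url)
        node = tree.setdefault(parts.netloc, {})
        for segment in filter(None, parts.path.split("/")):
            node = node.setdefault(segment, {})
    return tree

# Example: pages under the same board/thread share a branch, so a
# re-crawl can skip branches that are already complete.
tree = build_url_tree([
    "http://bbs.example.com/board/1/thread/9",
    "http://bbs.example.com/board/1/thread/10",
])
```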
39

Prieto, Víctor, Manuel Álvarez, Rafael López-García, and Fidel Cacheda. "A scale for crawler effectiveness on the client-side hidden web." Computer Science and Information Systems 9, no. 2 (2012): 561–83. http://dx.doi.org/10.2298/csis111215015p.

Abstract:
The main goal of this study is to present a scale that classifies crawling systems according to their effectiveness in traversing the client-side Hidden Web. First, we perform a thorough analysis of the different client-side technologies and the main features of web pages in order to determine the basic steps of the scale. Then, we define the scale by grouping basic scenarios in terms of several common features, and we propose some methods to evaluate the effectiveness of crawlers according to the levels of the scale. Finally, we present a testing web site and show the results of applying these methods to the output of several open-source and commercial crawlers that tried to traverse the pages. Only a few crawlers achieve good results in treating client-side technologies. Regarding standalone crawlers, we highlight the open-source crawlers Heritrix and Nutch and the commercial crawler WebCopierPro, which is able to process very complex scenarios. With regard to the crawlers of the main search engines, only Google processes most of the scenarios we have proposed, while Yahoo! and Bing deal only with the basic ones. There are not many studies that assess the capacity of crawlers to deal with client-side technologies, and those that exist consider fewer technologies, fewer crawlers, and fewer combinations. Furthermore, to the best of our knowledge, this article provides the first scale for classifying crawlers from the point of view of the most important client-side technologies.
40

Kim, Kwang-Young, Won-Goo Lee, Min-Ho Lee, Hwa-Mook Yoon, and Sung-Ho Shin. "Development of Web Crawler for Archiving Web Resources." Journal of the Korea Contents Association 11, no. 9 (September 28, 2011): 9–16. http://dx.doi.org/10.5392/jkca.2011.11.9.009.

41

Li, Bin, and Ting Zhang. "An Algorithm of Scene Information Collection in General Football Matches Based on Web Documents." Security and Communication Networks 2021 (October 14, 2021): 1–11. http://dx.doi.org/10.1155/2021/5801631.

Abstract:
In order to collect scene information from ordinary football matches more comprehensively, an algorithm for collecting football match scene information based on web documents is proposed. The commonly used T-graph web crawler model is used to collect sample nodes for a specific topic in the football match scene information and then to collect the edge document information of that topic after the crawling stage. Using a feature extraction algorithm based on semantic analysis, the feature items of the football match scene information are extracted according to their similarity to form a web document. By constructing a complex network and introducing the local contribution and overlap coefficient of a community discovery feature selection algorithm, the features of the web document are selected to realize the collection of football match scene information. Experimental results show that the algorithm has a high topic collection capability at low computational cost, that its average balanced accuracy stays around 98%, and that it has strong quantification capabilities for web crawlers and communities.
42

Deshmukh, Mayuri Anantrao. "2 Way Crawling." International Journal of Applied Evolutionary Computation 10, no. 3 (July 2019): 34–39. http://dx.doi.org/10.4018/ijaec.2019070105.

Abstract:
As the deep web grows at a very fast pace, there has been increased interest in techniques that help efficiently locate and check deep web interfaces, so it is important to achieve wide coverage and high efficiency over the large volume of web resources. For this we propose a multistage framework, Smart Crawler, a two-stage crawler used to efficiently harvest deep web interfaces. In the first stage, the crawler performs site-based searching for center pages and avoids visiting non-relevant sites. In the second stage, an adaptive link ranking technique is used to search a relevant site by excavating the most relevant links. To eliminate bias toward visiting highly relevant links hidden in web directories, a link tree data structure is designed to achieve wider coverage of a website. The proposed framework is evaluated experimentally on different domains, and the results show its agility and accuracy: it retrieves deep-web interfaces from a large volume of sites and achieves higher harvest rates than other crawlers.
43

Patel, Shailesh A., and Dr Jayesh M. Patel. "Web Crawler: An Intelligent Agent Through Intellect Webbot." Indian Journal of Applied Research 1, no. 12 (October 1, 2011): 45–48. http://dx.doi.org/10.15373/2249555x/sep2012/16.

44

Thelwall, Mike. "Creating and using Web corpora." International Journal of Corpus Linguistics 10, no. 4 (November 7, 2005): 517–41. http://dx.doi.org/10.1075/ijcl.10.4.07the.

Abstract:
The Web has recently been used as a corpus for linguistic investigations, often with the help of a commercial search engine. We discuss some potential problems with collecting data from commercial search engines and with using the Web as a corpus. We outline an alternative strategy for data collection, using a personal Web crawler. As a case study, the university Web sites of three nations (Australia, New Zealand and the UK) were crawled. The most frequent words were broadly consistent with non-Web written English, but with some academic-related words amongst the top 50 most frequent. It was also evident that the university Web sites contained a significant amount of non-English text, and academic Web English seems to be more future-oriented than British National Corpus written English.
45

Kumar, Manish, Ankit Bindal, Robin Gautam, and Rajesh Bhatia. "Keyword query based focused Web crawler." Procedia Computer Science 125 (2018): 584–90. http://dx.doi.org/10.1016/j.procs.2017.12.075.

46

Yu, Linxuan, Yeli Li, Qingtao Zeng, Yanxiong Sun, Yuning Bian, and Wei He. "Summary of web crawler technology research." Journal of Physics: Conference Series 1449 (January 2020): 012036. http://dx.doi.org/10.1088/1742-6596/1449/1/012036.

47

Thelwall, Mike. "Methodologies for crawler based Web surveys." Internet Research 12, no. 2 (May 2002): 124–38. http://dx.doi.org/10.1108/10662240210422503.

48

Punj, Deepika, and Ashutosh Dixit. "Design of a Migrating Crawler Based on a Novel URL Scheduling Mechanism using AHP." International Journal of Rough Sets and Data Analysis 4, no. 1 (January 2017): 95–110. http://dx.doi.org/10.4018/ijrsda.2017010106.

Abstract:
In order to manage the vast information available on the web, the crawler plays a significant role, and its working should be optimized to get the maximum amount of unique information from the World Wide Web. In this paper, an architecture for a migrating crawler is proposed that is based on URL ordering, URL scheduling, and a document redundancy elimination mechanism. The proposed ordering technique is based on URL structure, which plays a crucial role in utilizing the web efficiently. Scheduling ensures that each URL goes to the optimal agent for downloading; to ensure this, the characteristics of both agents and URLs are taken into consideration. Duplicate documents are also removed to keep the database unique, and to reduce matching time, document matching is performed on the basis of their meta information only. The agents of the proposed migrating crawler work more efficiently than a traditional single crawler by providing ordering and scheduling of URLs.
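The AHP step behind the proposed URL scheduling can be illustrated with the usual column-normalisation approximation of the priority vector; the criteria and pairwise judgments below are made up for the example and are not taken from the paper.

```python
def ahp_weights(pairwise):
    """Approximate AHP priority vector: normalise each column of the
    pairwise comparison matrix, then average across each row."""
    n = len(pairwise)
    col_sums = [sum(pairwise[i][j] for i in range(n)) for j in range(n)]
    norm = [[pairwise[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return [sum(row) / n for row in norm]

# Hypothetical criteria for ordering URLs (e.g., depth, freshness, in-links);
# the comparison values are illustrative only.
criteria = ahp_weights([
    [1,     3,   5],
    [1 / 3, 1,   2],
    [1 / 5, 1 / 2, 1],
])

def url_priority(scores):
    """Weighted sum of a URL's per-criterion scores using the AHP weights."""
    return sum(w * s for w, s in zip(criteria, scores))
```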
49

Sarwosri, Sarwosri, Ahmad Hoirul Basori, and Wahyu Budi Surastyo. "APLIKASI WEB CRAWLER UNTUK WEB CONTENT PADA MOBILE PHONE." JUTI: Jurnal Ilmiah Teknologi Informasi 7, no. 3 (January 1, 2009): 127. http://dx.doi.org/10.12962/j24068535.v7i3.a79.

50

Chen, Xiu Xia, and Wen Qian Shang. "Research and Design of Web Crawler for Music Resources Finding." Applied Mechanics and Materials 543-547 (March 2014): 2957–60. http://dx.doi.org/10.4028/www.scientific.net/amm.543-547.2957.

Abstract:
This paper designs an automatic web crawler system that crawls music resources on the Internet. First, the paper presents the architecture of the system and the function of each module; it then describes the detailed design of each module. Finally, the key technologies and algorithms used in the system are described in detail, including the use of the χ2 statistic to select feature words, the TF-IDF algorithm to calculate the weights of feature words, and the vector space model to compute the correlation between a web page and the music theme.
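The vector space model correlation mentioned above is the cosine between the page's and the topic's TF-IDF vectors; a small sketch follows (the χ2 feature-word selection step is omitted here, and the vectors are assumed to be term-to-weight dictionaries).

```python
import math

def cosine(vec_a, vec_b):
    """Vector space model relevance: cosine of the angle between a page's
    TF-IDF vector and the music-theme TF-IDF vector (both dicts of
    term -> weight)."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[t] * vec_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```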