Academic literature on the topic 'Web page data extraction'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Web page data extraction.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.

Journal articles on the topic "Web page data extraction"

1

Ahmad Sabri, Ily Amalina, and Mustafa Man. "Improving Performance of DOM in Semi-structured Data Extraction using WEIDJ Model." Indonesian Journal of Electrical Engineering and Computer Science 9, no. 3 (March 1, 2018): 752. http://dx.doi.org/10.11591/ijeecs.v9.i3.pp752-763.

Full text
Abstract:
Web data extraction is the process of extracting the information a user requires from a web page. The information consists of semi-structured data rather than data in a structured format, and the extraction operates on web documents in HTML format. Nowadays, most people use web data extractors because the volume of information involved makes manual extraction slow and complicated. We present in this paper the WEIDJ approach to extracting images from the web, whose goal is to harvest images as objects from template-based HTML pages. WEIDJ (Web Extraction of Images using DOM (Document Object Model) and JSON (JavaScript Object Notation)) applies DOM theory to build the structure and uses JSON as the programming environment. The extraction process takes both a web address and an extraction structure as input. WEIDJ then splits the DOM tree into small subtrees and applies a search algorithm over the visual blocks of each web page to find images. Our approach covers three levels of extraction: a single web page, multiple web pages, and the whole web page. Extensive experiments on several biodiversity web pages compare the time performance of image extraction using DOM, JSON and WEIDJ on a single web page. The experimental results show that with our model, WEIDJ image extraction can be done quickly and effectively.
APA, Harvard, Vancouver, ISO, and other styles
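The DOM-to-JSON image-harvesting idea described in the abstract above can be pictured with a minimal sketch. This is not the authors' WEIDJ implementation; the use of Python's standard html.parser, the record fields, and the sample markup are assumptions made purely for illustration.

```python
# Minimal sketch of DOM-to-JSON image harvesting in the spirit of WEIDJ.
# Not the authors' code: the record layout and attribute handling are assumed.
import json
from html.parser import HTMLParser

class ImageCollector(HTMLParser):
    """Walk an HTML document and collect <img> elements as simple records."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            record = dict(attrs)                 # src, alt, width, ...
            record["index"] = len(self.images)   # position in document order
            self.images.append(record)

def extract_images(html_text):
    parser = ImageCollector()
    parser.feed(html_text)
    return json.dumps(parser.images, indent=2)

if __name__ == "__main__":
    sample = '<div><img src="orchid.jpg" alt="Orchid"><img src="fern.png"></div>'
    print(extract_images(sample))
```

Running it on the sample fragment prints one JSON record per image, in document order.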
2

Ahamed, B. Bazeer, D. Yuvaraj, S. Shitharth, Olfat M. Mizra, Aisha Alsobhi, and Ayman Yafoz. "An Efficient Mechanism for Deep Web Data Extraction Based on Tree-Structured Web Pattern Matching." Wireless Communications and Mobile Computing 2022 (May 27, 2022): 1–10. http://dx.doi.org/10.1155/2022/6335201.

Full text
Abstract:
The World Wide Web comprises huge web databases whose data are searched through web query interfaces. Generally, the World Wide Web maintains a set of databases that store many data records, and the distinct data records are extracted by the web query interface according to user requests. The information maintained in the web database is hidden, and deep web content is retrieved even in dynamic script pages. Today, web pages offer a huge amount of structured data that is needed by a variety of recent web-related applications. The challenge lies in extracting complicated structured data from deep web pages. Deep web contents are generally accessed by web queries, but extracting the structured data from the web database is a complex problem. Moreover, making use of such retrieved information in combined structures requires significant effort. Few techniques have been established to address the complexity of extracting deep web data from various web pages. Although several approaches to deep web data extraction have been offered, very little research addresses template-related issues at the page level. For effective web data extraction over a large number of online pages, a unique representation of page generation using tree-based pattern matching (TBPM) is proposed. The performance of the proposed TBPM technique is compared to that of existing techniques in terms of relativity, precision, recall, and time consumption. Gains in relativity of about 17-26% are achieved when compared to the FiVaTech approach.
APA, Harvard, Vancouver, ISO, and other styles
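As a rough illustration of matching template-generated fragments, the sketch below reduces each fragment to its tag sequence and scores similarity with a sequence matcher. This is a deliberate simplification for illustration, not the paper's TBPM algorithm, which operates on full DOM trees.

```python
# Rough sketch of structural matching between page fragments.
# Each fragment is reduced to its tag sequence and compared; a real
# tree-based pattern matcher would compare whole DOM subtrees instead.
from difflib import SequenceMatcher
from html.parser import HTMLParser

class TagSequence(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def tag_signature(fragment):
    parser = TagSequence()
    parser.feed(fragment)
    return parser.tags

def structural_similarity(frag_a, frag_b):
    return SequenceMatcher(None, tag_signature(frag_a), tag_signature(frag_b)).ratio()

if __name__ == "__main__":
    a = "<tr><td>Widget</td><td>9.99</td></tr>"
    b = "<tr><td>Gadget</td><td>4.50</td></tr>"
    print(structural_similarity(a, b))  # close to 1.0 for template-generated rows
```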
3

Ahmad Sabri, Ily Amalina, and Mustafa Man. "A deep web data extraction model for web mining: a review." Indonesian Journal of Electrical Engineering and Computer Science 23, no. 1 (July 1, 2021): 519. http://dx.doi.org/10.11591/ijeecs.v23.i1.pp519-528.

Full text
Abstract:
The World Wide Web has become a large pool of information. Extracting structured data from published web pages has drawn attention in the last decade. The process of web data extraction (WDE) faces many challenges, due to the variety of web data and the unstructured data in hypertext markup language (HTML) files. The aim of this paper is to provide a comprehensive overview of current web data extraction techniques in terms of the quality of the extracted data. This paper focuses on data extraction using wrapper approaches and compares them to identify the best approach for extracting data from online sites. To observe the efficiency of the proposed model, we compare the performance of single web page extraction across different models: the document object model (DOM), the wrapper using hybrid DOM and JSON (WHDJ), the wrapper extraction of images using DOM and JSON (WEIDJ), and WEIDJ (no-rules). Finally, the experiments show that WEIDJ extracts data fastest, with the lowest time consumption among the compared methods.
APA, Harvard, Vancouver, ISO, and other styles
4

Liu, Hong, and Yin Xiao Ma. "Web Data Extraction Research Based on Wrapper and XPath Technology." Advanced Materials Research 271-273 (July 2011): 706–12. http://dx.doi.org/10.4028/www.scientific.net/amr.271-273.706.

Full text
Abstract:
To satisfy people's various needs, some websites consist of pages that are dynamically generated from a common template populated with data, such as product description pages on e-commerce sites. This paper merges wrapper technology with XPath to form a dependable, robust process for web data extraction. Validation of the method in experiments shows that it has high efficiency in extracting list pages.
APA, Harvard, Vancouver, ISO, and other styles
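A hedged sketch of wrapper-style extraction driven by XPath, in the spirit of the paper above. The sample markup, the XPath expressions, and the field names are hypothetical; a real wrapper would be tailored to the target list pages. The sketch assumes the third-party lxml library.

```python
# Illustrative wrapper-style extraction with XPath (not the paper's wrapper).
from lxml import html   # third-party: pip install lxml

LIST_PAGE = """
<html><body>
<ul class="products">
  <li><span class="name">Desk lamp</span><span class="price">19.90</span></li>
  <li><span class="name">Floor lamp</span><span class="price">45.00</span></li>
</ul>
</body></html>
"""

def extract_list_page(page_source):
    tree = html.fromstring(page_source)
    rows = []
    for item in tree.xpath('//ul[@class="products"]/li'):
        rows.append({
            "name": item.xpath('string(.//span[@class="name"])'),
            "price": item.xpath('string(.//span[@class="price"])'),
        })
    return rows

if __name__ == "__main__":
    for row in extract_list_page(LIST_PAGE):
        print(row)
```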
5

Ibrahim, Nadia, Alaa Hassan, and Marwah Nihad. "Big Data Analysis of Web Data Extraction." International Journal of Engineering & Technology 7, no. 4.37 (December 13, 2018): 168. http://dx.doi.org/10.14419/ijet.v7i4.37.24095.

Full text
Abstract:
This study examines large-scale data extraction techniques, including the detection of patterns and hidden relationships between factors in order to bring out the required information. Rapid analysis of massive data can lead to innovation and to concepts of theoretical value. Compared with mining traditional data sets, results drawn from the vast amount of large, heterogeneous, interdependent data have the ability to expand knowledge and ideas about the target domain. In this research we studied data mining on the Internet. The various networks used to extract data from different locations can sometimes appear complex, and web technology has been used to extract and analyse this information (Marwah et al., 2016). In this research, we extracted information from large numbers of web pages and examined the pages of the site using Java code, and we added the extracted information to a dedicated database for the web page. We used the data network function to obtain accurate results in evaluating and categorising the pages found, identifying trusted or risky web pages, and exported the data to a CSV file. These data were then examined and categorised using WEKA to obtain accurate results. We concluded from the results that the applied data mining algorithms outperform other techniques in the classification and extraction of data and offer high performance.
APA, Harvard, Vancouver, ISO, and other styles
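The export step mentioned in the abstract, writing extracted page records to CSV for later classification in a tool such as WEKA, can be sketched as follows. The feature columns and labels are illustrative assumptions, not the study's actual schema.

```python
# Small sketch of exporting extracted page records to CSV for classification.
# The columns (url, link_count, has_https, label) are assumed for illustration.
import csv

records = [
    {"url": "https://example.org/a", "link_count": 42, "has_https": 1, "label": "trusted"},
    {"url": "http://example.net/b",  "link_count": 3,  "has_https": 0, "label": "risky"},
]

with open("pages.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "link_count", "has_https", "label"])
    writer.writeheader()
    writer.writerows(records)
# pages.csv can then be loaded into a tool such as WEKA for classification.
```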
6

Kayed, Mohammed, and Chia-Hui Chang. "FiVaTech: Page-Level Web Data Extraction from Template Pages." IEEE Transactions on Knowledge and Data Engineering 22, no. 2 (February 2010): 249–63. http://dx.doi.org/10.1109/tkde.2009.82.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Deshmukh, Shilpa, et al. "Efficient Methodology for Deep Web Data Extraction." Turkish Journal of Computer and Mathematics Education (TURCOMAT) 12, no. 1S (April 11, 2021): 286–93. http://dx.doi.org/10.17762/turcomat.v12i1s.1769.

Full text
Abstract:
Deep Web contents are accessed through queries submitted to Web databases, and the returned data records are wrapped in dynamically generated Web pages (called deep Web pages in this paper). Extracting structured data from deep Web pages is a challenging problem because of the underlying complex structures of such pages. Until now, a large number of techniques have been proposed to address this problem, but all of them have inherent limitations because they are dependent on the Web page programming language. As the most popular two-dimensional medium, the contents of Web pages are always displayed regularly for users to browse. This motivates us to seek a different path for deep Web data extraction that overcomes the limitations of previous work by utilising some interesting common visual features of deep Web pages. In this paper, a novel vision-based methodology, the Visual Based Deep Web Data Extraction (VBDWDE) algorithm, is proposed. This methodology primarily uses the visual features of deep Web pages to perform deep Web data extraction, including data record extraction and data item extraction. We also propose a new evaluation measure, revision, to capture the amount of human effort needed to produce a perfect extraction. Our experiments on a large set of Web databases show that the proposed vision-based methodology is highly effective for deep Web data extraction.
APA, Harvard, Vancouver, ISO, and other styles
8

GAO, XIAOYING, MENGJIE ZHANG, and PETER ANDREAE. "AUTOMATIC PATTERN CONSTRUCTION FOR WEB INFORMATION EXTRACTION." International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 12, no. 04 (August 2004): 447–70. http://dx.doi.org/10.1142/s0218488504002928.

Full text
Abstract:
This paper describes a domain independent approach for automatically constructing information extraction patterns for semi-structured web pages. Given a randomly chosen page from a web site of similarly structured pages, the system identifies a region of the page that has a regular "tabular" structure, and then infers an extraction pattern that will match the "rows" of the region and identify the data elements. The approach was tested on three corpora containing a series of tabular web sites from different domains and achieved a success rate of at least 80%. A significant strength of the system is that it can infer extraction patterns from a single training page and does not require any manual labeling of the training page.
APA, Harvard, Vancouver, ISO, and other styles
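The core intuition of the paper above, finding a page region whose children repeat the same structure and treating each child as a "row", can be sketched roughly as below. Counting repeated child tags is a simplification of the paper's pattern-inference step; the lxml dependency and the sample page are assumptions made for illustration.

```python
# Hedged sketch: pick the element whose children most often share a tag,
# and treat that element as the "tabular" data region of the page.
from collections import Counter
from lxml import html   # third-party: pip install lxml

def find_tabular_region(page_source):
    tree = html.fromstring(page_source)
    best, best_score = None, 0
    for element in tree.iter():
        child_tags = [child.tag for child in element]
        if not child_tags:
            continue
        _, count = Counter(child_tags).most_common(1)[0]
        if count > best_score:
            best, best_score = element, count
    return best

if __name__ == "__main__":
    page = "<div><p>intro</p><ul><li>row 1</li><li>row 2</li><li>row 3</li></ul></div>"
    region = find_tabular_region(page)
    print(region.tag, [child.text for child in region])   # ul ['row 1', 'row 2', 'row 3']
```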
9

Patnaik, Sudhir Kumar, and C. Narendra Babu. "Trends in web data extraction using machine learning." Web Intelligence 19, no. 3 (December 16, 2021): 169–90. http://dx.doi.org/10.3233/web-210465.

Full text
Abstract:
Web data extraction has seen significant development in the last decade since its inception in the early nineties. It has evolved from a simple manual way of extracting data from web pages and documents, to automated extraction, to intelligent extraction using machine learning algorithms, tools and techniques. Data extraction is one of the key components of the end-to-end life cycle of the web data extraction process, which includes navigation, extraction, data enrichment and visualization. This paper presents the journey of web data extraction over the years, highlighting the evolution of tools, techniques, frameworks and algorithms for building intelligent web data extraction systems. The paper also throws light on challenges, opportunities for future research and emerging trends in web data extraction, with a specific focus on machine learning techniques. Both traditional and machine learning approaches to manual and automated web data extraction are evaluated experimentally, and results are published with a few use cases demonstrating the challenges of web data extraction when the website layout changes. This paper introduces novel ideas such as self-healing capability in web data extraction and proactive error detection in the event of changes in website layout as areas of future research. This unique perspective will help readers to get deeper insights into the present and future of web data extraction.
APA, Harvard, Vancouver, ISO, and other styles
10

Kumaresan, Umamageswari, and Kalpana Ramanujam. "A Framework for Automated Scraping of Structured Data Records From the Deep Web Using Semantic Labeling." International Journal of Information Retrieval Research 12, no. 1 (January 2022): 1–18. http://dx.doi.org/10.4018/ijirr.290830.

Full text
Abstract:
The intent of this research is to come up with an automated web scraping system capable of extracting structured data records embedded in semi-structured web pages. Most of the automated extraction techniques in the literature capture repeated patterns among a set of similarly structured web pages, thereby deducing the template used to generate those web pages, and then extract the data records. All of these techniques rely on computationally intensive operations such as string pattern matching or DOM tree matching and then perform manual labeling of the extracted data records. The technique discussed in this paper departs from the state-of-the-art approaches by determining informative sections in the web page through repetition of informative content rather than syntactic structure. From the experiments, it is clear that the system identified data-rich regions with 100% precision for web sites belonging to different domains. The experiments conducted on real-world web sites prove the effectiveness and versatility of the proposed approach.
APA, Harvard, Vancouver, ISO, and other styles

Dissertations / Theses on the topic "Web page data extraction"

1

Alves, Ricardo João de Freitas. "Declarative approach to data extraction of web pages." Master's thesis, Faculdade de Ciências e Tecnologia, 2009. http://hdl.handle.net/10362/5822.

Full text
Abstract:
Thesis submitted to Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa, in partial fulfilment of the requirements for the degree of Master in Computer Science
In the last few years, we have been witnessing a noticeable Web evolution with the introduction of significant improvements at the technological level, such as the emergence of XHTML, CSS, JavaScript, and Web 2.0, to name just a few. This, combined with other factors such as the physical expansion of the Web and its low cost, has been a great motivator for organizations and the general public to join, with a consequent growth in the number of users, thus influencing the volume of the largest global data repository. In consequence, there has been an increasing need for regular data acquisition from the Web, and because of its frequency, length or complexity, it is only viable to obtain through automatic extractors. However, two main difficulties are inherent to automatic extractors. First, much of the Web's information is presented in visual formats mainly directed at human reading. Second, dynamic web pages are assembled in local memory from different sources, causing some pages not to have a source file. Therefore, this thesis proposes a new and more modern extractor, capable of supporting the Web evolution, generic enough to be used in any situation, and capable of being extended and easily adapted to a more particular use. This project is an extension of an earlier one which was capable of extractions on semi-structured text files. However, it has evolved into a modular extraction system capable of extracting data from web pages and semi-structured text files, and of being expanded to support other data source types. It also contains a more complete and generic validation system and a new data delivery system capable of performing the earlier deliveries as well as new generic ones. A graphical editor was also developed to support the extraction system's features and to allow a domain expert without computer knowledge to create extractions with only a few simple and intuitive interactions on the rendered web page.
APA, Harvard, Vancouver, ISO, and other styles
2

Cheng, Wang. "AMBER : a domain-aware template based system for data extraction." Thesis, University of Oxford, 2015. http://ora.ox.ac.uk/objects/uuid:ff49d786-bfd8-4cd4-a69c-19e81cb95920.

Full text
Abstract:
The web is the greatest information source in human history, yet finding all offers for flats with gardens in London, Paris, and Berlin, or all restaurants open after a screening of the latest blockbuster, remain hard tasks, as that data is not easily amenable to processing. Extracting web data into databases for easier processing has been a resource-intensive process, requiring human supervision for every source from which to extract. This has been changing with approaches that replace human annotators with automated annotations. Such approaches can be successfully applied to restricted settings such as single-attribute extraction or domains with significant redundancy among sources. Multi-attribute objects are often presented on (i) result pages, where multiple objects are presented on a single page as lists, tables or grids, with the most important attributes and a summary description, and (ii) detail pages, where each page provides a detailed list of attributes and a long description for a single entity, often in rich format. Result and detail pages each have their own advantages. Extracting objects from result pages is orders of magnitude faster than from detail pages, and the links to detail pages are often only accessible through result pages; detail pages have a complete list of attributes and a full description of the entity. Early web data extraction approaches require manual annotations for each web site to reach high accuracy, while a number of domain-independent approaches only focus on unsupervised repeated structure segmentation. The former is limited in scaling and automation, while the latter lacks accuracy. Recent automated data extraction systems are often informed with an ontology and a set of object and attribute recognizers; however, they have focused on extracting simple objects with few attributes from single-entity pages and avoided result pages. We present an automatic ontology-based multi-attribute object extraction system, AMBER, which deals with both result and detail pages, achieves very high accuracy (>96%) with zero site-specific supervision, and is able to solve practical issues that arise in real-life data extraction tasks. AMBER is also applied as an important component of DIADEM, the first automatic full-site extraction system that is able to extract structured data from different domains without site-specific supervision, and has been tested through a large-scale evaluation (>10,000 sites). On the result page side, AMBER achieves high accuracy through a novel domain-aware, path-based template discovery algorithm, and integrates annotations for all parts of the extraction, from identifying the primary list of objects, over segmenting the individual objects, to aligning the attributes. Yet AMBER is able to tolerate significant noise in the annotations, by combining these annotations with a novel algorithm for finding regular structures based on XPath expressions that capture regular tree structures. On the detail page side, AMBER seamlessly integrates boilerplate removal, dynamic list identification and page dissimilarity calculation to identify the data region, then employs a set of fairly simple and cheaply computable features for attribute extraction. Besides, AMBER is the first approach that combines result page extraction and detail page extraction by integrating attributes extracted from result pages with the attributes found on corresponding detail pages. AMBER is able to identify attributes of objects with near-perfect accuracy and to extract dozens of attributes with >96% accuracy across several domains, even in the presence of significant noise. It outperforms uninformed, automated approaches by a wide margin if given an ontology. Even in the absence of an ontology, AMBER outperforms most previous systems on record segmentation.
APA, Harvard, Vancouver, ISO, and other styles
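One small piece of the idea above, capturing a regular tree structure with a generalized XPath, can be illustrated by replacing the positional steps that vary across annotated nodes with a wildcard. This is a toy string-level sketch under assumed inputs, not AMBER's discovery algorithm.

```python
# Toy sketch: generalize several absolute XPaths into one template XPath by
# keeping steps that agree and replacing disagreeing steps with a wildcard.
def generalize_xpaths(paths):
    split = [p.strip("/").split("/") for p in paths]
    generalized = []
    for steps in zip(*split):
        generalized.append(steps[0] if len(set(steps)) == 1 else "*")
    return "/" + "/".join(generalized)

if __name__ == "__main__":
    annotated = [
        "/html/body/div[2]/ul/li[1]/span",
        "/html/body/div[2]/ul/li[2]/span",
        "/html/body/div[2]/ul/li[3]/span",
    ]
    print(generalize_xpaths(annotated))   # /html/body/div[2]/ul/*/span
```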
3

Anderson, Neil David Alan. "Data extraction & semantic annotation from web query result pages." Thesis, Queen's University Belfast, 2016. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.705642.

Full text
Abstract:
Our unquenchable thirst for knowledge is one of the few things that really defines our humanity. Yet the Information Age, which we have created, has left us floating aimlessly in a vast ocean of unintelligible data. Hidden Web databases are one massive source of structured data. The contents of these databases are, however, often only accessible through a query posed by a user. The data returned in these Query Result Pages is intended for human consumption and, as such, has nothing more than an implicit semantic structure which can be understood visually by a human reader, but not by a computer. This thesis presents an investigation into the processes of extraction and semantic understanding of data from Query Result Pages. The work is multi-faceted and includes, at the outset, the development of a vision-based data extraction tool. This work is followed by the development of a number of algorithms which make use of machine learning-based techniques, first to align the extracted data into semantically similar groups and then to assign a meaningful label to each group. Part of the work undertaken in fulfilment of this thesis has also addressed the lack of large, modern datasets containing a wide range of result pages representative of those typically found online today. In particular, a new, innovative crowdsourced dataset is presented. Finally, the work concludes by examining techniques from the complementary research field of Information Extraction. An initial, critical assessment of how these mature techniques could be applied to this research area is provided.
APA, Harvard, Vancouver, ISO, and other styles
4

Wu, Yongliang. "Aggregating product reviews for the Chinese market." Thesis, KTH, Kommunikationssystem, CoS, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-91484.

Full text
Abstract:
As of December 2007, the number of Internet users in China had increased to 210 million people. The annual growth rate reached 53.3 percent in 2008, with the average number of Internet users increasing every day by 200,000 people. Currently, China's Internet population is slightly lower than the 215 million Internet users in the United States. [1] Despite the rapid growth of the Chinese economy in the global Internet market, China's e-commerce is not following the traditional pattern of commerce, but instead has developed based on user demand. This growth has extended into every area of the Internet. In the West, expert product reviews have been shown to be an important element in a user's purchase decision. The higher the quality of product reviews that customers receive, the more products they buy from on-line shops. As the number of products and options increases, Chinese customers need impersonal, impartial, and detailed product reviews. This thesis focuses on on-line product reviews and how they affect Chinese customers' purchase decisions. E-commerce is a complex system. As a typical model of e-commerce, we examine a Business to Consumer (B2C) on-line retail site and consider a number of factors, including some seemingly subtle factors that may influence a customer's eventual decision to shop on a website. Specifically, this thesis project examines aggregated product reviews from different on-line sources by analyzing some existing western companies. Following this, the thesis demonstrates how to aggregate product reviews for an e-business website. During this thesis project we found that existing data mining techniques made it straightforward to collect reviews. These reviews were stored in a database, and web applications can query this database to provide a user with a set of relevant product reviews. One of the important issues, just as with search engines, is providing the relevant product reviews and determining what order they should be presented in. In our work we selected the reviews based upon matching the product (although in some cases there are ambiguities concerning whether two products are actually identical or not) and ordered the matching reviews by date, with the most recent reviews presented first. Some of the open questions that remain for the future are: (1) improving the matching, to avoid the ambiguity concerning whether the reviews are about the same product or not, and (2) determining whether the availability of product reviews actually affects a Chinese user's decision to purchase a product.
APA, Harvard, Vancouver, ISO, and other styles
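The selection-and-ordering step described in the abstract, keeping the reviews that match a product and presenting the most recent first, can be sketched as follows; the record fields and the matching rule are illustrative assumptions rather than the thesis's actual data model.

```python
# Minimal sketch: filter reviews by product identifier, newest first.
from datetime import date

reviews = [
    {"product": "lamp-01", "date": date(2009, 3, 2),  "text": "Bright and sturdy."},
    {"product": "lamp-01", "date": date(2009, 5, 18), "text": "Shade arrived cracked."},
    {"product": "desk-07", "date": date(2009, 4, 1),  "text": "Wobbly legs."},
]

def reviews_for(product_id):
    matching = [r for r in reviews if r["product"] == product_id]
    return sorted(matching, key=lambda r: r["date"], reverse=True)

for r in reviews_for("lamp-01"):
    print(r["date"], r["text"])
```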
5

Malchik, Alexander 1975. "An aggregator tool for extraction and collection of data from web pages." Thesis, Massachusetts Institute of Technology, 2000. http://hdl.handle.net/1721.1/86522.

Full text
Abstract:
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2000.
Includes bibliographical references (p. 54-56).
by Alexander Malchik.
M.Eng.
APA, Harvard, Vancouver, ISO, and other styles
6

Kolečkář, David. "Systém pro integraci webových datových zdrojů." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2020. http://www.nusl.cz/ntk/nusl-417239.

Full text
Abstract:
The thesis aims at designing and implementing a web application that will be used for the integration of web data sources. For data integration, a method using the domain model of the target information system was applied. The work describes the individual methods used for extracting information from web pages. The text describes the process of designing the system architecture, including a description of the chosen technologies and tools. The main part of the work is the implementation and testing of the final web application, which is written in Java and the Angular framework. The outcome of the work is a web application that allows its users to define web data sources and save the data in the target database.
APA, Harvard, Vancouver, ISO, and other styles
7

Mazal, Zdeněk. "Extrakce textových dat z internetových stránek." Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2011. http://www.nusl.cz/ntk/nusl-219347.

Full text
Abstract:
This work focuses on data and especially text mining from Web pages, giving an overview of programs for downloading text and ways of extracting it. It also contains an overview of the most frequently used programs for extracting data from the Internet. The output of this thesis is a Java program that can download text from a selection of servers and save it into an XML file.
APA, Harvard, Vancouver, ISO, and other styles
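The final step described above, saving downloaded text into an XML file, might look roughly like the following; the element names, file path, and sample articles are assumptions for illustration, not the thesis's actual output format.

```python
# Minimal sketch of writing extracted article text to an XML file.
import xml.etree.ElementTree as ET

articles = [("http://example.cz/zprava-1", "Text of the first article."),
            ("http://example.cz/zprava-2", "Text of the second article.")]

root = ET.Element("articles")
for url, text in articles:
    item = ET.SubElement(root, "article", url=url)  # one <article> element per page
    item.text = text

ET.ElementTree(root).write("articles.xml", encoding="utf-8", xml_declaration=True)
```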
8

Weng, Daiyue. "Extracting structured data from Web query result pages." Thesis, Queen's University Belfast, 2016. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.709858.

Full text
Abstract:
A rapidly increasing number of Web databases have now become accessible only via their HTML form-based query interfaces. Comparing various services or products from a number of web sites in a specific domain is time-consuming and tedious. There is a demand for value-added Web applications that integrate data from multiple sources. To facilitate the development of such applications, we need to develop techniques for automating the process of providing integrated access to a multitude of database-driven Web sites, and integrating data from their underlying databases. This presents three challenges, namely query form extraction, query form matching and translation, and Web query result extraction. In this thesis, I focus on Web query result extraction, which aims to extract structured data encoded in semi-structured HTML pages and return the extracted data in relational tables. I begin by reviewing the existing approaches for Web query result extraction. I categorize them based on their degree of automation, i.e. manual, semi-automatic and fully automatic approaches. For each category, every approach is described in terms of its technical features, followed by an analysis listing the advantages and limitations of the approach. The literature review leads to my proposed approaches, which resolve the Web data extraction problem, i.e. Web data record extraction, Web data alignment and Web data annotation. Each approach is presented in a chapter which includes the methodology, experiment and related work. The last chapter concludes the thesis.
APA, Harvard, Vancouver, ISO, and other styles
9

Смілянець, Федір Андрійович. "Екстракція структурованої інформації з множини веб-сторінок." Master's thesis, КПІ ім. Ігоря Сікорського, 2020. https://ela.kpi.ua/handle/123456789/39926.

Full text
Abstract:
Relevance of the research topic. The modern wide Internet is a considerable source of data to be used in scientific and business applications. The ability to extract up-to-date data is frequently crucial for reaching necessary goals; however, modern high-quality solutions to this problem, which use computer vision and other technologies, may be financially demanding to acquire or develop, so solutions that are simple and cheap to develop, maintain and use are necessary. The purpose of the study is to create a software instrument aimed at the extraction of structured data from news websites for use in news trustworthiness classification. The following tasks were outlined and implemented to achieve the aforementioned goal: - Outline existing approaches and analogues in the areas of data extraction and news classification; - Design and develop extraction, preparation and classification algorithms; - Compare the results achieved with the developed extraction algorithm and with an existing software solution, including comparing machine learning accuracies on data from both extractors. The object of the study is the process of text data extraction with subsequent machine learning analysis. The subjects of the study are methods and tools for the extraction and analysis of text data. Scientific novelty of the obtained results. A simple greedy algorithm was created, combining the processes of link discovery and data extraction. The expediency of using simple web data extraction algorithms for composing machine learning datasets was proven. It was also proven that classical machine learning algorithms can achieve results similar to neural networks such as LSTM, and it was shown that such models can work on a bilingual dataset. Publications. Materials related to this study were published in the All-Ukrainian Scientific and Practical Conference of Young Scientists and Students "Information Systems and Management Technologies" (ISTU-2019), "News trustworthiness classification with machine learning".
APA, Harvard, Vancouver, ISO, and other styles
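A minimal sketch of a greedy crawl that combines link discovery with text extraction in a single pass, in the spirit of the thesis above. The seed URL is a placeholder and the fetch step is injected as a function so the sketch stays network-free; none of this is the author's actual code.

```python
# Sketch of a greedy crawl-and-extract loop: links are followed as they are
# discovered, and page text is collected in the same pass.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkAndTextParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links, self.text = [], []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())

def greedy_extract(seed_url, fetch, limit=10):
    """fetch(url) -> HTML string; injected so the sketch stays network-free."""
    queue, seen, corpus = [seed_url], set(), {}
    while queue and len(seen) < limit:
        url = queue.pop()                      # greedy: follow the newest link first
        if url in seen:
            continue
        seen.add(url)
        parser = LinkAndTextParser()
        parser.feed(fetch(url))
        corpus[url] = " ".join(parser.text)
        queue.extend(urljoin(url, href) for href in parser.links)
    return corpus

if __name__ == "__main__":
    pages = {"http://news.example/": '<a href="/a">A</a> front page',
             "http://news.example/a": "<p>Story text.</p>"}
    print(greedy_extract("http://news.example/", fetch=lambda u: pages.get(u, "")))
```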
10

Hou, Jingyu. "Discovering web page communities for web-based data management." University of Southern Queensland, Faculty of Sciences, 2002. http://eprints.usq.edu.au/archive/00001447/.

Full text
Abstract:
The World Wide Web is a rich source of information and continues to expand in size and complexity. Mainly because the data on the web lacks rigid and uniform data models or schemas, how to effectively and efficiently manage web data and retrieve information is becoming a challenging problem. Discovering web page communities, which capture the features of the web and web-based data to find intrinsic relationships among the data, is one of the effective ways to solve this problem. A web page community is a set of web pages that has its own logical and semantic structures. In this work, we concentrate on web data in web page format and exploit hyperlink information to discover (construct) web page communities. Three main web page communities are studied in this work: the first consists of hub and authority pages, the second is composed of web pages relevant to a given page (URL), and the last is the community with hierarchical cluster structures. For analysing hyperlinks, we establish a mathematical framework, especially a matrix-based framework, to model hyperlinks. Within this mathematical framework, hyperlink analysis is placed on a solid mathematical base and the results are reliable. For the web page community that consists of hub and authority pages, we focus on eliminating noise pages from the concerned page source to obtain another page source of good quality, and in turn improve the quality of web page communities. We propose an innovative noise page elimination algorithm based on the hyperlink matrix model and mathematical operations, especially the singular value decomposition (SVD) of a matrix. The proposed algorithm exploits hyperlink information among the web pages, reveals page relationships at a deeper level, and numerically defines thresholds for noise page elimination. The experimental results show the effectiveness and feasibility of the algorithm. This algorithm could also be used on its own by web-based data management systems to filter unnecessary web pages and reduce the management cost. In order to construct a web page community consisting of pages relevant to a given page (URL), we propose two hyperlink-based relevant page finding algorithms. The first algorithm comes from extended co-citation analysis of web pages. It is intuitive and easy to implement. The second takes advantage of linear algebra theories to reveal deeper relationships among the web pages and to identify relevant pages more precisely and effectively. The corresponding page source construction for these two algorithms can prevent the results from being affected by malicious hyperlinks on the web. The experimental results show the feasibility and effectiveness of the algorithms. The research results could be used to enhance web search by caching the relevant pages for certain searched pages. For the purpose of clustering web pages to construct a community with hierarchical cluster structures, we propose an innovative web page similarity measurement that incorporates hyperlink transitivity and page importance (weight). Based on this similarity measurement, two types of hierarchical web page clustering algorithms are proposed. The first is an improvement of the conventional K-means algorithms. It is effective in improving page clustering, but is sensitive to the predefined similarity thresholds for clustering. The other type is the matrix-based hierarchical algorithm. Two algorithms of this type are proposed in this work. One takes cluster overlapping into consideration, the other does not. The matrix-based algorithms do not require predefined similarity thresholds for clustering, are independent of the order in which the pages are presented, and produce stable clustering results. They exploit intrinsic relationships among web pages within a uniform matrix framework, avoid much of the influence of human interference in the clustering procedure, and are easy to implement in applications. The experiments show the effectiveness of the new similarity measurement and the proposed algorithms in improving web page clustering. To apply the above mathematical algorithms better in practice, we generalize web page discovery as a special case of information retrieval and present a visualization system prototype, as well as technical details of the visualization algorithm design, to support information retrieval based on linear algebra. The visualization algorithms could be smoothly applied to web applications. XML is a new standard for data representation and exchange on the Internet. In order to extend our research to cover this important web data, we propose an object representation model (ORM) for XML data. A set of transformation rules and algorithms is established to transform XML data (DTD and XML documents with or without a DTD) into this model. This model encapsulates elements of XML data and data manipulation methods. A DTD-Tree is also defined to describe the logical structure of a DTD; it can also be used as an application program interface (API) for processing a DTD, such as transforming a DTD document into the ORM. With this data model, the semantic meanings of the tags (elements) in XML data can be used for further research in XML data management and information retrieval, such as community construction for XML data.
APA, Harvard, Vancouver, ISO, and other styles
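The matrix view of hyperlinks used in the thesis above can be sketched numerically: build an adjacency matrix, factorize it with an SVD, and read hub/authority-style scores off the leading singular vectors. Thresholding low-scoring pages as "noise" is an illustrative simplification of the thesis's elimination algorithm, and the toy matrix and threshold are assumptions.

```python
# Hedged numerical sketch of SVD-based hyperlink analysis on a tiny toy web.
import numpy as np

# adjacency[i, j] = 1 if page i links to page j
adjacency = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
], dtype=float)

U, s, Vt = np.linalg.svd(adjacency)
hub_scores = np.abs(U[:, 0])         # strength as a link source
authority_scores = np.abs(Vt[0, :])  # strength as a link target

threshold = 0.1                      # assumed cut-off for illustration only
candidate_noise = [i for i, a in enumerate(authority_scores) if a < threshold]
print("hub scores:      ", np.round(hub_scores, 3))
print("authority scores:", np.round(authority_scores, 3))
print("possible noise pages:", candidate_noise)
```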

Books on the topic "Web page data extraction"

1

Palade, Vasile, ed. Adaptive web sites: A knowledge extraction from web data approach. Amsterdam: IOS Press, 2008.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
2

Developments in data extraction, management, and analysis. Hershey, PA: Information Science Reference, 2012.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
3

McFedries, Paul, ed. The complete idiot's guide to creating a Web page. 4th ed. Indianapolis, Ind: Que, 1999.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
4

The complete idiot's guide to creating a Web page. 5th ed. Indianapolis, IN: Alpha, 2002.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
5

McFedries, Paul. The complete idiot's guide to creating a Web page. 4th ed. Indianapolis, Ind: Que, 2000.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
6

Explorer's guide to the Semantic Web. Greenwich: Manning, 2004.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
7

Shroff, Gautam. The Intelligent Web. Oxford University Press, 2013. http://dx.doi.org/10.1093/oso/9780199646715.001.0001.

Full text
Abstract:
As we use the Web for social networking, shopping, and news, we leave a personal trail. These days, linger over a Web page selling lamps, and they will turn up at the advertising margins as you move around the Internet, reminding you, tempting you to make that purchase. Search engines such as Google can now look deep into the data on the Web to pull out instances of the words you are looking for. And there are pages that collect and assess information to give you a snapshot of changing political opinion. These are just basic examples of the growth of "Web intelligence", as increasingly sophisticated algorithms operate on the vast and growing amount of data on the Web, sifting, selecting, comparing, aggregating, correcting; following simple but powerful rules to decide what matters. While original optimism for Artificial Intelligence declined, this new kind of machine intelligence is emerging as the Web grows ever larger and more interconnected. Gautam Shroff takes us on a journey through the computer science of search, natural language, text mining, machine learning, swarm computing, and semantic reasoning, from Watson to self-driving cars. This machine intelligence may even mimic at a basic level what happens in the brain.
APA, Harvard, Vancouver, ISO, and other styles
8

Jamaludin, Zulikha, and Wan Hussain Wan Ishak. Do it Yourself: Bina Laman Sesawang Statik & Dinamik. UUM Press, 2010. http://dx.doi.org/10.32890/9789675311314.

Full text
Abstract:
This book gives readers basic guidance on how to build their own website (Do It Yourself, DIY). Readers are trained to carry out the activities themselves, starting from the basic level, through the intermediate level, to the advanced level, using Microsoft FrontPage, JavaScript, Active Server Pages (ASP) and the Microsoft Access database software. In addition, each activity and step, listed in sequence, helps readers build both static (informational) and interactive (dynamic) websites (homepages). Readers are then introduced to basic tutorials on the theory and concepts of modern computing, namely the Internet, the World Wide Web (WWW) and the Hypertext Markup Language (HTML). The combination of these technologies allows information to be accessed and disseminated across borders. Finally, readers are introduced to the last component of building a dynamic website, namely the development and manipulation of databases. It is hoped that this book will serve as a guide not only for website developers but also for anyone who wishes to design and develop dynamic and static websites.
APA, Harvard, Vancouver, ISO, and other styles

Book chapters on the topic "Web page data extraction"

1

Kravchenko, Andrey, Ruslan R. Fayzrakhmanov, and Emanuel Sallinger. "Web Page Representations and Data Extraction with BERyL." In Current Trends in Web Engineering, 22–30. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-030-03056-8_3.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Grigalis, Tomas, Lukas Radvilavičius, Antanas Čenys, and Juozas Gordevičius. "Clustering Visually Similar Web Page Elements for Structured Web Data Extraction." In Lecture Notes in Computer Science, 435–38. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012. http://dx.doi.org/10.1007/978-3-642-31753-8_38.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Chang, Chia-Hui, Yen-Ling Lin, Kuan-Chen Lin, and Mohammed Kayed. "Page-Level Wrapper Verification for Unsupervised Web Data Extraction." In Lecture Notes in Computer Science, 454–67. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013. http://dx.doi.org/10.1007/978-3-642-41230-1_38.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Hu, Dongdong, and Xiaofeng Meng. "Automatic Data Extraction from Data-Rich Web Pages." In Database Systems for Advanced Applications, 828–39. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005. http://dx.doi.org/10.1007/11408079_75.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Palekar, Vikas R. "A Visual Based Page Segmentation for Deep Web Data Extraction." In Advances in Intelligent and Soft Computing, 791–804. New Delhi: Springer India, 2012. http://dx.doi.org/10.1007/978-81-322-0491-6_72.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Carchiolo, Vincenza, Alessandro Longheu, and Michele Malgeri. "Extraction of Hidden Semantics from Web Pages." In Intelligent Data Engineering and Automated Learning — IDEAL 2002, 117–22. Berlin, Heidelberg: Springer Berlin Heidelberg, 2002. http://dx.doi.org/10.1007/3-540-45675-9_20.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Chang, Chia-Hui, Shih-Chien Kuo, Kuo-Yu Hwang, Tsung-Hsin Ho, and Chih-Lung Lin. "Automatic Information Extraction for Multiple Singular Web Pages." In Advances in Knowledge Discovery and Data Mining, 297–303. Berlin, Heidelberg: Springer Berlin Heidelberg, 2002. http://dx.doi.org/10.1007/3-540-47887-6_29.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Lassri, Safae, El Habib Benlahmar, and Abderrahim Tragha. "Web Page Classification Based on an Accurate Technique for Key Data Extraction." In Advanced Intelligent Systems for Sustainable Development (AI2SD’2020), 1124–31. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-030-90639-9_91.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Kolla, Bhanu Prakash, and Arun Raja Raman. "Data Engineered Content Extraction Studies for Indian Web Pages." In Advances in Intelligent Systems and Computing, 505–12. Singapore: Springer Singapore, 2018. http://dx.doi.org/10.1007/978-981-10-8055-5_45.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Li, Long, Dandan Song, and Lejian Liao. "Vertical Classification of Web Pages for Structured Data Extraction." In Information Retrieval Technology, 486–95. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012. http://dx.doi.org/10.1007/978-3-642-35341-3_44.

Full text
APA, Harvard, Vancouver, ISO, and other styles

Conference papers on the topic "Web page data extraction"

1

Kayed, Mohammed, Chia-Hui Chang, Khaled Shaalan, and Moheb Ramzy Girgis. "FiVaTech: Page-Level Web Data Extraction from Template Pages." In 2007 Seventh IEEE International Conference on Data Mining - Workshops (ICDM Workshops). IEEE, 2007. http://dx.doi.org/10.1109/icdmw.2007.95.

Full text
APA, Harvard, Vancouver, ISO, and other styles
2

Győrödi, Cornelia, Robert Győrödi, Mihai Cornea, and George Pecherle. "Automated internal web page clustering for improved data extraction." In the 2nd International Conference. New York, New York, USA: ACM Press, 2012. http://dx.doi.org/10.1145/2254129.2254209.

Full text
APA, Harvard, Vancouver, ISO, and other styles
3

Xingyi Li, Yanyan Kong, and Huaji Shi. "Web page repetitive structure and URL feature based Deep Web data extraction." In 2010 Second International Conference on Communication Systems, Networks and Applications (ICCSNA). IEEE, 2010. http://dx.doi.org/10.1109/iccsna.2010.5588744.

Full text
APA, Harvard, Vancouver, ISO, and other styles
4

Yang, Jufeng, Guangshun Shi, Yan Zheng, and Qingren Wang. "Data Extraction from Deep Web Pages." In 2007 International Conference on Computational Intelligence and Security (CIS 2007). IEEE, 2007. http://dx.doi.org/10.1109/cis.2007.39.

Full text
APA, Harvard, Vancouver, ISO, and other styles
5

Hong-ping, Chen, Fang Wei, Yang Zhou, Zhuo Lin, and Cui Zhi-Ming. "Automatic Data Records Extraction from List Page in Deep Web Sources." In 2009 Asia-Pacific Conference on Information Processing, APCIP. IEEE, 2009. http://dx.doi.org/10.1109/apcip.2009.100.

Full text
APA, Harvard, Vancouver, ISO, and other styles
6

Wang, Yun, Bicheng Li, and Chen Lin. "Data extraction from Web forums based on similarity of page layout." In 2009 International Conference on Natural Language Processing and Knowledge Engineering (NLP-KE). IEEE, 2009. http://dx.doi.org/10.1109/nlpke.2009.5313736.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Gong, Jibing, Xiaomeng Kou, Hanyun Zhang, Jiquan Peng, Shishan Gong, and Shuli Wang. "Automatic web page data extraction through MD5 trigeminal tree and improved BIRCH." In International Conference on Electronic Information Engineering, Big Data, and Computer Technology (EIBDCT 2022), edited by Xuexia Ye and Guoqiang Zhong. SPIE, 2022. http://dx.doi.org/10.1117/12.2635678.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Guo, Jinsong, Valter Crescenzi, Tim Furche, Giovanni Grasso, and Georg Gottlob. "RED: Redundancy-Driven Data Extraction from Result Pages?" In The World Wide Web Conference. New York, New York, USA: ACM Press, 2019. http://dx.doi.org/10.1145/3308558.3313529.

Full text
APA, Harvard, Vancouver, ISO, and other styles
9

Zhang, Mingzhu, Zhongguo Yang, Sikandar Ali, and Weilong Ding. "Web Page Information Extraction Service Based on Graph Convolutional Neural Network and Multimodal Data Fusion." In 2021 IEEE International Conference on Web Services (ICWS). IEEE, 2021. http://dx.doi.org/10.1109/icws53863.2021.00094.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Hui Song, Suraj Giri, and Fanyuan Ma. "Data extraction and annotation for dynamic Web pages." In IEEE International Conference on e-Technology, e-Commerce and e-Service, 2004. EEE '04. 2004. IEEE, 2004. http://dx.doi.org/10.1109/eee.2004.1287353.

Full text
APA, Harvard, Vancouver, ISO, and other styles

Reports on the topic "Web page data extraction"

1

Chang, Kevin C., Truman Shuck, and Govind Kabra. Web-Scale Search-Based Data Extraction and Integration. Fort Belvoir, VA: Defense Technical Information Center, October 2011. http://dx.doi.org/10.21236/ada554205.

Full text
APA, Harvard, Vancouver, ISO, and other styles