Journal articles on the topic 'Web page data extraction'

Consult the top 50 journal articles for your research on the topic 'Web page data extraction.'

1

Ahmad Sabri, Ily Amalina, and Mustafa Man. "Improving Performance of DOM in Semi-structured Data Extraction using WEIDJ Model." Indonesian Journal of Electrical Engineering and Computer Science 9, no. 3 (March 1, 2018): 752. http://dx.doi.org/10.11591/ijeecs.v9.i3.pp752-763.

Abstract:
Web data extraction is the process of extracting user-required information from web pages. The information consists of semi-structured data rather than data in a structured format, and the extraction involves web documents in HTML. Nowadays most people use web data extractors, because the extraction involves so much information that manual extraction is slow and complicated. We present in this paper the WEIDJ approach for extracting images from the web, whose goal is to harvest images as objects from template-based HTML pages. WEIDJ (Web Extraction of Images using the DOM (Document Object Model) and JSON (JavaScript Object Notation)) applies DOM theory to build the structure and uses JSON as the programming environment. The extraction process takes as input both a web address and an extraction structure. WEIDJ then splits the DOM tree into small subtrees and applies a search algorithm over the visual blocks of each web page to find images. Our approach focuses on three levels of extraction: a single web page, multiple web pages, and a whole web site. Extensive experiments on several biodiversity web pages compare the time performance of image extraction using DOM, JSON, and WEIDJ on a single web page. The experimental results show that, with our model, WEIDJ image extraction can be done quickly and effectively.
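For readers who want a concrete picture of the DOM-plus-JSON combination described above, the minimal sketch below walks a page's DOM, records each image together with the tag path of the subtree it sits in, and emits the records as JSON. It is an illustration under assumed names (the URL, the `harvest_images` helper, and the output fields are ours), not the authors' WEIDJ implementation.

```python
# A minimal sketch of the DOM-to-JSON image-harvesting idea, not the
# authors' WEIDJ code: walk the DOM, record each image with the tag
# path of the subtree containing it, and emit the result as JSON.
import json
import urllib.request

from bs4 import BeautifulSoup  # pip install beautifulsoup4


def harvest_images(url: str) -> str:
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for img in soup.find_all("img"):
        # The tag path identifies the DOM subtree the image sits in.
        path = "/".join(p.name for p in reversed(list(img.parents))
                        if p.name != "[document]")
        records.append({"src": img.get("src"),
                        "alt": img.get("alt", ""),
                        "path": path})
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    print(harvest_images("https://example.com"))  # any template-based page
```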
2

Ahamed, B. Bazeer, D. Yuvaraj, S. Shitharth, Olfat M. Mizra, Aisha Alsobhi, and Ayman Yafoz. "An Efficient Mechanism for Deep Web Data Extraction Based on Tree-Structured Web Pattern Matching." Wireless Communications and Mobile Computing 2022 (May 27, 2022): 1–10. http://dx.doi.org/10.1155/2022/6335201.

Abstract:
The World Wide Web comprises huge web databases whose data are searched through web query interfaces. Generally, the World Wide Web maintains a set of databases that store many data records, and the distinct data records are extracted by the web query interface according to user requests. The information maintained in web databases is hidden, and deep web content is retrieved even from dynamic script pages. Today a web page offers a large amount of structured data needed by various modern web applications, and the challenge lies in extracting complicated structured data from deep web pages. Deep web contents are generally accessed through web queries, but extracting structured data from the web database is a complex problem; moreover, making use of such retrieved information in combined structures requires significant effort. Few established techniques address the complexity of extracting deep web data from various web pages: although several methods for deep web data extraction have been offered, very little research addresses template-related issues at the page level. For effective web data extraction over a large number of online pages, a unique representation of page generation using tree-based pattern matching (TBPM) is proposed. The performance of the proposed TBPM technique is compared to that of existing techniques in terms of relativity, precision, recall, and time consumption; in particular, relativity gains of about 17-26% are achieved compared with the FiVaTech approach.
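The paper's TBPM algorithm is not spelled out in the abstract, but the flavor of tree-structured pattern matching can be shown with a toy score that recursively compares the tags and children of two DOM subtrees; the record shapes below are invented for illustration.

```python
# A toy tree-match score between two DOM subtrees, represented as
# (tag, [children]) tuples. A simplification for illustration only.
def tree_similarity(a, b):
    tag_a, kids_a = a
    tag_b, kids_b = b
    score = 1.0 if tag_a == tag_b else 0.0
    for child_a, child_b in zip(kids_a, kids_b):
        score += tree_similarity(child_a, child_b)
    return score

# Two candidate result records with slightly different structure.
record1 = ("tr", [("td", []), ("td", []), ("td", [])])
record2 = ("tr", [("td", []), ("td", [])])
print(tree_similarity(record1, record2))  # 3.0
```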
3

Ahmad Sabri, Ily Amalina, and Mustafa Man. "A deep web data extraction model for web mining: a review." Indonesian Journal of Electrical Engineering and Computer Science 23, no. 1 (July 1, 2021): 519. http://dx.doi.org/10.11591/ijeecs.v23.i1.pp519-528.

Abstract:
The World Wide Web has become a large pool of information, and extracting structured data from published web pages has drawn attention over the last decade. The process of web data extraction (WDE) faces many challenges, due to the variety of web data and the unstructured data in hypertext markup language (HTML) files. The aim of this paper is to provide a comprehensive overview of current web data extraction techniques in terms of the quality of the extracted data. This paper focuses on data extraction using wrapper approaches and compares them with one another to identify the best approach for extracting data from online sites. To observe the efficiency of the proposed model, we compare the performance of single-web-page data extraction across different models: the document object model (DOM), the wrapper using hybrid DOM and JSON (WHDJ), the wrapper extraction of images using DOM and JSON (WEIDJ), and WEIDJ (no-rules). Finally, the experiments showed that WEIDJ extracts data fastest and with the lowest time consumption compared to the other proposed methods.
4

Liu, Hong, and Yin Xiao Ma. "Web Data Extraction Research Based on Wrapper and XPath Technology." Advanced Materials Research 271-273 (July 2011): 706–12. http://dx.doi.org/10.4028/www.scientific.net/amr.271-273.706.

Abstract:
To satisfy users' varied needs, many websites consist of pages that are dynamically generated by populating a common template with data, such as product description pages on e-commerce sites. This paper merges wrapper technology with XPath to form a dependable, robust process for web data extraction. Validating the method in several experiments, we find that it extracts list pages with high efficiency.
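As a small illustration of the wrapper-plus-XPath combination, the sketch below binds field names to XPath rules and replays them over the repeated rows of a template-generated list page; the HTML snippet and rules are our own assumptions, not the paper's.

```python
# A minimal wrapper-plus-XPath extraction step; the page snippet and
# the XPath rules are illustrative assumptions.
from lxml import html  # pip install lxml

PAGE = """
<ul id="products">
  <li><span class="name">Widget A</span><span class="price">9.99</span></li>
  <li><span class="name">Widget B</span><span class="price">4.50</span></li>
</ul>
"""

tree = html.fromstring(PAGE)
# The "wrapper" is a set of XPath rules bound to field names.
wrapper = {
    "name": ".//span[@class='name']/text()",
    "price": ".//span[@class='price']/text()",
}

for row in tree.xpath("//ul[@id='products']/li"):
    record = {field: row.xpath(rule)[0] for field, rule in wrapper.items()}
    print(record)
```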
5

Ibrahim, Nadia, Alaa Hassan, and Marwah Nihad. "Big Data Analysis of Web Data Extraction." International Journal of Engineering & Technology 7, no. 4.37 (December 13, 2018): 168. http://dx.doi.org/10.14419/ijet.v7i4.37.24095.

Abstract:
This study examines techniques for extracting large data sets, including the detection of patterns and hidden relationships among numerous factors and the retrieval of the required information. Rapid analysis of massive data can lead to innovation and to concepts of theoretical value. Compared with mining traditional data sets, mining vast amounts of large, heterogeneous, interdependent data can expand knowledge and ideas about the target domain. In this research we studied data mining on the Internet: the various networks used to extract data from different, sometimes complex, locations, and the web technology used for extraction and data analysis (Marwah et al., 2016). We extracted information from large numbers of web pages, examined the pages of a site using Java code, and added the extracted information to a database built for the web pages. We used a data network function to evaluate and categorize the pages found, identifying trusted or risky web pages, and exported the data to a CSV file. These data were then examined and categorized using WEKA to obtain accurate results. We concluded from the results that the applied data mining algorithms outperform other techniques in the classification and extraction of data, with high performance.
6

Kayed, Mohammed, and Chia-Hui Chang. "FiVaTech: Page-Level Web Data Extraction from Template Pages." IEEE Transactions on Knowledge and Data Engineering 22, no. 2 (February 2010): 249–63. http://dx.doi.org/10.1109/tkde.2009.82.

7

Deshmukh, Shilpa, et al. "Efficient Methodology for Deep Web Data Extraction." Turkish Journal of Computer and Mathematics Education (TURCOMAT) 12, no. 1S (April 11, 2021): 286–93. http://dx.doi.org/10.17762/turcomat.v12i1s.1769.

Abstract:
Deep web contents are accessed through queries submitted to web databases, and the returned data records are enwrapped in dynamically generated web pages (called deep web pages in this paper). Extracting structured data from deep web pages is a challenging problem due to the underlying intricate structures of such pages. Until now, a large number of techniques have been proposed to address this problem, but all of them have inherent limitations because they depend on the programming language of the web page. Since the contents of web pages, a popular two-dimensional medium, are always displayed regularly for users to browse, this motivates us to seek a different way to perform deep web data extraction that overcomes the limitations of previous works by utilizing some interesting common visual features of deep web pages. In this paper, a novel vision-based methodology, the Visual Based Deep Web Data Extraction (VBDWDE) algorithm, is proposed. This methodology primarily uses the visual features of deep web pages to perform deep web data extraction, including data record extraction and data item extraction. We also propose a new evaluation measure, revision, to capture the amount of human effort needed to produce perfect extraction. Our experiments on a large set of web databases show that the proposed vision-based methodology is highly effective for deep web data extraction.
8

GAO, XIAOYING, MENGJIE ZHANG, and PETER ANDREAE. "AUTOMATIC PATTERN CONSTRUCTION FOR WEB INFORMATION EXTRACTION." International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 12, no. 04 (August 2004): 447–70. http://dx.doi.org/10.1142/s0218488504002928.

Abstract:
This paper describes a domain independent approach for automatically constructing information extraction patterns for semi-structured web pages. Given a randomly chosen page from a web site of similarly structured pages, the system identifies a region of the page that has a regular "tabular" structure, and then infers an extraction pattern that will match the "rows" of the region and identify the data elements. The approach was tested on three corpora containing a series of tabular web sites from different domains and achieved a success rate of at least 80%. A significant strength of the system is that it can infer extraction patterns from a single training page and does not require any manual labeling of the training page.
9

Patnaik, Sudhir Kumar, and C. Narendra Babu. "Trends in web data extraction using machine learning." Web Intelligence 19, no. 3 (December 16, 2021): 169–90. http://dx.doi.org/10.3233/web-210465.

Abstract:
Web data extraction has seen significant development since its inception in the early nineties. It has evolved from a simple manual way of extracting data from web pages and documents, to automated extraction, to intelligent extraction using machine learning algorithms, tools, and techniques. Data extraction is one of the key components of the end-to-end life cycle of the web data extraction process, which includes navigation, extraction, data enrichment, and visualization. This paper presents the journey of web data extraction over the years, highlighting the evolution of tools, techniques, frameworks, and algorithms for building intelligent web data extraction systems. The paper also throws light on challenges, opportunities for future research, and emerging trends in web data extraction, with a specific focus on machine learning techniques. Both traditional and machine learning approaches to manual and automated web data extraction are tested and the results published, with a few use cases demonstrating the challenges that arise when a website layout changes. The paper introduces novel ideas, such as self-healing capability in web data extraction and proactive error detection upon changes in website layout, as areas of future research. This unique perspective will help readers gain deeper insights into the present and future of web data extraction.
10

Kumaresan, Umamageswari, and Kalpana Ramanujam. "A Framework for Automated Scraping of Structured Data Records From the Deep Web Using Semantic Labeling." International Journal of Information Retrieval Research 12, no. 1 (January 2022): 1–18. http://dx.doi.org/10.4018/ijirr.290830.

Abstract:
The intent of this research is to develop an automated web scraping system capable of extracting structured data records embedded in semi-structured web pages. Most automated extraction techniques in the literature capture the repeated pattern among a set of similarly structured web pages, thereby deducing the template used to generate those pages, before extracting the data records. All of these techniques rely on computationally intensive operations such as string pattern matching or DOM tree matching, and then perform manual labeling of the extracted data records. The technique discussed in this paper departs from these state-of-the-art approaches by determining informative sections in the web page through repetition of informative content rather than syntactic structure. The experiments show that the system identifies data-rich regions with 100% precision for web sites belonging to different domains. Experiments conducted on real-world web sites demonstrate the effectiveness and versatility of the proposed approach.
11

Hong, Xudong, Tao Shen, Longhua Shen, Zhengtao Yu, and Jianyi Guo. "Unstructured data extraction of Chinese expert web page." International Journal of Wireless and Mobile Computing 7, no. 2 (2014): 132. http://dx.doi.org/10.1504/ijwmc.2014.059709.

12

Zhang, Zuping, Jing Zhao, and Xiping Yan. "A Web Page Clustering Method Based on Formal Concept Analysis." Information 9, no. 9 (September 6, 2018): 228. http://dx.doi.org/10.3390/info9090228.

Abstract:
Web page clustering is an important technology for sorting network resources. By extraction and clustering based on web page similarity, a large amount of information on a web page can be organized effectively. In this paper, after describing the extraction of web feature words, calculation methods for weighting feature words are studied in depth. Taking web pages as objects and web feature words as attributes, a formal context is constructed for formal concept analysis. An algorithm for constructing a concept lattice based on cross data links is proposed and successfully applied, so web pages can be clustered using the concept lattice hierarchy. Experimental results indicate that the proposed algorithm outperforms previous competitors with regard to time consumption and clustering effect.
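The formal-context construction at the heart of this approach is easy to illustrate: pages are objects, feature words are attributes, and a formal concept is a fixed point of the two derivation operators. The tiny context below is an invented example, not the paper's data.

```python
# A small formal-concept-analysis sketch: pages as objects, feature
# words as attributes. The context itself is an invented example.
context = {
    "page1": {"cloud", "storage"},
    "page2": {"cloud", "compute"},
    "page3": {"storage", "backup"},
}

def common_attributes(pages):
    """Derivation: attributes shared by every page in the set."""
    sets = [context[p] for p in pages]
    return set.intersection(*sets) if sets else set()

def pages_having(attrs):
    """Derivation: pages possessing every attribute in the set."""
    return {p for p, words in context.items() if attrs <= words}

# A formal concept is an (extent, intent) pair closed under both maps.
extent = pages_having({"cloud"})
intent = common_attributes(extent)
print(sorted(extent), sorted(intent))  # ['page1', 'page2'] ['cloud']
```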
13

Deshmukh, Shilpa, P. P. Karde, and V. R. Thakare. "An Improved Approach for Deep Web Data Extraction." ITM Web of Conferences 40 (2021): 03045. http://dx.doi.org/10.1051/itmconf/20214003045.

Abstract:
The World Wide Web is a valuable source of information that contains data in a wide range of formats, and the varied formats of web pages act as a barrier to automated processing. Many business organizations require data from the World Wide Web for analytical tasks such as business intelligence, product intelligence, competitive intelligence, decision making, opinion mining, and sentiment analysis. Manual extraction is laborious, which has driven the need for an automated extraction process. In this paper, an approach called ADWDE is proposed; it is essentially founded on heuristic methods. The purpose of this research is to design an Automated Web Data Extraction System (AWDES) that can recognize the target of data extraction with a minimal amount of human intervention, using semantic labeling, and perform extraction at an acceptable level of accuracy. In an AWDES, there always exists a trade-off between the degree of human intervention and accuracy. The goal of this study is to reduce the degree of human intervention while providing accurate extraction results regardless of the business domain to which the web page belongs.
14

Liu, Wen Tao. "Web Page Data Collection Based on Multithread." Applied Mechanics and Materials 347-350 (August 2013): 2575–79. http://dx.doi.org/10.4028/www.scientific.net/amm.347-350.2575.

Abstract:
Web data collection is the process of gathering, by means of a crawler, the semi-structured, large-scale, and redundant data of the web, including web content, web structure, and web usage; it is often used for information extraction, information retrieval, search engines, and web data mining. In this paper, the principle of web data collection is introduced and some related topics are discussed, such as page download, character encoding, update strategy, and static versus dynamic pages. Multithreading technology is described, and a multithreaded mode for web data collection is proposed. Web data collection with multithreading achieves better resource utilization, better average response time, and better performance.
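A minimal version of the multithreaded collection mode is sketched below with a thread pool, so that slow responses do not serialize the crawl; the URL list and worker count are arbitrary assumptions.

```python
# A minimal multithreaded page-collection sketch using a thread pool.
# The URL list and max_workers value are arbitrary assumptions.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

URLS = [
    "https://example.com/",
    "https://example.org/",
    "https://example.net/",
]

def fetch(url: str) -> tuple[str, int]:
    with urllib.request.urlopen(url, timeout=10) as response:
        return url, len(response.read())

with ThreadPoolExecutor(max_workers=8) as pool:
    # map() keeps results in input order while downloads run concurrently.
    for url, size in pool.map(fetch, URLS):
        print(url, size, "bytes")
```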
15

Ezzikouri, Hanane, Mohamed Fakir, Cherki Daoui, and Mohamed Erritali. "Extracting Knowledge from Web Data." Journal of Information Technology Research 7, no. 4 (October 2014): 27–41. http://dx.doi.org/10.4018/jitr.2014100103.

Abstract:
User behavior on a website triggers a sequence of queries whose result is the display of certain pages. Information about these queries (including the names of the requested resources and the responses from the web server) is stored in a text file called a log file, and analysis of server log files can provide significant and useful information. Web mining is the extraction of interesting and potentially useful patterns and implicit information from artifacts or activity related to the World Wide Web. Web usage mining, a main research area within web mining, focuses on learning about web users and their interactions with web sites. The motive of mining is to find users' access models automatically and quickly from the vast web log file, such as frequent access paths, frequent access page groups, and user clusters. Through web usage mining, much of the information left behind by user accesses can be mined, providing a foundation for organizational decision making. The process of web mining, defined as the set of techniques designed to explore, process, and analyze large masses of consecutive information activities on the Internet, has three main steps: data preprocessing, extraction of usage patterns, and interpretation of the results. This paper starts by presenting the different formats of web log files, then presents the different preprocessing methods that have been used, and finally presents a system for web content and usage mining for web data extraction and web site analysis using the data mining algorithms Apriori, FP-Growth, K-Means, KNN, and ID3.
16

Chen, Guangxuan, Guangxiao Chen, Lei Zhang, and Qiang Liu. "An Incremental Acquisition Method for Web Forensics." International Journal of Digital Crime and Forensics 13, no. 6 (November 2021): 1–13. http://dx.doi.org/10.4018/ijdcf.2021110116.

Abstract:
In order to solve the problems of repeated acquisition, data redundancy, and low efficiency in website forensics, this paper proposes an incremental acquisition method oriented to dynamic websites. The method realizes incremental collection of dynamically updated websites through acquiring and parsing web pages, URL deduplication, web page denoising, web page content extraction, and hashing. Experiments show that the algorithm achieves relatively high acquisition precision and recall, and can be combined with other data to perform effective digital forensics on dynamically updated, real-time websites.
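The URL-deduplication and hashing steps can be pictured with the short sketch below, which stores a digest of each page's content and skips anything already collected; the helper name and URLs are illustrative assumptions, not the paper's code.

```python
# A minimal sketch of incremental acquisition: deduplicate URLs, then
# hash page content and skip unchanged pages on later crawls.
import hashlib

seen_urls: set[str] = set()
seen_digests: set[str] = set()

def is_new(url: str, page_text: str) -> bool:
    """Return True if this page should be stored, False if redundant."""
    if url in seen_urls:
        return False                      # URL deduplication
    seen_urls.add(url)
    digest = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    if digest in seen_digests:
        return False                      # content unchanged elsewhere
    seen_digests.add(digest)
    return True

print(is_new("https://example.com/news", "breaking story"))   # True
print(is_new("https://example.com/news", "breaking story"))   # False
```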
17

MASSEROLI, MARCO, ANDREA STELLA, MYRIAM ALCALAY, and FRANCESCO PINCIROLI. "GENEWEBEX: GENE ANNOTATION WEB EXTRACTION, AGGREGATION, AND UPDATING FROM WEB-INTERFACED BIOMOLECULAR DATABANKS." International Journal of Software Engineering and Knowledge Engineering 15, no. 03 (June 2005): 511–26. http://dx.doi.org/10.1142/s0218194005002403.

Abstract:
Numerous genomic annotations are currently stored in different Web-accessible databanks that scientists need to mine with user-defined queries and in a batch mode to orderly integrate the diverse extracted data in suitable user-customizable working environments. Unfortunately, to date, most accessible databanks can be interrogated only for a single gene or protein at a time and generally the data retrieved are available in HTML page format only. We developed GeneWebEx to effectively mine data of interest in different HTML pages of Web-interfaced databanks, and organize extracted data for further analyses. GeneWebEx utilizes user-defined templates to identify data to extract, and aggregates and structures them in a database designed to allocate the various extractions from distinct biomolecular databanks. Moreover, a template-based module enables automatic updating of extracted data. Validations performed on GeneWebEx allowed us to efficiently gather relevant annotations from various sources, and comprehensively query them to highlight significant biological characteristics.
18

Swami, Shridevi A., and Pujashree S. Vidap. "Towards Automatic Web Data Scraper and Aligner (WDSA)." INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY 13, no. 3 (April 15, 2014): 4308–18. http://dx.doi.org/10.24297/ijct.v13i3.2762.

Abstract:
The web is an immense and fast-emerging source of information, and web browsers together with search engines have become popular tools for retrieving and accessing it. The enormous growth of the web has made data extraction harder than ever. This paper presents the Automatic Web Data Scraper and Aligner (WDSA). The automatic WDSA extracts the web data of interest from the dynamically generated web page returned by a search engine in response to a user query. Automatic web data scraping is necessary because a human can identify the query-relevant contents of a result page, but this is tricky for computer applications. The extracted web data can then be transformed into a format suitable for applications such as comparison shopping, data integration, and value-added services. WDSA achieves this by aligning the extracted web data both pairwise and holistically in a table. The novelty of the automatic WDSA is that the scraper and aligner use a new approach that combines the similarity of both tags and values for the extraction and alignment processes. The scraper also handles data that appear non-contiguously due to auxiliary information such as advertisement banners, navigational links, and pop-ups. Experimental results show that the automatic WDSA achieves high precision and recall. The automatic WDSA is further compared with widely used tools such as Helium Scraper, OutWit Hub, and Screen Scraper; during the comparison we observed that existing tools require manual labeling or explicit extraction patterns for the desired data, whereas the automatic WDSA requires no user involvement, making it fully automatic.
19

Cohen, William W., and Wei Fan. "Learning page-independent heuristics for extracting data from Web pages." Computer Networks 31, no. 11-16 (May 1999): 1641–52. http://dx.doi.org/10.1016/s1389-1286(99)00047-x.

20

Li, Gui, Zi Yang Han, Zhao Xin Chen, Zheng Yu Li, and Ping Sun. "Web Data Extraction and Integration in Domain." Advanced Materials Research 756-759 (September 2013): 1585–89. http://dx.doi.org/10.4028/www.scientific.net/amr.756-759.1585.

Abstract:
The purpose of web data extraction and integration is to provide domain-oriented, value-added services. Based on the requirements of the domain and the features of web page data, this paper proposes a web data schema and a domain data model. It also puts forward web table positioning and web table record extraction based on the web data schema, together with an integration algorithm based on the main data model. Experimental results show the effectiveness of the proposed algorithm and model.
21

FERNÁNDEZ-VILLAMOR, JOSÉ IGNACIO, CARLOS ÁNGEL IGLESIAS, and MERCEDES GARIJO. "FIRST-ORDER LOGIC RULE INDUCTION FOR INFORMATION EXTRACTION IN WEB RESOURCES." International Journal on Artificial Intelligence Tools 21, no. 06 (December 2012): 1250032. http://dx.doi.org/10.1142/s0218213012500327.

Abstract:
Information extraction from web pages, commonly known as screen scraping, is usually performed through wrapper induction, a technique that is based on the internal structure of HTML documents. As such, the main limitation of these kinds of techniques is that a generated wrapper is only useful for the web page it was designed for. To overcome this, this paper proposes a system that generates first-order logic rules that can be used to extract data from web pages. These rules are based on visual features such as font size, element positioning, or type of content; they therefore do not depend on a document's internal structure and are able to work on different sites. The system has been validated on a set of different web pages, showing very high precision and good recall, which confirms the robustness and generalization capabilities of the approach.
22

Wei, Li, Ling Zhang, Hua Mei Li, and Xiao Zhou Chen. "Chinese Web Page Classification Based on Vector Space Model." Advanced Materials Research 846-847 (November 2013): 1801–4. http://dx.doi.org/10.4028/www.scientific.net/amr.846-847.1801.

Abstract:
Chinese web page classification is considered a hot research area in data mining. In this paper, a Chinese web page classification algorithm based on the vector space model is proposed. The algorithm uses supervised machine learning to implement a web page classifier; it combines text-frequency methods for feature extraction and improves the traditional TF-IDF weighting formula. The results show that the classifier is feasible and effective.
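The vector-space pipeline the abstract describes, TF-IDF features feeding a supervised learner, looks roughly like the sketch below; the tiny corpus, labels, and choice of naive Bayes are our assumptions for illustration.

```python
# A minimal vector-space page classifier: TF-IDF features plus a
# supervised learner. Corpus, labels, and model choice are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

pages = [
    "stock market shares investment bank",
    "football match goal league player",
    "bank loan interest rate credit",
    "tournament player score team win",
]
labels = ["finance", "sports", "finance", "sports"]

classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
classifier.fit(pages, labels)
print(classifier.predict(["credit market interest"]))  # ['finance']
```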
23

Wang, Yuan Long, Hong Jiang, Zhao Hong Bing, and Li Zhang. "A Method of Web Information Extraction Based on Building Different Sub Trees." Advanced Materials Research 694-697 (May 2013): 2513–21. http://dx.doi.org/10.4028/www.scientific.net/amr.694-697.2513.

Abstract:
When extracting web information, most researchers mix the structure labels of the DOM tree with the text content. To solve this problem, we put forward a method of automatic web information extraction. First, we obtain a set of DOM subtrees by partitioning the DOM tree of the web page. Second, the nodes of all DOM subtrees are assigned corresponding weights by the method this paper proposes; based on this, we obtain each set of differing subtrees by comparing DOM subtrees that come from the same data source and belong to the same category. Third, we obtain the data zone containing the information to extract by computing the similarity of every two DOM subtrees in the set of differing subtrees. Finally, the node path of every DOM subtree in the data zone is taken as the extraction rule used to automatically extract information from new web pages of the same category. Experiments demonstrate higher precision and recall rates; meanwhile, this method saves the time users spend filtering information.
24

Yesuraju, P. "A LANGUAGE INDEPENDENT WEB DATA EXTRACTION USING VISION BASED PAGE SEGMENTATION ALGORITHM." International Journal of Research in Engineering and Technology 02, no. 04 (April 25, 2013): 635–39. http://dx.doi.org/10.15623/ijret.2013.0204040.

25

Chen, Ke Rui, Wan Li Zuo, Fan Zhang, and Feng Lin He. "Extracting Data Records Based on Global Schema." Applied Mechanics and Materials 20-23 (January 2010): 553–58. http://dx.doi.org/10.4028/www.scientific.net/amm.20-23.553.

Abstract:
With the rapid increase of web data, the deep web is the fastest-growing carrier of web data. Therefore, research on the deep web, especially on extracting data records from result pages, has become an urgent task. We present a data record extraction method based on a global schema, which automatically extracts query result records from web pages. The method first analyzes the query interface and result record instances to build a global schema using an ontology. The global schema is then used in extracting data records from result pages and storing the data in a table. Experimental results indicate that this method extracts data records accurately and saves them in a table conforming to the global schema.
26

Ghule, Sayalee. "Log File Data Extraction or Mining." International Journal for Research in Applied Science and Engineering Technology 9, no. VI (June 30, 2021): 4802–6. http://dx.doi.org/10.22214/ijraset.2021.35833.

Abstract:
Log files contain data such as the client name, IP address, time stamp, access request, number of bytes transferred, result status, referring URL, and user agent, and they are maintained by web servers. Analyzing these log files gives a clear picture of the user. The World Wide Web is a vast store of web pages that provides its users with piles of information, and with the increase in the number and complexity of websites, the size of the web has grown enormously. Web usage mining is a branch of web mining that applies mining techniques to web server logs in order to extract the behavior of users. Log files contain essential information about the execution of a system; this information is often used for debugging, operational profiling, finding anomalies, detecting security threats, and measuring performance.
27

YANG, Shao-Hua. "Automatic Data Extraction from Template-Generated Web Pages." Journal of Software 19, no. 2 (July 9, 2008): 209–23. http://dx.doi.org/10.3724/sp.j.1001.2008.00209.

28

Dong, Yong Quan, Xiang Jun Zhao, and Gong Jie Zhang. "Web Data Extraction with Hierarchical Clustering and Rich Features." Applied Mechanics and Materials 55-57 (May 2011): 1003–8. http://dx.doi.org/10.4028/www.scientific.net/amm.55-57.1003.

Abstract:
A novel approach is proposed to automatically extract data records from detail pages using hierarchical clustering techniques. The approach uses information from the listing pages to identify the content blocks in detail pages, which narrows the scope of web data extraction. It also makes full use of structure and content features to cluster content feature vectors. Finally, it aligns the data elements of multiple detail pages to extract the data records. Experimental results on test beds of real web pages show that the approach achieves high extraction accuracy and substantially outperforms existing techniques.
29

Shu, Zhinian, and Xiaorong Li. "Automatic Extraction of Web Page Text Information Based on Network Topology Coincidence Degree." Wireless Communications and Mobile Computing 2022 (March 11, 2022): 1–10. http://dx.doi.org/10.1155/2022/9220661.

Abstract:
To extract web page text effectively, an automatic extraction method for web text information based on network topology coincidence degree is proposed. Search engines, web crawlers, and hypertext tags are used to classify web text information, which then undergoes dimensionality reduction. After processing, the similarity of different features of the web page text is calculated and sorted, and similar text information is extracted according to correlation based on segment estimation. The experimental results show that the designed method can simplify the complexity of the associated information of the data set and improve both the amount of data collected and the success rate of information collection.
30

Lin, Tao, Bao Hua Qiang, Shi Long, and He Qian. "Deep Web Data Extraction Based on Regular Expression." Advanced Materials Research 718-720 (July 2013): 2242–47. http://dx.doi.org/10.4028/www.scientific.net/amr.718-720.2242.

Abstract:
Data extraction is an important issue in deep web data integration. In order to extract the query results of the deep web, the target data block must first be located correctly. Because the HTML source code of web pages can be parsed into a well-structured DOM, we propose an effective algorithm for discerning the common path based on the hierarchical DOM. Using the common path and a predefined regular expression, the target data of the deep web can be extracted effectively. Experimental results on real websites show that the proposed algorithm is highly effective.
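The two-step idea, locate the records' common container in the DOM and then pull the target fields with a regular expression, can be reduced to the sketch below; the result-page snippet and pattern are invented stand-ins for the common path the algorithm would discover.

```python
# A rough sketch: once the common path (here div.result) is known,
# a regular expression pulls the target fields from each record.
import re

RESULT_PAGE = """
<div class="result"><b>Alpha</b><i>$10</i></div>
<div class="result"><b>Beta</b><i>$25</i></div>
"""

pattern = re.compile(
    r'<div class="result"><b>(?P<name>[^<]+)</b><i>(?P<price>[^<]+)</i></div>'
)
for match in pattern.finditer(RESULT_PAGE):
    print(match.group("name"), match.group("price"))
```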
31

Li, Gui, Cheng Chen, Zheng Yu Li, Zi Yang Han, and Ping Sun. "Web Data Extraction Based on Tag Path Clustering." Advanced Materials Research 756-759 (September 2013): 1590–94. http://dx.doi.org/10.4028/www.scientific.net/amr.756-759.1590.

Abstract:
Fully automatic methods that extract structured data from the web have been studied extensively. The existing methods suffice for simple extraction, but they often fail to handle more complicated web pages. This paper introduces a method based on tag path clustering to extract structured data. The method obtains the complete tag path collection by parsing the DOM tree of the web document; clustering of tag paths is then performed based on an introduced similarity measure, so the data area can be targeted. Next, taking advantage of tag position features, records are separated and filtered, completing the data extraction. Experiments show this method achieves higher accuracy than previous methods.
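The tag-path collection step can be pictured with the sketch below: every text node is keyed by its root-to-node tag path, and paths that repeat point at the data region. Here exact path equality stands in for the paper's similarity-based clustering, and the page snippet is invented.

```python
# A sketch of tag-path collection: key each text node by its
# root-to-node tag path; repeated paths reveal the repeating records.
from collections import defaultdict
from html.parser import HTMLParser

class TagPathCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], defaultdict(list)

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack.pop() != tag:  # pop back to the open tag
                pass

    def handle_data(self, data):
        if data.strip():
            self.paths["/".join(self.stack)].append(data.strip())

PAGE = "<ul><li><b>A</b></li><li><b>B</b></li></ul><p>footer</p>"
collector = TagPathCollector()
collector.feed(PAGE)
for path, texts in collector.paths.items():
    print(path, texts)   # ul/li/b ['A', 'B'] is the data region
```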
32

Ezeife, C. I., and Titas Mutsuddy. "Towards Comparative Mining of Web Document Objects with NFA." International Journal of Data Warehousing and Mining 8, no. 4 (October 2012): 1–21. http://dx.doi.org/10.4018/jdwm.2012100101.

Abstract:
The process of extracting comparative, heterogeneous web content data, both derived and historical, from related web pages is still in its infancy. Discovering potentially useful and previously unknown information or knowledge from web contents, such as "list all articles on 'Sequential Pattern Mining' written between 2007 and 2011, including title, authors, volume, abstract, paper, citation, and year of publication," would require finding the schema of web documents from different web pages, performing web content data integration, and building a virtual or physical data warehouse before web content extraction and mining from the database. This paper proposes a technique for automatic web content data extraction, the WebOMiner system, which models web sites of a specific domain, such as business-to-customer (B2C) web sites, as object-oriented database schemas. Nondeterministic finite automaton (NFA) based wrappers for recognizing content types in this domain are then built and used to extract related contents from data blocks into an integrated database, ready for second-level mining for deep knowledge discovery.
33

Griazev, Kiril, and Simona Ramanauskaitė. "Multi-Purpose Dataset of Webpages and Its Content Blocks: Design and Structure Validation." Applied Sciences 11, no. 8 (April 7, 2021): 3319. http://dx.doi.org/10.3390/app11083319.

Abstract:
The need for automated data extraction is continuously growing due to the constant addition of information to the worldwide web. Researchers are developing new data extraction methods to achieve increased performance compared to existing methods. Comparing algorithms to evaluate their performance is vital when developing new solutions. Different algorithms require different datasets to test their performance due to the various data extraction approaches. Currently, most datasets tend to focus on a specific data extraction approach. Thus, they generally lack the data that may be useful for other extraction methods. That leads to difficulties when comparing the performance of algorithms that are vastly different in their approach. We propose a dataset of web page content blocks that includes various data points to counter this. We also validate its design and structure by performing block labeling experiments. Web developers of varying experience levels labeled multiple websites presented to them. Their labeling results were stored in the newly proposed dataset structure. The experiment proved the need for proposed data points and validated dataset structure suitability for multi-purpose dataset design.
34

Shaukat, Masood, and Khushi. "A Novel Approach to Data Extraction on Hyperlinked Webpages." Applied Sciences 9, no. 23 (November 25, 2019): 5102. http://dx.doi.org/10.3390/app9235102.

Abstract:
The World Wide Web has an enormous amount of useful data presented as HTML tables. These tables are often linked to other web pages that provide further detail on certain attribute values. Extracting the schema of such relational tables is a challenge due to the non-existence of a standard format and a lack of published algorithms. We downloaded 15,000 web pages from various web sites using our in-house web crawler. Tables were extracted from the HTML code, and table rows were labeled with appropriate class labels. Conditional random fields (CRF) were used to classify table rows, and a nondeterministic finite automaton (NFA) algorithm was designed to identify simple, complex, hyperlinked, and non-linked tables. A simple schema was extracted for non-linked tables; for linked tables, a relational schema was developed in the form of primary and foreign keys (PKs and FKs). Child tables were concatenated with the parent table's attribute value (PK), serving as foreign keys (FKs). As a result, these tables can support better and stronger queries using the join operation. A manual check of the linked web table results revealed 99% precision and 68% recall. Our downloadable corpus of 15,000 pages and our novel algorithm provide the basis for further research in this field.
35

Srikanth, Manchikatla. "Semantic educational data extraction using structur-al domain relationships." International Journal of Engineering & Technology 7, no. 1.2 (December 28, 2017): 175. http://dx.doi.org/10.14419/ijet.v7i1.2.9060.

Abstract:
In the mining industry, some domains are especially popular and play a vital role in their specific areas; educational mining and web data extraction are two important factors playing a leading role. The main objective of the proposed system is to extract related content from the web using semantic principles (relating to meaning in language or logic), to allow providers to dynamically generate web pages for educational content, and to allow users to search and extract data from the server based on content. The main model of this system illustrates an adaptive learning system. As a demonstration, we apply semantic principles to educational content in a dynamic environment. The site allows providers to create web pages related to educational content dynamically, which must be approved by the administrator before going live. Once the site is live, users can search for the exact content present on the site based on semantic principles. The proposed model is designed for dynamic web data extraction and for content analysis of the extracted data according to educational principles. In the proposed system, Semantic Web Extraction (SWE) procedures are analyzed in depth and used for content manipulation, giving a dynamic data extraction scheme for users based on educational content rather than on headers, titles, meta tags, and descriptions.
36

Akbar, Memen, and Ardianto Wibowo. "Ekstraksi Tabel HTML Bentuk Column-Row Wise ke dalam Basis Data." Jurnal Teknologi Informasi dan Ilmu Komputer 5, no. 6 (November 22, 2018): 653. http://dx.doi.org/10.25126/jtiik.201856905.

Abstract:
Tables are an important part of a web page, containing tabulations of the data or information the page intends to convey. Tables found on web pages differ from tables in a database: web-page tables tend to have no standard form or layout, since they are entirely up to the page's author. One non-standard table layout on web pages is column-row wise. This study offers an approach for extracting the contents of such tables so that the meaning of the linkage between the two attributes and the data in a column-row wise table is not lost. The extracted data are stored in a database comprising three tables: one storing the first attribute, one storing the second attribute, and one storing both attributes together with the data values they index. The output of this research is an algorithm for extracting data from column-row wise tables on a web page, which is expected to be implementable in various programming languages. For testing, the algorithm was implemented in Python and successfully extracted tables and saved the data into a database. The cyclomatic complexity of the proposed algorithm is 12, which means its complexity is still high.
37

Gu, Jun Hua, Jie Song, Na Zhang, and Yan Liu Liu. "A Method of Web Information Automatic Extraction Based on XML." Applied Mechanics and Materials 20-23 (January 2010): 178–83. http://dx.doi.org/10.4028/www.scientific.net/amm.20-23.178.

Abstract:
With the increasing speed of the Internet and the growth in the amount of data it contains, users find it more and more difficult to obtain useful information from the web, and extracting accurate information from the web efficiently has become an urgent problem. Web information extraction technology has emerged to solve this kind of problem. The method of automatic web information extraction based on XML presented here works by standardizing the HTML document using a data translation algorithm, forming an extraction rule base by learning the XPath expressions of samples, and using the rule base to automatically extract data from pages of the same kind. The results show that this approach leads to higher recall and precision ratios, and that the result is self-describing, making it convenient for building data extraction systems in any domain.
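The rule-learning step, deriving a reusable XPath expression from a sample element, can be sketched as below: take a marked node, read off its absolute path, and generalize it by dropping the positional index. The sample markup and the generalization rule are our assumptions, not the paper's algorithm.

```python
# A small sketch of learning an XPath rule from one marked sample
# element; the markup and the generalization step are assumptions.
from lxml import html  # pip install lxml

sample = html.fromstring(
    "<div><span class='title'>First post</span>"
    "<span class='title'>Second post</span></div>"
)
marked = sample.find_class("title")[0]

path = sample.getroottree().getpath(marked)  # e.g. /div/span[1]
rule = path.rsplit("[", 1)[0]                # generalize: /div/span

print(sample.xpath(rule + "/text()"))  # ['First post', 'Second post']
```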
38

Cetorelli, Valerio, Paolo Atzeni, Valter Crescenzi, and Franco Milicchio. "The smallest extraction problem." Proceedings of the VLDB Endowment 14, no. 11 (July 2021): 2445–58. http://dx.doi.org/10.14778/3476249.3476293.

Abstract:
We introduce landmark grammars, a new family of context-free grammars aimed at describing the HTML source code of pages published by large, templated websites, and therefore at effectively tackling web data extraction problems. Indeed, they address the inherent ambiguity of HTML, one of the main challenges of web data extraction, which, despite over twenty years of research, has been largely neglected by the approaches presented in the literature. We then formalize the Smallest Extraction Problem (SEP), an optimization problem for finding the grammar of a family that best describes a set of pages and contextually extracts their data. Finally, we present an unsupervised learning algorithm to induce a landmark grammar from a set of pages sharing a common HTML template, and we present an automatic web data extraction system. Experiments on consolidated benchmarks show that the approach can substantially improve on the state of the art.
39

Haarman, Tim, Bastiaan Zijlema, and Marco Wiering. "Unsupervised Keyphrase Extraction for Web Pages." Multimodal Technologies and Interaction 3, no. 3 (July 31, 2019): 58. http://dx.doi.org/10.3390/mti3030058.

Abstract:
Keyphrase extraction is an important part of natural language processing (NLP) research, although little research is done in the domain of web pages. The World Wide Web contains billions of pages that are potentially interesting for various NLP tasks, yet it remains largely untouched in scientific research. Current research is often only applied to clean corpora such as abstracts and articles from academic journals or sets of scraped texts from a single domain. However, textual data from web pages differ from normal text documents, as it is structured using HTML elements and often consists of many small fragments. These elements are furthermore used in a highly inconsistent manner and are likely to contain noise. We evaluated the keyphrases extracted by several state-of-the-art extraction methods and found that they did not transfer well to web pages. We therefore propose WebEmbedRank, an adaptation of a recently proposed extraction method that can make use of structural information in web pages in a robust manner. We compared this novel method to other baselines and state-of-the-art methods using a manually annotated dataset and found that WebEmbedRank achieved significant improvements over existing extraction methods on web pages.
40

AMIN, MOHAMMAD SHAFKAT, and HASAN JAMIL. "AN EFFICIENT WEB-BASED WRAPPER AND ANNOTATOR FOR TABULAR DATA." International Journal of Software Engineering and Knowledge Engineering 20, no. 02 (March 2010): 215–31. http://dx.doi.org/10.1142/s0218194010004657.

Abstract:
In the last few years, several works in the literature have addressed the problem of data extraction from web pages. The importance of this problem derives from the fact that, once extracted, data can be handled in a way similar to instances of a traditional database, which in turn can facilitate application of web data integration and various other domain specific problems. In this paper, we propose a novel table extraction technique that works on web pages generated dynamically from a back-end database. The proposed system can automatically discover table structure by relevant pattern mining from web pages in an efficient way, and can generate regular expression for the extraction process. Moreover, the proposed system can assign intuitive column names to the columns of the extracted table by leveraging Wikipedia knowledge base for the purpose of table annotation. To improve accuracy of the assignment, we exploit the structural homogeneity of the column values and their co-location information to weed out less likely candidates. This approach requires no human intervention and experimental results have shown its accuracy to be promising. Moreover, the wrapper generation algorithm works in linear time.
41

Kerui, Chen, Wanli Zuo, Fengling He, Yongheng Chen, and Ying Wang. "Data extraction and annotation based on domain-specific ontology evolution for deep web." Computer Science and Information Systems 8, no. 3 (2011): 673–92. http://dx.doi.org/10.2298/csis101011023k.

Abstract:
Deep web sites respond to a user query with result records encoded in HTML files. Data extraction and data annotation, which are important for many applications, extract and annotate the records from these HTML pages. We propose a domain-specific, ontology-based data extraction and annotation technique. We first construct a mini-ontology for the specific domain according to information from the query interface and query result pages; then we use the constructed mini-ontology to identify data areas and map data annotations during extraction. In order to adapt to new sample sets, the mini-ontology evolves dynamically based on the data extraction and data annotation. Experimental results demonstrate that this method achieves higher precision and recall in data extraction and data annotation.
42

Embley, D. W., D. M. Campbell, Y. S. Jiang, S. W. Liddle, D. W. Lonsdale, Y. K. Ng, and R. D. Smith. "Conceptual-model-based data extraction from multiple-record Web pages." Data & Knowledge Engineering 31, no. 3 (November 1999): 227–51. http://dx.doi.org/10.1016/s0169-023x(99)00027-0.

43

Singh Sehgal, Manpreet, and Anuradha. "HWPDE: Novel Approach for Data Extraction from Structured Web Pages." International Journal of Computer Applications 50, no. 8 (July 28, 2012): 22–27. http://dx.doi.org/10.5120/7791-0897.

44

Alarte, Julián, and Josep Silva. "Page-Level Main Content Extraction From Heterogeneous Webpages." ACM Transactions on Knowledge Discovery from Data 15, no. 6 (June 28, 2021): 1–105. http://dx.doi.org/10.1145/3451168.

Abstract:
The main content of a webpage is often surrounded by other boilerplate elements related to the template, such as menus, advertisements, copyright notices, and comments. For crawlers and indexers, isolating the main content from the template and other noisy information is an essential task, because processing and storing noisy information waste resources such as bandwidth, storage space, and computing time. Besides, the detection and extraction of the main content is useful in different areas, such as data mining, web summarization, and content adaptation to low resolutions. This work introduces a new technique for main content extraction. In contrast to most techniques, it extracts not only text but also other types of content, such as images and animations. It is a Document Object Model-based, page-level technique, so it only needs to load one single webpage to extract the main content; as a consequence, it is efficient enough to be used online (in real time). We have empirically evaluated the technique using a suite of real, heterogeneous benchmarks, producing very good results compared with other well-known content extraction techniques.
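A much-simplified cousin of page-level main-content extraction is shown below: score each top-level DOM block by how much text it holds relative to its markup and keep the densest one. This is a toy heuristic to convey the idea, not the technique evaluated in the paper; the page snippet is invented.

```python
# A toy main-content detector: keep the DOM block with the highest
# text-to-markup density. A simplification, not the paper's technique.
from bs4 import BeautifulSoup  # pip install beautifulsoup4

PAGE = """
<html><body>
  <div id="menu"><a href="#">Home</a><a href="#">About</a></div>
  <div id="main"><p>Long article text goes here, paragraph after
  paragraph of the content we actually want to keep.</p></div>
  <div id="footer">copyright notice</div>
</body></html>
"""

soup = BeautifulSoup(PAGE, "html.parser")

def text_density(node):
    text_len = len(node.get_text(strip=True))
    tag_count = 1 + len(node.find_all(True))  # node plus its descendants
    return text_len / tag_count

blocks = soup.body.find_all("div", recursive=False)
print(max(blocks, key=text_density).get("id"))  # main
```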
45

Zadgaonkar, A. V., A. J. Agrawal, and S. Aote. "Facets extraction-based approach for query recommendation using data mining approach." International Journal of Engineering & Technology 7, no. 1 (January 30, 2018): 121. http://dx.doi.org/10.14419/ijet.v7i1.8944.

Abstract:
Search engines are popularly used for extracting desired information from the World Wide Web. The efficiency of these search engines depends on how fast search results can be retrieved and on whether those results reflect the desired information. For a given query, a vast amount of relevant information is scattered across multiple web pages, and search engines output multiple web links; it has been a jigsaw puzzle for users to identify and select the relevant links for extracting further desired information. To address this issue, we propose an approach for query recommendation that obtains relevant search results from the web using facet mining techniques. Facets are the semantically related words of a query that define its multiple aspects. We extract these aspects of a query from Wikipedia pages, which are considered a trustworthy resource on the web. Our proposed system uses various text processing techniques to refine the results using a lexical resource, WordNet. In this paper we discuss our approach, its implementation, and the results obtained, and we conclude with a discussion of future research directions.
46

Hany Salman, Rasha, Mahmood Zaki, and Nadia A. Shiltag. "A STUDYING OF WEB CONTENT MINING TOOLS." Al-Qadisiyah Journal Of Pure Science 25, no. 2 (April 7, 2020): 1–16. http://dx.doi.org/10.29350/qjps.2020.25.2.1067.

Abstract:
The web today has become an archive of information in many forms, such as text, audio, video, graphics, and multimedia, and with the passage of time the World Wide Web has become crowded with diverse data, making the extraction of relevant data a burdensome process. The web uses various data mining strategies to mine helpful information from page content and web hyperlinks. The fundamental uses of web content mining are to gather, organize, and classify data, providing the best information available on the web for the users who need it. WCM tools are needed to examine HTML documents, text, and images, with the results then used by web search engines. This paper presents an overview of web mining categorization and web content techniques, together with a critical review and study of web content mining tools from 2011 to 2019, building tables that compare these tools against several important criteria.
APA, Harvard, Vancouver, ISO, and other styles
47

Abburu, Dr Sunitha, and G. Suresh Babu. "A FRAME WORK FOR WEB INFORMATION EXTRACTION AND ANALYSIS." INTERNATIONAL JOURNAL OF COMPUTERS & TECHNOLOGY 7, no. 2 (June 5, 2013): 574–79. http://dx.doi.org/10.24297/ijct.v7i2.3459.

Full text
Abstract:
Day by day, the volume of information available on the web is growing significantly. Information on the web takes several structural forms: structured, semi-structured, and unstructured. The majority of information on the web is presented in web pages, where it is semi-structured, but the information required for a given context is scattered across different web documents. It is difficult to analyse these large volumes of semi-structured information and to make decisions based on that analysis. The current research work proposes a framework for a system that extracts information from various sources and prepares reports based on the knowledge built from the analysis. This simplifies data extraction, data consolidation, data analysis, and decision making based on the information presented in web pages. The proposed framework integrates web crawling, information extraction, and data mining technologies for better information analysis and, in turn, more effective decision making. It enables people and organisations to extract information from various sources on the web and to perform effective analysis on the extracted data. The framework is applicable to any application domain; manufacturing, sales, tourism, and e-learning are a few examples. The framework has been implemented and tested for effectiveness, and the results are promising.
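The abstract describes the framework only at the architectural level; a minimal end-to-end skeleton of the crawl, extract, and analyse stages might look like the Python sketch below. All names and the toy report are illustrative assumptions, not the paper's implementation.

```python
# Hedged skeleton of a crawl -> extract -> analyse pipeline.
# Function bodies are placeholders for domain-specific logic.
import requests
from bs4 import BeautifulSoup

def crawl(urls):
    for url in urls:
        yield url, requests.get(url, timeout=10).text

def extract(html):
    soup = BeautifulSoup(html, "html.parser")
    # Pull semi-structured fields: title plus visible paragraphs.
    return {"title": soup.title.get_text(strip=True) if soup.title else "",
            "paragraphs": [p.get_text(strip=True) for p in soup.find_all("p")]}

def analyze(records):
    # Consolidation/mining step; here only a trivial aggregate report.
    return {"pages": len(records),
            "total_paragraphs": sum(len(r["paragraphs"]) for r in records)}

records = [extract(html) for _, html in crawl(["https://example.com"])]
print(analyze(records))
```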
APA, Harvard, Vancouver, ISO, and other styles
48

Yu, Lehe, and Zhengxiu Gui. "Analysis of Enterprise Social Media Intelligence Acquisition Based on Data Crawler Technology." Entrepreneurship Research Journal 11, no. 2 (February 22, 2021): 3–23. http://dx.doi.org/10.1515/erj-2020-0267.

Full text
Abstract:
There are generally hundreds of millions of nodes in social media, connected into a huge social network through follower and fan relationships, and news spreads through this network. This paper studies acquisition technology for social media topic data and enterprise data. A topic positioning technique based on Sina meta search and topic-related keywords is introduced, and the crawling efficiency of topic crawlers is analysed. To cope with the diverse and variable structure of web pages on the Internet, this paper proposes a new web information extraction algorithm that exploits general regularities in webpage structure by combining a DOM (Document Object Model) tree with the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm. The individual steps of the algorithm are described in detail, including web page preprocessing, DOM tree construction, segmented text content acquisition, and web content extraction based on DBSCAN. The simulation results show that the intelligence culture, intelligence system, technology platform, and intelligence organisation collaboration strategies supported by DOM tree and DBSCAN information extraction can improve the level of intelligence participation of all employees, and that this participation is significantly positively correlated with the level of the intelligence environment. According to the research results, the proposed DOM tree and DBSCAN extraction can capture employee intelligence within the enterprise and guide the effective implementation of the relevant collaborative strategies.
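The abstract names the ingredients (DOM tree segmentation plus DBSCAN) without giving features or parameters. The following Python sketch clusters DOM text blocks on three assumed features (document order, text length, link-text ratio) and keeps the largest non-noise cluster; eps, min_samples, and the feature set are guesses, not the paper's values.

```python
# Hedged sketch: segment DOM text blocks, cluster with DBSCAN,
# keep the largest non-noise cluster as candidate main content.
import numpy as np
from bs4 import BeautifulSoup
from sklearn.cluster import DBSCAN

def segment_blocks(html):
    soup = BeautifulSoup(html, "html.parser")
    blocks = [t for t in soup.find_all(["p", "div", "li"])
              if t.get_text(strip=True)]
    feats = []
    for i, b in enumerate(blocks):
        text = b.get_text(strip=True)
        link_text = sum(len(a.get_text(strip=True)) for a in b.find_all("a"))
        # Assumed features: document order, text length, link-text ratio.
        feats.append([i, len(text), link_text / max(1, len(text))])
    return blocks, np.array(feats, dtype=float)

def densest_cluster(html, eps=0.5, min_samples=3):
    blocks, feats = segment_blocks(html)
    # Normalize so DBSCAN's Euclidean distance treats features equally.
    feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-9)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)
    core = max(set(labels) - {-1}, key=list(labels).count, default=-1)
    return [b for b, l in zip(blocks, labels) if l == core]
```

In practice eps would be tuned against a handful of labelled pages; the normalisation step keeps the three features commensurate.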
APA, Harvard, Vancouver, ISO, and other styles
49

Huang, Jui-Chan, Po-Chang Ko, Cher-Min Fong, Sn-Man Lai, Hsin-Hung Chen, and Ching-Tang Hsieh. "Statistical Modeling and Simulation of Online Shopping Customer Loyalty Based on Machine Learning and Big Data Analysis." Security and Communication Networks 2021 (February 18, 2021): 1–12. http://dx.doi.org/10.1155/2021/5545827.

Full text
Abstract:
With the increase in the number of online shopping users, customer loyalty is directly related to product sales. This research explores statistical modeling and simulation of online shopping customer loyalty based on machine learning and big data analysis, mainly using a machine learning clustering algorithm to model customer loyalty. A hash-structure-based interactive k-means mining algorithm is called to perform data mining on a multidimensional hierarchical tree of corporate credit risk; the support thresholds for different levels of data mining are continuously adjusted according to specific requirements, and effective association rules are selected until satisfactory results are obtained. After credit risk assessment and early-warning modeling for the enterprise, an initial preselected model is obtained. The information to be collected is first fetched by a web crawler from the target website into a temporary web page database, where it goes through preprocessing steps such as completion, deduplication, parsing, and extraction; these steps ensure that crawled pages are parsed correctly and guard against data corrupted by network errors during crawling. Correctly parsed data are stored for the subsequent data cleaning and analysis steps. To parse HTML documents, a Java program first sets the subject keyword and URL and parses the HTML from the retrieved file or string by analysing the structure of the website; it then uses CSS selectors to locate the list information on the page, retrieves the data, and stores it in Elements. In the overall fit test of the model, the root mean square error of approximation (RMSEA) value is 0.053, between 0.05 and 0.08. The results show that the model designed in this study achieves a relatively good fit, and that strengthening customers' perception of shopping websites and building relationship trust play a major role in maintaining customer loyalty.
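As a stand-in for the loyalty-modeling step, the sketch below clusters customers with plain k-means from scikit-learn on RFM-style features (recency, frequency, monetary value). The hash-structured interactive variant and the association-rule tuning described in the abstract are not reproduced; the features, the toy data, and k=2 are assumptions.

```python
# Hedged sketch: coarse loyalty segmentation with plain k-means.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Columns: recency (days), frequency (orders), monetary (spend) -- toy data.
X = np.array([[5, 40, 1200], [60, 3, 90], [10, 25, 800],
              [90, 1, 30], [7, 30, 950], [45, 5, 150]], dtype=float)

X_scaled = StandardScaler().fit_transform(X)  # k-means is scale-sensitive
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)

# Cluster labels serve as a coarse segmentation (e.g. loyal vs. lapsed).
print(model.labels_)
```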
APA, Harvard, Vancouver, ISO, and other styles
50

Cramond, Fala, Alison O'Mara-Eves, Lee Doran-Constant, Andrew SC Rice, Malcolm Macleod, and James Thomas. "The development and evaluation of an online application to assist in the extraction of data from graphs for use in systematic reviews." Wellcome Open Research 3 (March 7, 2019): 157. http://dx.doi.org/10.12688/wellcomeopenres.14738.3.

Full text
Abstract:
Background: The extraction of data from the reports of primary studies, on which the results of systematic reviews depend, needs to be carried out accurately. To aid reliability, it is recommended that two researchers carry out data extraction independently. The extraction of statistical data from graphs in PDF files is particularly challenging, as the process is usually completely manual, and reviewers sometimes need to resort to holding a ruler against the page to read off values: an inherently time-consuming and error-prone process. Methods: To mitigate some of the above problems we integrated and customised two existing JavaScript libraries to create a new web-based graphical data extraction tool to assist reviewers in extracting data from graphs. This tool aims to facilitate more accurate and timely data extraction through a user interface in which data points are extracted with mouse clicks. We carried out a non-inferiority evaluation to examine its performance in comparison with participants’ standard practice for extracting data from graphs in PDF documents. Results: We found that the customised graphical data extraction tool is not inferior to users’ (N=10) prior standard practice. Our study was not designed to show superiority, but suggests that, on average, participants saved around 6 minutes per graph using the new tool, accompanied by a substantial increase in accuracy. Conclusions: Our study suggests that the incorporation of this type of tool in online systematic review software would be beneficial in facilitating the production of accurate and timely evidence synthesis to improve decision-making.
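The core arithmetic behind such a click-to-value tool reduces to linear interpolation between two calibrated axis points. A minimal Python sketch of that mapping, with illustrative names not taken from the tool itself:

```python
# Map a clicked pixel coordinate to a data value by calibrating an
# axis with two reference points of known value.
def calibrate(p0, p1, v0, v1):
    """Return a function mapping a pixel coordinate to an axis value,
    given two reference pixels p0, p1 with known values v0, v1."""
    scale = (v1 - v0) / (p1 - p0)
    return lambda p: v0 + (p - p0) * scale

# Example: y-axis pixels 400 and 100 correspond to values 0 and 50.
to_value = calibrate(400, 100, 0.0, 50.0)
print(to_value(250))  # midpoint pixel -> 25.0
```

Calibrating each axis once per graph and then applying the returned function to every click is what replaces the ruler-against-the-page step.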
APA, Harvard, Vancouver, ISO, and other styles