Dissertations / Theses on the topic 'Web Crawler'

To see the other types of publications on this topic, follow the link: Web Crawler.

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Web Crawler.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses in a wide variety of disciplines and organise your bibliography correctly.

1

PAES, Vinicius de Carvalho. "Crawler de Faces na Web." Repositório Institucional da UNIFEI, 2012. http://repositorio.unifei.edu.br/xmlui/handle/123456789/1099.

Full text
Abstract:
The primary focus of this project is to define the basic structure required for the development and practical application of a face search engine, so as to guarantee a search with appropriate quality parameters.
APA, Harvard, Vancouver, ISO, and other styles
2

Nguyen, Qui V. "Enhancing a Web Crawler with Arabic Search." Thesis, Monterey, California: Naval Postgraduate School, 2012.

Find full text
Abstract:
Many advantages of the Internet (ease of access, limited regulation, vast potential audience, and fast flow of information) have turned it into the most popular way to communicate and exchange ideas. Criminal and terrorist groups also use these advantages to turn the Internet into their new play/battle fields to conduct their illegal/terror activities. There are millions of Web sites in different languages on the Internet, but the lack of foreign language search engines makes it impossible to analyze foreign language Web sites efficiently. This thesis will enhance an open source Web crawler with Arabic search capability, thus improving an existing social networking tool to perform page correlation and analysis of Arabic Web sites. A social networking tool with Arabic search capabilities could become a valuable tool for the intelligence community. Its page correlation and analysis results could be used to collect open source intelligence and build a network of Web sites that are related to terrorist or criminal activities.
APA, Harvard, Vancouver, ISO, and other styles
3

Ali, Halil. "Effective web crawlers." RMIT University, CS&IT, 2008. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20081127.164414.

Full text
Abstract:
Web crawlers are the component of a search engine that must traverse the Web, gathering documents in a local repository for indexing by a search engine so that they can be ranked by their relevance to user queries. Whenever data is replicated in an autonomously updated environment, there are issues with maintaining up-to-date copies of documents. When documents are retrieved by a crawler and have subsequently been altered on the Web, the effect is an inconsistency in user search results. While the impact depends on the type and volume of change, many existing algorithms do not take the degree of change into consideration, instead using simple measures that consider any change as significant. Furthermore, many crawler evaluation metrics do not consider index freshness or the amount of impact that crawling algorithms have on user results. Most of the existing work makes assumptions about the change rate of documents on the Web, or relies on the availability of a long history of change. Our work investigates approaches to improving index consistency: detecting meaningful change, measuring the impact of a crawl on collection freshness from a user perspective, developing a framework for evaluating crawler performance, determining the effectiveness of stateless crawl ordering schemes, and proposing and evaluating the effectiveness of a dynamic crawl approach. Our work is concerned specifically with cases where there is little or no past change statistics with which predictions can be made. Our work analyses different measures of change and introduces a novel approach to measuring the impact of recrawl schemes on search engine users. Our schemes detect important changes that affect user results. Other well-known and widely used schemes have to retrieve around twice the data to achieve the same effectiveness as our schemes. Furthermore, while many studies have assumed that the Web changes according to a model, our experimental results are based on real web documents. We analyse various stateless crawl ordering schemes that have no past change statistics with which to predict which documents will change, none of which, to our knowledge, has been tested to determine effectiveness in crawling changed documents. We empirically show that the effectiveness of these schemes depends on the topology and dynamics of the domain crawled and that no one static crawl ordering scheme can effectively maintain freshness, motivating our work on dynamic approaches. We present our novel approach to maintaining freshness, which uses the anchor text linking documents to determine the likelihood of a document changing, based on statistics gathered during the current crawl. We show that this scheme is highly effective when combined with existing stateless schemes. When we combine our scheme with PageRank, our approach allows the crawler to improve both freshness and quality of a collection. Our scheme improves freshness regardless of which stateless scheme it is used in conjunction with, since it uses both positive and negative reinforcement to determine which document to retrieve. Finally, we present the design and implementation of Lara, our own distributed crawler, which we used to develop our testbed.
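The anchor-text signal described above lends itself to a compact sketch. The following is not the thesis's implementation, only a minimal illustration under the assumption that the crawler tracks, during the current crawl, whether the anchor text pointing at each URL has changed, and orders its recrawl queue by that evidence.

```python
import heapq
from collections import defaultdict

class AnchorTextScheduler:
    """Toy recrawl scheduler: pages whose incoming anchor text keeps changing
    during the current crawl are assumed more likely to have changed themselves
    and are recrawled first (illustrative only)."""

    def __init__(self):
        self.last_anchor = {}                  # url -> last seen anchor text
        self.change_score = defaultdict(int)   # url -> positive/negative evidence

    def observe_link(self, url, anchor_text):
        previous = self.last_anchor.get(url)
        if previous is not None:
            # positive reinforcement if the anchor changed, negative otherwise
            self.change_score[url] += 1 if anchor_text != previous else -1
        self.last_anchor[url] = anchor_text

    def recrawl_order(self, urls):
        # highest change score first; heapq is a min-heap, so negate the score
        heap = [(-self.change_score[u], u) for u in urls]
        heapq.heapify(heap)
        while heap:
            _, url = heapq.heappop(heap)
            yield url

if __name__ == "__main__":
    s = AnchorTextScheduler()
    s.observe_link("http://example.org/news", "Tuesday headlines")
    s.observe_link("http://example.org/news", "Wednesday headlines")  # changed
    s.observe_link("http://example.org/about", "About us")
    s.observe_link("http://example.org/about", "About us")            # unchanged
    print(list(s.recrawl_order(["http://example.org/news", "http://example.org/about"])))
```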
APA, Harvard, Vancouver, ISO, and other styles
4

Kayisoglu, Altug. "Lokman: A Medical Ontology Based Topical Web Crawler." Master's thesis, METU, 2005. http://etd.lib.metu.edu.tr/upload/2/12606468/index.pdf.

Full text
Abstract:
Use of ontology is an approach to overcome the "search-on-the-net" problem. An ontology based web information retrieval system requires a topical web crawler to construct a high quality document collection. This thesis focuses on implementing a topical web crawler with a medical domain ontology in order to find out the advantages of ontological information in web crawling. The crawler is implemented with a Best-First search algorithm and its design is optimized for the UMLS ontology. The crawler is tested with the Harvest Rate and Target Recall metrics and compared to a non-ontology-based Best-First crawler. The test results showed that ontology use in the crawler's URL selection algorithm improved crawler performance by 76%.
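To make the best-first idea concrete, here is a minimal sketch of an ontology-guided frontier. The term list and scoring function are assumptions for illustration only; the thesis's UMLS-based scoring is not reproduced here.

```python
import heapq
import itertools

# Hypothetical list of ontology terms; a real system would derive these from
# a medical ontology such as UMLS rather than hard-coding them.
ONTOLOGY_TERMS = {"diagnosis", "therapy", "oncology", "cardiology"}

def relevance(text):
    """Score a page or anchor context by counting ontology term hits."""
    return sum(1 for w in text.lower().split() if w in ONTOLOGY_TERMS)

def best_first_crawl(seed_urls, fetch, extract_links, limit=100):
    """Generic best-first crawl: the frontier is a priority queue ordered by
    the ontology relevance of the context each URL was found in."""
    counter = itertools.count()          # tie-breaker for equal scores
    frontier = [(0, next(counter), url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    fetched = []
    while frontier and len(fetched) < limit:
        neg_score, _, url = heapq.heappop(frontier)
        page_text = fetch(url)           # caller-supplied download function
        fetched.append((url, -neg_score))
        for link, anchor_context in extract_links(page_text):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance(anchor_context), next(counter), link))
    return fetched

if __name__ == "__main__":
    # Tiny in-memory "web" where page text is just the URL itself.
    pages = {"seed": [("a", "cardiology unit"), ("b", "contact us")], "a": [], "b": []}
    print(best_first_crawl(["seed"], fetch=lambda u: u,
                           extract_links=lambda text: pages.get(text, [])))
```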
APA, Harvard, Vancouver, ISO, and other styles
5

Pandya, Milan. "A Domain Based Approach to Crawl the Hidden Web." Digital Archive @ GSU, 2006. http://digitalarchive.gsu.edu/cs_theses/32.

Full text
Abstract:
There is a lot of research work being performed on indexing the Web. More and more sophisticated Web crawlers are being designed to search and index the Web faster. But all these traditional crawlers crawl only the part of the Web we call the “Surface Web”. They are unable to crawl the hidden portion of the Web. These traditional crawlers retrieve content only from surface Web pages, which are just a set of Web pages linked by hyperlinks, and ignore the hidden information. Hence, they ignore the tremendous amount of information hidden behind search forms in Web pages. Most of the published research has been done to detect such searchable forms and make a systematic search over these forms. Our approach here is based on a Web crawler that analyzes search forms and fills them with appropriate content to retrieve maximum relevant information from the database.
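A hidden-web crawler of this kind needs to locate search forms and submit candidate queries through them. The sketch below, using only the Python standard library, is an illustrative simplification (GET forms and text inputs only), not the approach implemented in the thesis; the domain term list is a placeholder.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

class FormFinder(HTMLParser):
    """Collect <form> actions and the names of their text/search inputs."""
    def __init__(self):
        super().__init__()
        self.forms = []          # list of dicts: {"action": ..., "fields": [...]}
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action", ""), "fields": []}
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            if attrs.get("type", "text") in ("text", "search"):
                self._current["fields"].append(attrs.get("name", ""))

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None

def build_queries(page_url, html, domain_terms):
    """Pair every detected search form with candidate GET query URLs, one per
    domain-specific term (real hidden-web crawlers also handle POST forms,
    select boxes and result de-duplication)."""
    finder = FormFinder()
    finder.feed(html)
    queries = []
    for form in finder.forms:
        action = urljoin(page_url, form["action"] or page_url)
        for field in form["fields"]:
            for term in domain_terms:
                queries.append(action + "?" + urlencode({field: term}))
    return queries

if __name__ == "__main__":
    sample = '<form action="/search"><input type="text" name="q"></form>'
    print(build_queries("http://example.org/", sample, ["jazz", "vinyl"]))
```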
APA, Harvard, Vancouver, ISO, and other styles
6

Koron, Ronald Dean. "Developing a Semantic Web Crawler to Locate OWL Documents." Wright State University / OhioLINK, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=wright1347937844.

Full text
APA, Harvard, Vancouver, ISO, and other styles
7

Stivala, Giada Martina. "Perceptual Web Crawlers." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019.

Find full text
Abstract:
Web crawlers are a fundamental component of web application scanners and are used to explore the attack surface of web applications. Crawlers work as follows. First, for each page, they extract URLs and UI elements that may lead to new pages. Then, they use a depth-first or breadth-first tree traversal to explore new pages. In this approach, crawlers cannot distinguish between "terminate user account" and "next page" buttons, and they will click on both without taking into account the consequences of their actions. The goal of this project is to devise a new family of crawlers that builds on client-side code analysis and expands it with the inference of the semantics of UI elements by using visual clues. The new crawler will be able to identify, in real time, the types and semantics of UI elements, and it will use those semantics to choose the right action. This project includes the development of a prototype and an evaluation against a selection of real-size web applications.
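As a rough illustration of letting element semantics guide crawler actions, the sketch below classifies clickable elements by their label text alone; the keyword lists are hypothetical and stand in for the visual and client-side analysis the thesis proposes.

```python
import re

# Hypothetical keyword lists standing in for the visual/semantic classifier
# described in the abstract; a real system would also use rendering clues.
DESTRUCTIVE = re.compile(r"delete|remove|terminate|deactivate|logout", re.I)
NAVIGATION  = re.compile(r"next|more|continue|show all|page \d+", re.I)

def classify_ui_element(label):
    """Assign a coarse semantic class to a clickable element's label."""
    if DESTRUCTIVE.search(label):
        return "state-changing"      # the crawler should not click blindly
    if NAVIGATION.search(label):
        return "navigation"          # safe to follow during exploration
    return "unknown"

def choose_actions(labels):
    """Keep only elements a cautious crawler would interact with."""
    return [l for l in labels if classify_ui_element(l) == "navigation"]

if __name__ == "__main__":
    buttons = ["Next page", "Terminate user account", "Show all results"]
    print({b: classify_ui_element(b) for b in buttons})
    print(choose_actions(buttons))
```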
APA, Harvard, Vancouver, ISO, and other styles
8

Choudhary, Suryakant. "M-crawler: Crawling Rich Internet Applications Using Menu Meta-model." Thèse, Université d'Ottawa / University of Ottawa, 2012. http://hdl.handle.net/10393/23118.

Full text
Abstract:
Web applications have come a long way both in terms of adoption to provide information and services and in terms of the technologies to develop them. With the emergence of richer and more advanced technologies such as Ajax, web applications have become more interactive, responsive and user friendly. These applications, often called Rich Internet Applications (RIAs), changed the traditional web applications in two primary ways: dynamic manipulation of client-side state and asynchronous communication with the server. At the same time, such techniques also introduce new challenges. Among these challenges, an important one is the difficulty of automatically crawling these new applications. Crawling is not only important for indexing the contents but also critical to web application assessment such as testing for security vulnerabilities or accessibility. Traditional crawlers are no longer sufficient for these newer technologies, and crawling of RIAs is either nonexistent or far from perfect. There is a need for an efficient crawler for web applications developed using these new technologies. Further, as more and more enterprises use these new technologies to provide their services, the requirement for a better crawler becomes inevitable. This thesis studies the problems associated with crawling RIAs. Crawling RIAs is fundamentally more difficult than crawling traditional multi-page web applications. The thesis also presents an efficient RIA crawling strategy and compares it with existing methods.
APA, Harvard, Vancouver, ISO, and other styles
9

Lee, Hsin-Tsang. "IRLbot: design and performance analysis of a large-scale web crawler." Texas A&M University, 2008. http://hdl.handle.net/1969.1/85914.

Full text
Abstract:
This thesis shares our experience in designing web crawlers that scale to billions of pages and models their performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, breadth-first search (BFS) crawl order, and fixed per-host rate-limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly-branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the web graph with 41 billion unique nodes.
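Two of the bottlenecks named above, URL uniqueness checking and per-host rate limiting, can be illustrated with a toy in-memory version; IRLbot's disk-based structures (its DRUM design and host budgeting) solve the same problems at a vastly larger scale and are not reproduced here.

```python
import hashlib
import time
from collections import defaultdict
from urllib.parse import urlsplit

class SeenURLs:
    """Membership test on fixed-size URL digests rather than full strings,
    keeping the per-URL memory footprint constant (illustrative only)."""
    def __init__(self):
        self._digests = set()

    def add_if_new(self, url):
        d = hashlib.sha1(url.encode("utf-8")).digest()[:8]
        if d in self._digests:
            return False
        self._digests.add(d)
        return True

class HostBudget:
    """Crude politeness: at most one request per host every `delay` seconds."""
    def __init__(self, delay=2.0):
        self.delay = delay
        self.next_allowed = defaultdict(float)

    def ready(self, url, now=None):
        host = urlsplit(url).netloc
        now = time.monotonic() if now is None else now
        if now >= self.next_allowed[host]:
            self.next_allowed[host] = now + self.delay
            return True
        return False

if __name__ == "__main__":
    seen, budget = SeenURLs(), HostBudget(delay=2.0)
    for u in ["http://a.example/1", "http://a.example/1", "http://a.example/2"]:
        print(u, "new" if seen.add_if_new(u) else "dup",
              "fetch" if budget.ready(u) else "wait")
```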
APA, Harvard, Vancouver, ISO, and other styles
10

Karki, Rabin. "Fresh Analysis of Streaming Media Stored on the Web." Digital WPI, 2011. https://digitalcommons.wpi.edu/etd-theses/81.

Full text
Abstract:
With the steady increase in the bandwidth available to end users and Web sites hosting user-generated content, there appears to be more multimedia content on the Web than ever before. Studies to quantify media stored on the Web done in 1997 and 2003 are now dated since the nature, size and number of streaming media objects on the Web have changed considerably. Although there have been more recent studies characterizing specific streaming media sites like YouTube, there are only a few studies that focus on characterizing the media stored on the Web as a whole. We build customized tools to crawl the Web, identify streaming media content and extract the characteristics of the streaming media found. We chose 16 different starting points and crawled 1.25 million Web pages from each starting point. Using the custom built tools, the media objects are identified and analyzed to determine attributes including media type, media length, codecs used for encoding, encoded bitrate, resolution, and aspect ratio. A little over half the media clips we encountered are video. MP3 and AAC are the most prevalent audio codecs whereas H.264 and FLV are the most common video codecs. The median size and encoded bitrates of stored media have increased since the last study. Information on the characteristics of stored multimedia and their trends over time can help system designers. The results can also be useful for empirical Internet measurement studies that attempt to mimic the behavior of streaming media traffic over the Internet.
APA, Harvard, Vancouver, ISO, and other styles
11

Englund, Malin, Christian Gullberg, and Jesper Wiklund. "A web crawler to effectively find web shops built with a specific e-commerce plug-in." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-325788.

Full text
Abstract:
Nowadays online shopping has become very common. Being able to buy things online and get them sent to the door is something many people find convenient and appealing. With the demand comes the market, and web shops have therefore become a popular place for companies to sell their items. Companies that want to sell their products to web shops can have a hard time finding potential customers in an efficient way. This project is an attempt to solve this problem by finding a large quantity of web shops with a specific e-commerce plug-in, in this case WooCommerce, in a short amount of time. The solution was to create a web crawler with the purpose of searching the Internet for web shops. The result of the search is stored in a database where the user can retrieve information, such as revenue and company name, about the web shops found. The project was a success in terms of efficiency, but with room for improvement regarding robustness and accuracy.
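The detection step of such a crawler can be sketched simply: fetch a page and look for plug-in fingerprints. The marker strings below are assumptions based on how WooCommerce typically exposes its assets and markup, not markers taken from the thesis.

```python
import urllib.request

# Markers assumed to indicate a WooCommerce storefront; common fingerprints
# of the plug-in's assets and CSS classes, used here only for illustration.
WOO_MARKERS = (
    "wp-content/plugins/woocommerce",
    'class="woocommerce',
    "woocommerce-page",
)

def looks_like_woocommerce(html):
    """Return True if any known WooCommerce fingerprint appears in the page."""
    return any(marker in html for marker in WOO_MARKERS)

def check_url(url, timeout=10):
    """Download one page and test it; errors are treated as 'not a shop'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            html = resp.read(200_000).decode("utf-8", errors="replace")
    except Exception:
        return False
    return looks_like_woocommerce(html)

if __name__ == "__main__":
    print(looks_like_woocommerce('<link href="/wp-content/plugins/woocommerce/a.css">'))
```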
APA, Harvard, Vancouver, ISO, and other styles
12

Anttila, Pontus. "Mot effektiv identifiering och insamling avbrutna länkar med hjälp av en spindel." Thesis, Karlstads universitet, Institutionen för matematik och datavetenskap (from 2013), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kau:diva-67842.

Full text
Abstract:
Today, the customer has no automated method for finding and collecting broken links on their website; this is done manually or not at all. This project has resulted in a practical product that can be applied to the customer's website. The aim of the product is to ease the work of collecting and maintaining broken links on the website. This is achieved by gathering all broken links efficiently and placing them in a separate list that can be exported at will by an administrator, who can then fix the broken links found. The quality of the customer's website will be higher, as all broken links will be easier to find and remove. This will ultimately give visitors a better experience.
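A minimal sketch of the core check, assuming broken links are those that answer with an HTTP error status or not at all; the product described above adds crawling, listing and export around this step. (The demo performs live network requests.)

```python
import urllib.error
import urllib.request

def link_status(url, timeout=10):
    """Return the HTTP status for a URL, or None if the host is unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code              # e.g. 404, 410, 500
    except (urllib.error.URLError, OSError):
        return None                  # DNS failure, refused connection, timeout

def collect_broken(urls):
    """Keep links that answer with a client/server error or not at all."""
    broken = []
    for url in urls:
        status = link_status(url)
        if status is None or status >= 400:
            broken.append((url, status))
    return broken

if __name__ == "__main__":
    print(collect_broken(["https://example.com/", "https://example.com/no-such-page"]))
```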
APA, Harvard, Vancouver, ISO, and other styles
13

Desai, Lovekeshkumar. "A Distributed Approach to Crawl Domain Specific Hidden Web." Digital Archive @ GSU, 2007. http://digitalarchive.gsu.edu/cs_theses/47.

Full text
Abstract:
A large amount of on-line information resides on the invisible web - web pages generated dynamically from databases and other data sources hidden from current crawlers, which retrieve content only from the publicly indexable Web. Specifically, they ignore the tremendous amount of high quality content "hidden" behind search forms, and pages that require authorization or prior registration, in large searchable electronic databases. To extract data from the hidden web, it is necessary to find the search forms and fill them with appropriate information to retrieve maximum relevant information. To meet the complex challenges that arise when attempting to search the hidden web, namely extensive analysis of search forms as well as of the retrieved information, it becomes essential to design and implement a distributed web crawler that runs on a network of workstations to extract data from the hidden web. We describe the software architecture of this distributed and scalable system and also present a number of novel techniques that went into its design and implementation to extract maximum relevant data from the hidden web while achieving high performance.
APA, Harvard, Vancouver, ISO, and other styles
14

Zemlin, Toralf. "Entwurf eines konfigurierbaren Web-Crawler-Frameworks zur weiteren Verwendung fur Single-Hosted Media Retrieval." Master's thesis, Universitätsbibliothek Chemnitz, 2008. http://nbn-resolving.de/urn:nbn:de:bsz:ch1-200801338.

Full text
Abstract:
This thesis describes a web crawler framework for the Professur Medieninformatik of the Technische Universität Chemnitz and its core implementation. The crawler traverses the WWW graph, and each document passes through various modules of the framework; a scheduling module decides the order of traversal. The focus of this development is extensibility for different variations of the data collector. It is shown which information must accompany a document for essential decisions, including recognition of already-seen documents, scheduling criteria and URL index maintenance. The framework is configurable: its core functionality is crawling, and interfaces for filter and storage components are additionally provided. The crawler has an administration interface through which it can be controlled, and status information and statistics about events and progress are provided. Finally, test criteria are presented and problems are discussed.
APA, Harvard, Vancouver, ISO, and other styles
15

Zemlin, Toralf Eibl Maximilian. "Entwurf eines konfigurierbaren Web-Crawler-Frameworks zur weiteren Verwendung für Single-Hosted Media Retrieval." [S.l. : s.n.], 2008.

Find full text
APA, Harvard, Vancouver, ISO, and other styles
16

Moravec, Petr. "Monitoring internetu a jeho přínosy pro podnikání nástroji firmy SAS Institute." Master's thesis, Vysoká škola ekonomická v Praze, 2011. http://www.nusl.cz/ntk/nusl-165263.

Full text
Abstract:
This thesis focuses on ways of obtaining information from the World Wide Web. The introduction covers theoretical approaches to data collection; its main part deals with the Web Crawler program as one option for collecting data from the internet, followed by alternative data collection methods, e.g. the Google Search API. The next part of the thesis is dedicated to SAS products and their role in reporting and internet monitoring. The SAS Intelligence Platform is presented as the crucial company platform, within which concrete SAS solutions can be found; SAS Web Crawler and Semantic Server are described within the SAS Content Categorization solution. Whilst the first two parts of the thesis are focused on theory, the third and closing part presents practical examples of internet data collection, realized mainly in SAS. The practical part of the thesis builds on the theoretical one and cannot be detached from it.
APA, Harvard, Vancouver, ISO, and other styles
17

Činčera, Jaroslav. "Pokročilý robot na procházení webu." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2010. http://www.nusl.cz/ntk/nusl-237201.

Full text
Abstract:
This Master's thesis describes the design and implementation of an advanced web crawler. The crawler can be configured by the user and is designed to browse the web according to specified parameters, and it can acquire and evaluate the content of web pages. It is configured by creating projects, which consist of different types of steps. The user can create simple actions, such as downloading a page or submitting a form, or can create more complex and larger projects.
APA, Harvard, Vancouver, ISO, and other styles
18

Lloyd, Oskar, and Christoffer Nilsson. "How to Build a Web Scraper for Social Media." Thesis, Malmö universitet, Fakulteten för teknik och samhälle (TS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20594.

Full text
Abstract:
In recent years, the act of scraping websites for information has become increasingly relevant. However, along with this increase in interest, the internet has also grown substantially and advances and improvements to websites over the years have in fact made it more difficult to scrape. One key reason for this is that scrapers simply account for a significant portion of the traffic to many websites, and so developers often implement anti-scraping measures along with the Robots Exclusion Protocol (robots.txt) to try to stymie this traffic. The popular use of dynamically loaded content – content which loads after user interaction – poses another problem for scrapers. In this paper, we have researched what kinds of issues commonly occur when scraping and crawling websites – more specifically when scraping social media – and how to solve them. In order to understand these issues better and to test solutions, a literature review was performed and design and creation methods were used to develop a prototype scraper using the frameworks Scrapy and Selenium. We found that automating interaction with dynamic elements worked best to solve the problem of dynamically loaded content. We also theorize that having an artificial random delay when scraping and randomizing intervals between each visit to a website would counteract some of the anti-scraping measures. Another, smaller aspect of our research was the legality and ethicality of scraping. Further thoughts and comments on potential solutions to other issues have also been included.
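Two of the counter-measures discussed, honouring robots.txt and randomizing the delay between requests, can be sketched with the standard library alone; the Scrapy/Selenium handling of dynamically loaded content is not shown here, and the user-agent string is a placeholder.

```python
import random
import time
import urllib.robotparser
from urllib.parse import urlsplit, urlunsplit

USER_AGENT = "example-research-bot"   # hypothetical agent string

def robots_allows(url, user_agent=USER_AGENT):
    """Check the site's robots.txt before fetching (Robots Exclusion Protocol)."""
    parts = urlsplit(url)
    robots_url = urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    try:
        parser.read()
    except OSError:
        return True                   # unreadable robots.txt: be permissive here
    return parser.can_fetch(user_agent, url)

def polite_schedule(urls, min_delay=2.0, max_delay=6.0):
    """Yield only allowed URLs, sleeping a random interval between checks
    (the randomized delay suggested against anti-scraping heuristics)."""
    for url in urls:
        if robots_allows(url):
            yield url
        time.sleep(random.uniform(min_delay, max_delay))
```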
APA, Harvard, Vancouver, ISO, and other styles
19

RODRIGUES, Thiago Gomes. "ARAPONGA: Uma Ferramenta de Apoio a Recuperação de Informação na Web voltado a Segurança de Redes e Sistemas." Universidade Federal de Pernambuco, 2012. https://repositorio.ufpe.br/handle/123456789/11367.

Full text
Abstract:
The security of computer networks and systems is currently one of the greatest concerns. As the number of computer users grows, the number of security incidents grows as well. The lack of security-oriented behaviour regarding the use of hardware, e-mail or program configuration makes it easier for malicious code to be deployed. The impact of exploiting software vulnerabilities or flaws has gradually increased and caused enormous losses around the world. Disclosing these vulnerabilities and good security practices has been one of the solutions to this problem, since it allows network and system administrators to acquire relevant information to mitigate the impact of malicious activity. Noting that publishing security information is one way to combat malicious activities and to reduce the impact of a successful exploitation, several organizations have decided to publish this kind of content. These repositories are spread across different Web sites, which means that network and system administration teams spend a long time searching for the information needed to solve their problems. Moreover, merely exposing the content is not in itself enough to solve those problems. Based on this scenario, this master's work proposes a system to support Web information retrieval focused on network and system security.
APA, Harvard, Vancouver, ISO, and other styles
20

Yu, Liyang. "An Indexation and Discovery Architecture for Semantic Web Services and its Application in Bioinformatics." Digital Archive @ GSU, 2006. http://digitalarchive.gsu.edu/cs_theses/20.

Full text
Abstract:
Recently much research effort has been devoted to the discovery of relevant Web services. It is widely recognized that adding semantics to service descriptions is the solution to this challenge. Web services with explicit semantic annotation are called Semantic Web Services (SWS). This research proposes an indexation and discovery architecture for SWS, together with a prototype application in the area of bioinformatics. In this approach, a SWS repository is created and maintained by crawling both ontology-oriented UDDI registries and Web sites that host SWS. For a given service request, the proposed system invokes the matching algorithm and a candidate set is returned, with different degrees of matching considered. This approach can add more flexibility to the current industry standards by offering more choices to both the service requesters and publishers. Also, the prototype developed in this research shows the value that can be added by using SWS in application areas such as bioinformatics.
APA, Harvard, Vancouver, ISO, and other styles
21

Lat, Radek. "Nástroj pro automatické kategorizování webových stránek." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-236054.

Full text
Abstract:
This master's thesis describes the design and implementation of a tool for automatic categorization of web pages. The goal is for the tool to learn, from sample web pages, what each category looks like; it should then be able to assign the learned categories to previously unseen web pages. The tool should support multiple categories and languages. Advanced techniques of machine learning, language detection and data mining were used in its development. The tool is based on open source libraries and is written in Python 3.3.
APA, Harvard, Vancouver, ISO, and other styles
22

Mfenyana, Sinesihle Ignetious. "Implementation of a facebook crawler for opinion monitoring and trend analysis purposes: a case study of government service delivery in Dwesa." Thesis, University of Fort Hare, 2014. http://hdl.handle.net/10353/d1016067.

Full text
Abstract:
The Internet has shifted from the Web 1.0 era to the Web 2.0 era. In the contemporary era of Web 2.0, the Internet is being used to build and reflect social relationships among people who share similar interests and activities. This is done through services such as Social Networking Sites (Facebook, Twitter, etc.) and web blogs. Currently, there is very high usage of Social Networking Sites (SNSs) and blogs where people share their views, opinions, and thoughts. This leads to the production of a lot of data by people who post such content on SNSs. As a result, SNSs and blogs become ideal platforms for opinion monitoring and trend analysis. These SNSs and blogs could be used by service providers for tracking what the public thinks or requires, because having such knowledge can help in decision making and future planning. If service providers can keep track of such views, opinions or thoughts with regard to the services they provide, they can better understand the public's or clients' needs and improve the provision of relevant services. This research project presents a system prototype for performing opinion monitoring and trend analysis on Facebook. The proposed system crawls Facebook, indexes the data and provides a user interface (UI) where end users can search and see the trending of topics of their choice. The system prototype can also be used to check trending topics without having to search. The main objective of this research project was to develop a framework that contributes to improving the way government officials, companies or any service providers and ordinary citizens communicate regarding the services provided. This research project is premised on the idea that if government officials, companies or any service providers can keep track of citizens' opinions, views and thoughts with regard to the services they provide, this can help improve the delivery of such services. This research and the implementation of the trend analysis tool were undertaken in the context of the Siyakhula Living Lab (SLL), an Information and Communication Technologies for Development (ICTD) intervention for the Dwesa marginalized community.
APA, Harvard, Vancouver, ISO, and other styles
23

Toufik, Bennouas. "Modélisation de parcours du Web et calcul de communautés par émergence." Phd thesis, Université Montpellier II - Sciences et Techniques du Languedoc, 2005. http://tel.archives-ouvertes.fr/tel-00137084.

Full text
Abstract:
The Web graph, and more precisely the crawl by which it is obtained and the communities it contains, is the subject of this thesis, which is divided into two parts.
The first part analyses large interaction networks and introduces a new model of Web crawls. It begins by defining the common properties of interaction networks, then presents some random graph models that generate graphs similar to interaction networks. Finally, it proposes a new model of random crawls.
The second part proposes two models for computing communities by emergence in the Web graph. After a review of importance measures, PageRank and HITS, the gravitational model is presented, in which the nodes of a network are mobile and interact with one another through the links between them. Communities emerge quickly after a few iterations. The second model is an improvement of the first: the nodes of the network are given an objective, namely to reach their community.
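For reference, the PageRank importance measure mentioned above can be written as a short power iteration; this is the standard algorithm only, not the thesis's gravitational community model.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict {node: [outlinks]}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v, outlinks in graph.items():
            if outlinks:
                share = damping * rank[v] / len(outlinks)
                for w in outlinks:
                    new_rank[w] = new_rank.get(w, 0.0) + share
            else:
                # dangling node: spread its rank uniformly over all nodes
                for w in nodes:
                    new_rank[w] += damping * rank[v] / n
        rank = new_rank
    return rank

if __name__ == "__main__":
    web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print(pagerank(web))
```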
APA, Harvard, Vancouver, ISO, and other styles
24

Matulionis, Paulius. "Veiksmų ontologijos formavimas panaudojant internetinį tekstyną." Master's thesis, Lithuanian Academic Libraries Network (LABT), 2012. http://vddb.laba.lt/obj/LT-eLABa-0001:E.02~2012~D_20120620_113255-46777.

Full text
Abstract:
The goal of this master's thesis is to investigate the problem of automated action ontology design using a corpus harvested from the internet. A software package, including tools for internet corpus harvesting, network service access, markup, ontology design and representation, was developed and tested in the experiments carried out. A process management system was realized, covering both front-end and back-end design levels. Detailed system and component models are presented, reflecting all the operations of the system. The thesis presents the results of experiments on building ontologies for several selected action verbs. The ontology building process is described, problems in recognizing separate elements of the action environment are analysed, and suggestions for additional rules leading to more accurate results are presented. The rules used to extract the required data from the harvested corpus have been summarized and integrated into the designed software package in a form the tools can process.
APA, Harvard, Vancouver, ISO, and other styles
25

McLearn, Greg. "Autonomous Cooperating Web Crawlers." Thesis, University of Waterloo, 2002. http://hdl.handle.net/10012/1080.

Full text
Abstract:
A web crawler provides an automated way to discover web events: creation, deletion, or updates of web pages. Competition among web crawlers results in redundant crawling, wasted resources, and less-than-timely discovery of such events. This thesis presents a cooperative sharing crawler algorithm and sharing protocol. Without resorting to altruistic practices, competing (yet cooperative) web crawlers can mutually share discovered web events with one another to maintain a more accurate representation of the web than is currently achieved by traditional polling crawlers. The choice to share or merge is entirely up to an individual crawler: sharing is the act of allowing a crawler M to access another crawler's web-event data (call this crawler S), and merging occurs when crawler M requests web-event data from crawler S. Crawlers can choose to share with competing crawlers if it can help reduce contention between peers for resources associated with the act of crawling. Crawlers can choose to merge from competing peers if it helps them to maintain a more accurate representation of the web at less cost than directly polling web pages. Crawlers can control how often they choose to merge through the use of a parameter ρ, which dictates the percentage of time spent either polling or merging with a peer. Depending on certain conditions, pathological behaviour can arise if polling or merging is the only form of data collection. Simulations of communities of simple cooperating web crawlers successfully show that a combination of polling and merging (0 < ρ < 1) can allow an individual member of the cooperating community a higher degree of accuracy in their representation of the web as compared to a traditional polling crawler. Furthermore, if web crawlers are allowed to evaluate their own performance, they can dynamically switch between periods of polling and merging to still perform better than traditional crawlers. The mutual performance gain increases as more crawlers are added to the community.
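A toy reading of the ρ parameter, assuming placeholder data structures rather than the thesis's sharing protocol: with probability ρ a crawler merges a peer's web-event data, otherwise it polls a page itself.

```python
import random

class CooperatingCrawler:
    """Toy model of the poll/merge mix controlled by rho (0 <= rho <= 1):
    each step, merge web-event data from a peer with probability rho,
    otherwise poll one page directly. Placeholder structures only."""

    def __init__(self, rho, poll_source, peer):
        self.rho = rho
        self.poll_source = poll_source    # callable returning (url, version)
        self.peer = peer                  # another CooperatingCrawler or None
        self.events = {}                  # url -> latest observed version

    def step(self):
        if self.peer is not None and random.random() < self.rho:
            # merge: copy any newer web events the peer has shared
            for url, version in self.peer.events.items():
                if version > self.events.get(url, -1):
                    self.events[url] = version
        else:
            url, version = self.poll_source()
            self.events[url] = max(version, self.events.get(url, -1))

if __name__ == "__main__":
    versions = {"p1": 3, "p2": 1}
    poll = lambda: random.choice(list(versions.items()))
    a = CooperatingCrawler(0.0, poll, None)    # pure polling crawler
    b = CooperatingCrawler(0.5, poll, a)       # polls and merges from a
    for _ in range(20):
        a.step()
        b.step()
    print(a.events, b.events)
```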
APA, Harvard, Vancouver, ISO, and other styles
26

Silva, Carlos Jesús Hernández da. "Geração automática de conteúdo audiovisual informativo para seniores." Master's thesis, Universidade de Aveiro, 2017. http://hdl.handle.net/10773/22543.

Full text
Abstract:
Master's in Multimedia Communication
Globally, modern societies are ageing, and their needs and difficulties, namely informative ones, are not fully met. One of the main challenges to be addressed, in parallel with social, political, economic and technological evolution, is to define strategies for active ageing, at the community and individual level, that allow for continuous and meaningful civic participation of present and future generations. The investigation described here concerns the development of an interactive television application as a vehicle for the dissemination of information about social services that support seniors, within the +TV4E project. The aim is to design and develop a technological solution capable of automatically creating content that meets the information needs of Portuguese seniors regarding, for instance, social, economic or meteorological data, taking the specificities of this audience into account. The iTV solution, delivered through a set-top-box, is based on an application that enriches the television broadcast with informative content suited to each set-top-box's profile and preferences, such as its geographic location or viewing behaviour. The intention is that, during a television broadcast and upon prior notice, audio-visual informative content about social and public services is shown, built following a defined structure and generated automatically from content gathered online from different web services.
APA, Harvard, Vancouver, ISO, and other styles
27

Wan, Shengye. "Protecting Web Contents Against Persistent Crawlers." W&M ScholarWorks, 2016. https://scholarworks.wm.edu/etd/1477068008.

Full text
Abstract:
Web crawlers have been developed for several malicious purposes, such as downloading server data without permission from the website administrator. Armored stealthy crawlers are evolving against new anti-crawler mechanisms in the arms race between crawler developers and crawler defenders. In this paper, we develop a new anti-crawler mechanism called PathMarker to detect and constrain crawlers that crawl the content of servers stealthily and persistently. The basic idea is to add a marker to each web page URL and then encrypt the URL and marker. By using the URL path and user information contained in the marker as novel features for machine learning, we can accurately detect stealthy crawlers at the earliest stage. Besides effectively detecting crawlers, PathMarker can also dramatically suppress the efficiency of crawlers before they are detected, by misleading crawlers into visiting the same page's URL with different markers. We deploy our approach on a forum website to collect normal users' data. The evaluation results show that PathMarker can quickly capture all 12 open-source and in-house crawlers, plus two external crawlers (i.e., Googlebots and Yahoo Slurp).
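The marker idea can be sketched as follows, with one simplification flagged in the comments: instead of encrypting the URL and marker as PathMarker does, this illustration only appends a base64-encoded, HMAC-authenticated marker; the secret key and parameter names are assumptions.

```python
import base64
import hashlib
import hmac
from urllib.parse import parse_qs, urlsplit

SECRET = b"server-side-secret"   # hypothetical key held by the web server

def add_marker(url, user_id, parent_path):
    """Append a marker carrying the visiting user and the page the link was
    embedded in. Simplification: the original system encrypts the URL and
    marker; here the marker is only encoded and HMAC-protected."""
    marker = base64.urlsafe_b64encode(f"{user_id}|{parent_path}".encode()).decode()
    tag = hmac.new(SECRET, marker.encode(), hashlib.sha256).hexdigest()[:24]
    sep = "&" if "?" in url else "?"
    return f"{url}{sep}m={marker}&t={tag}"

def read_marker(marked_url):
    """Verify and decode the marker; returns None if it is missing or forged."""
    params = parse_qs(urlsplit(marked_url).query)
    marker = params.get("m", [""])[0]
    tag = params.get("t", [""])[0]
    expected = hmac.new(SECRET, marker.encode(), hashlib.sha256).hexdigest()[:24]
    if not marker or not hmac.compare_digest(tag, expected):
        return None
    user_id, _, parent_path = base64.urlsafe_b64decode(marker.encode()).decode().partition("|")
    return user_id, parent_path

if __name__ == "__main__":
    marked = add_marker("https://forum.example/thread/42", "alice", "/index")
    print(marked)
    print(read_marker(marked))          # ('alice', '/index')
    print(read_marker(marked + "0"))    # tampered tag -> None
```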
APA, Harvard, Vancouver, ISO, and other styles
28

Wara, Ummul. "A Framework for Fashion Data Gathering, Hierarchical-Annotation and Analysis for Social Media and Online Shop : TOOLKIT FOR DETAILED STYLE ANNOTATIONS FOR ENHANCED FASHION RECOMMENDATION." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-234285.

Full text
Abstract:
Due to the transformation of recommendation systems from content-based to hybrid cross-domain-based, there is a need to prepare a social-network dataset that provides sufficient data as well as detail-level annotation from a predefined hierarchical clothing category and attribute-based vocabulary, taking user interactions into account. However, existing fashion-based datasets lack either a hierarchical-category based representation or the user interactions of a social network. This thesis presents two datasets: one from the photo-sharing platform Instagram, which gathers fashionistas' images with all possible user interactions, and another from the online shop Zalando with every cloth's details. We present the design of a customized crawler that enables the user to crawl data based on category or attributes. Moreover, an efficient and collaborative web solution is designed and implemented to facilitate large-scale, hierarchical, category-based, detail-level annotation of the Instagram data. By considering all user interactions, the developed solution provides a detail-level annotation facility that reflects the user's preferences. The web solution is evaluated by the team as well as through the Amazon Turk Service. The annotated output from different users proves the usability of the web solution in terms of availability and clarity. In addition to data crawling and annotation web-solution development, this project analyzes the Instagram and Zalando data distributions in terms of cloth category, subcategory and pattern to provide meaningful insight into the data. The research community will benefit from these datasets when working with a richly annotated dataset that represents a social network and contains detailed cloth information.
APA, Harvard, Vancouver, ISO, and other styles
29

Castillejo, Sierra Miguel. "Redes temáticas en la web: estudio de caso de la red temática de la transparencia en Chile." Doctoral thesis, Universitat Pompeu Fabra, 2016. http://hdl.handle.net/10803/378362.

Full text
Abstract:
The object of study of this research is Issue Networks, namely the issue networks on the Web and their potential to extract objective data from the opinion flows generated around an issue of discussion or social controversy. This research is structured around four objectives: characterizing the components of issue networks; characterizing and evaluating the tools for the analysis of issue networks on the web; designing an Analysis System for Issue Networks on the Web; and, lastly, applying the Analysis System to the case study of the Issue Network for Transparency in Chile. In conclusion, we present and characterize the components of an issue network on the web: hyperlink networks, actors and issues; we analyse the results of the evaluation of the tools we consider most suitable for the analysis of issue networks on the web: IssueCrawler, SocSciBot, Webometric Analyst and VOSON; we build an analysis system divided into three phases: hyperlink network analysis, actor analysis and issue analysis; and finally we discuss the results of the analysis of the Issue Network for Transparency in Chile and possible future developments of the research.
APA, Harvard, Vancouver, ISO, and other styles
30

Josefsson, Ågren Fredrik, and Oscar Järpehult. "Characterizing the Third-Party Authentication Landscape : A Longitudinal Study of how Identity Providers are Used in Modern Websites." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-178035.

Full text
Abstract:
Third-party authentication services are becoming more common since they ease the login procedure by not forcing users to create a new login for every website that uses authentication. Even though this simplifies the login procedure, users still have to be conscious about what data is being shared between the identity provider (IDP) and the relying party (RP). This thesis presents a tool for collecting data about third-party authentication that outperforms previously made tools with regard to accuracy, precision and recall. The developed tool was used to collect information about third-party authentication on a set of websites. The collected data revealed that the third-party login services offered by Facebook and Google are most common and that Twitter's login service is significantly less common. Twitter's login service shares the most data about the users with the RPs and often gives the RPs permission to perform write actions on the user's Twitter account. In addition to our large-scale automatic data collection, three manual data collections were performed and compared to previously made manual data collections from a nine-year period. The longitudinal comparison showed that over the nine-year period the login services offered by Facebook and Google have been dominant. It is clear that less information about the users is being shared today compared to earlier years for Apple, Facebook and Google. The Twitter login service is the only IDP that has not changed its permission policies, which could be the reason why the usage of the Twitter login service on websites has decreased. The results presented in this thesis help provide a better understanding of what personal information is exchanged by IDPs, which can guide users to make well-educated decisions on the web.
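A sketch of the detection step such a tool needs: scan a page for known identity-provider endpoints. The fingerprint patterns below are assumptions for illustration, not the detection rules used in the thesis.

```python
import re

# Hypothetical fingerprints for common identity providers; a real tool would
# also follow redirects and inspect OAuth authorization requests.
IDP_PATTERNS = {
    "Facebook": re.compile(r"facebook\.com/(v\d+\.\d+/)?dialog/oauth|connect\.facebook\.net", re.I),
    "Google":   re.compile(r"accounts\.google\.com/o/oauth2|gsi/client", re.I),
    "Twitter":  re.compile(r"api\.twitter\.com/oauth|twitter\.com/i/oauth2", re.I),
    "Apple":    re.compile(r"appleid\.apple\.com/auth/authorize", re.I),
}

def detect_idps(html):
    """Return the set of identity providers whose endpoints appear in the page."""
    return {name for name, pattern in IDP_PATTERNS.items() if pattern.search(html)}

if __name__ == "__main__":
    sample = '<a href="https://accounts.google.com/o/oauth2/v2/auth?client_id=x">Sign in</a>'
    print(detect_idps(sample))
```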
APA, Harvard, Vancouver, ISO, and other styles
31

Rude, Howard Nathan. "Intelligent Caching to Mitigate the Impact of Web Robots on Web Servers." Wright State University / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=wright1482416834896541.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

Yan, Hui. "Data analytics and crawl from hidden web databases." Thesis, University of Macau, 2015. http://umaclib3.umac.mo/record=b3335862.

Full text
APA, Harvard, Vancouver, ISO, and other styles
33

Mir, Taheri Seyed Mohammad. "Distributed Crawling of Rich Internet Applications." Thesis, Université d'Ottawa / University of Ottawa, 2015. http://hdl.handle.net/10393/32089.

Full text
Abstract:
Web crawlers visit internet applications, collect data, and learn about new web pages from visited pages. Web crawlers have a long and interesting history. The quick expansion of the web and the complexity added to web applications have made the process of crawling a very challenging one. Different solutions have been proposed to reduce the time and cost of crawling. The new generation of web applications, known as Rich Internet Applications (RIAs), poses major challenges to web crawlers. RIAs shift a portion of the computation to the client side. Shifting a portion of the application to the client browser influences the web crawler in two ways: First, the one-to-one correlation between the URL and the state of the application, that exists in traditional web applications, is broken. Second, reaching a state of the application is no longer a simple operation of navigating to the target URL, but often means navigating to a seed URL and executing a chain of events from it. Due to these challenges, crawling a RIA can take a prohibitively long time. This thesis studies applying distributed computing and parallel processing principles to the field of RIA crawling to reduce the time. We propose different algorithms to concurrently crawl a RIA over several nodes. The proposed algorithms are used as a building block to construct a distributed crawler of RIAs. The different algorithms proposed represent different trade-offs between communication and computation. This thesis explores the effect of making different trade-offs and their effect on the time it takes to crawl RIAs. We study the cost of running a distributed RIA crawl with client-server architecture and compare it with a peer-to-peer architecture. We further study distribution of different crawling strategies, namely: Breadth-First search, Depth-First search, Greedy algorithm, and Probabilistic algorithm. To measure the effect of different design decisions in practice, a prototype of each algorithm is implemented. The implemented prototypes are used to obtain empirical performance measurements and to refine the algorithms. The ultimate refined algorithm is used for experimentation with a wide range of applications under different circumstances. This thesis finally includes two theoretical studies of load balancing algorithms and distributed component-based crawling and sets the stage for future studies.
APA, Harvard, Vancouver, ISO, and other styles
34

Ferreira, Juliana Sabino. "Uma abordagem para captura automatizada de dados abertos governamentais." Universidade Federal de São Carlos, 2017. https://repositorio.ufscar.br/handle/ufscar/9246.

Full text
Abstract:
No funding was received.
Open government data currently play an important role in public transparency and are also required by law. However, most of these data are published in diverse, isolated, and independent formats, which makes them very hard for third-party systems to reuse. This work proposes an approach for capturing open government data in an automated way, allowing their reuse in other applications. To that end, a Web Crawler was built to capture and store open government data (Dados Abertos Governamentais, DAG), together with the DAG Prefeituras API, which makes the data available in JSON format so that developers can easily use them in their applications. An evaluation of the API was also performed with developers of different levels of experience.
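As a rough illustration of the crawl-then-serve pattern described above, the following Python sketch fetches dataset links from a portal page, stores them in SQLite, and exposes them as JSON through a small Flask API. The portal URL, table layout, and endpoint name are hypothetical; the thesis's actual DAG Prefeituras API is not reproduced here.

import sqlite3
import requests
from bs4 import BeautifulSoup
from flask import Flask, jsonify

DB = "dag.db"
PORTAL = "https://example.gov.br/dados"   # hypothetical open-data portal page

def crawl():
    # Fetch the portal page and store every dataset link found on it
    conn = sqlite3.connect(DB)
    conn.execute("CREATE TABLE IF NOT EXISTS dataset (title TEXT, url TEXT)")
    soup = BeautifulSoup(requests.get(PORTAL, timeout=30).text, "html.parser")
    for a in soup.select("a[href]"):
        conn.execute("INSERT INTO dataset VALUES (?, ?)",
                     (a.get_text(strip=True), a["href"]))
    conn.commit()
    conn.close()

app = Flask(__name__)

@app.route("/datasets")
def datasets():
    # Expose the captured records as JSON for third-party applications
    rows = sqlite3.connect(DB).execute("SELECT title, url FROM dataset").fetchall()
    return jsonify([{"title": t, "url": u} for t, u in rows])

if __name__ == "__main__":
    crawl()          # a real system would schedule this periodically
    app.run()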
APA, Harvard, Vancouver, ISO, and other styles
35

Romandini, Nicolò. "Evaluation and implementation of reinforcement learning and pattern recognition algorithms for task automation on web interfaces." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021.

Find full text
Abstract:
Automated task execution in a web context is a major challenge today. One of the main fields in which it is needed is undoubtedly Information Security, where it is becoming increasingly necessary to find techniques that allow security tests to be carried out without human intervention, not only to relieve programmers from repetitive tasks, but above all to perform many more tests in the same amount of time. Although techniques already exist to automate the execution of actions on web interfaces, these solutions are often limited to the environment for which they were designed; they cannot execute the learnt behaviour in different and unseen environments. The aim of this thesis project is to analyse different Machine Learning techniques in order to find an optimal solution to this problem, that is, an agent capable of executing a task in every environment in which it operates. The approaches analysed and implemented belong to two areas of Machine Learning: Reinforcement Learning and Pattern Recognition. Each approach was tested using real web applications in order to measure its abilities in a context as close to reality as possible. Although the Reinforcement Learning approaches were found to be the most automated, they failed to achieve satisfactory results. In contrast, the Pattern Recognition approach proved the most capable of executing tasks, even complex ones, in different and unseen environments, although it requires a lot of preliminary work.
APA, Harvard, Vancouver, ISO, and other styles
36

Kemmer, Julian. "Der Sandmann : von E.T.A. Hoffmann bis Freddy Krüger." Thesis, Högskolan Dalarna, Tyska, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-31682.

Full text
Abstract:
Based on E.T.A. Hoffmann's figure of the Sandman, this thesis looks for horror figures in the modern era that can likewise be regarded as Sandmen. The figures compared with the original from Dark Romanticism come from different media: not only books, but also music, the graphic novel, and film. In an intermedial analysis, it is examined what each Sandman looks like, which characteristics he exhibits, and how he behaves. The selected Sandmen come partly from the European and partly from the American sphere.
APA, Harvard, Vancouver, ISO, and other styles
37

Tsai, Jing-Ru, and 蔡京儒. "Combining BDI Agent with Web Crawler." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/81206412170578655278.

Full text
Abstract:
Master's thesis, National Central University, Department of Computer Science and Information Engineering, academic year 104.
How can a user deal with the huge amount of information and pick out what is useful in this big data era? We take the approach of combining a web crawler with a multi-agent system, which is regarded as a suitable way to develop an intelligent software system. This research uses the Java Agent DEvelopment Framework (JADE) as the underlying platform, on top of which the belief-desire-intention (BDI) model is added to empower agents with thinking ability. Further, we use a web crawler to crawl the web page information that a particular agent needs. Combining a BDI agent with a web crawler thus forms our model. The advantage of this approach is that the web page search strategy of the BDI agent plus crawler can adjust dynamically to changes in the environment, so an agent browses web page information much like a real person would.
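The thesis builds on JADE, which is a Java framework; purely as a language-agnostic illustration of the idea, the Python sketch below runs a toy belief-desire-intention cycle whose committed intention is a crawl step. All names, the topic check, and the single-desire setup are simplifying assumptions, not the thesis's design.

import requests
from bs4 import BeautifulSoup

class CrawlingBDIAgent:
    # Toy BDI cycle: beliefs drive deliberation, deliberation commits to a crawl intention.
    def __init__(self, topic, seed_url):
        self.beliefs = {"topic": topic, "frontier": [seed_url], "pages": {}}
        self.desires = ["collect_topic_pages"]
        self.intentions = []

    def deliberate(self):
        # Choose an intention given current beliefs and desires
        if "collect_topic_pages" in self.desires and self.beliefs["frontier"]:
            self.intentions.append(("crawl", self.beliefs["frontier"].pop(0)))

    def act(self):
        # Execute committed intentions: fetch the page and update beliefs
        while self.intentions:
            _, url = self.intentions.pop(0)
            soup = BeautifulSoup(requests.get(url, timeout=15).text, "html.parser")
            if self.beliefs["topic"].lower() in soup.get_text().lower():
                title = soup.title.string if soup.title and soup.title.string else url
                self.beliefs["pages"][url] = title

    def run(self, steps=5):
        for _ in range(steps):
            self.deliberate()
            self.act()
        return self.beliefs["pages"]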
APA, Harvard, Vancouver, ISO, and other styles
38

Lai, Jui-fu, and 賴睿甫. "NUBot, a Client Based Web Crawler." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/41028393316471275818.

Full text
Abstract:
Master's thesis, National Chung Cheng University, Institute of Computer Science and Information Engineering, academic year 97.
As the internet grows, a huge number of web pages are created every day, and it becomes a tough challenge to build a search engine that can scale with the growth of the web. In this thesis, we propose a new data crawler architecture that lets a search engine crawl the vast space of web pages more efficiently. Under this architecture, we implemented a prototype data crawler that increases the crawling domain and performs crawling more effectively and efficiently. We achieve this goal by distributing the crawlers around the world so that each crawls web pages close to it, minimizing the access overhead between the crawler and the crawled pages. Each crawler compresses the data and then transfers it to the master servers to reduce bandwidth overhead. The master uses a SeenDB to filter out already-crawled URLs and a DN2IP (DomainName to IP) process to resolve the IP for each URL. New URLs are then placed in a URL pool implemented with multiple queues, waiting for their turn to be crawled.
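The SeenDB filter, the per-host URL pool with multiple queues, and the client-side compression are the concrete mechanisms here; a minimal Python sketch of those three ideas follows. The in-memory structures are illustrative simplifications of what the thesis's master server would keep persistently.

import gzip
import hashlib
from collections import defaultdict, deque
from urllib.parse import urlparse

class SeenDB:
    # Remembers fingerprints of URLs that have already been dispatched
    def __init__(self):
        self._seen = set()

    def add_if_new(self, url):
        digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True

class URLPool:
    # One FIFO queue per host, so a single site cannot monopolise the crawl
    def __init__(self):
        self._queues = defaultdict(deque)

    def push(self, url):
        self._queues[urlparse(url).netloc].append(url)

    def pop_round_robin(self):
        for host, queue in list(self._queues.items()):
            if queue:
                yield queue.popleft()

def pack_for_master(pages):
    # Client-side crawlers compress crawled pages before shipping them to the master
    return gzip.compress("\n".join(pages).encode("utf-8"))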
APA, Harvard, Vancouver, ISO, and other styles
39

Yang, Jian-Xin, and 楊健鑫. "Intelligent Customer Service based on Web Crawler." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/f8dv9d.

Full text
Abstract:
Master's thesis, Tatung University, Department of Computer Science and Information Engineering, academic year 107.
In the past one or two years, smart customer service has gradually become popular: telecom companies, banks, government agencies, and others have all exhibited smart customer service systems. With the increasing adoption of IPv6 and the expected transition to 5G in the next decade, the Internet will carry a huge amount of information and become the world's largest encyclopedia. If this encyclopedia can serve as the knowledge base for customer service, customer service can become smarter. Therefore, one focus of this research is to reorganize websites in various fields into dialogue trees by means of web crawlers, to serve as knowledge bases for those fields. Most current smart customer service systems are oriented toward a target knowledge field, such as after-sales service, financial question answering, or disease inquiry, so special training is needed for each target area. This study applies Seq2Seq to smart customer service and adds well-received techniques such as the Attention Model and bi-directional LSTM. It aims to build a general-purpose smart customer service in which the computer learns the grammar of the language directly, rather than the knowledge base of a specific field, so that all sentences in the language can be understood regardless of domain, making it easier for smart customer service to enter people's lives. The Seq2Seq model relaxes the constraint that RNN/LSTM input and output must have the same length, and the Attention Model is applied to Seq2Seq: by adding and improving the Context Vector, the problem of accuracy degrading as sentence length grows is addressed. Finally, bi-directional LSTM is added to improve accuracy on sentences containing polysemous words. The experiments were divided into four groups. The first group aimed to show that the neural network model used in this study (Seq2Seq + Attention Model + bi-directional LSTM) is superior to other neural network models for natural language processing. The models (LSTM, Seq2Seq, Seq2Seq + Attention Model) were trained on 5,000 questions from an open-source Chinese corpus [25] and tested on 1,000 questions different from the training data; an output is judged correct if the intent and entities meet expectations. The accuracy rates were 63.4%, 69.2%, 76.1%, and 82.1% for LSTM, Seq2Seq, Seq2Seq + Attention Model, and Seq2Seq + Attention Model + bi-directional LSTM, respectively. The second group compared RasaNLU with this study on the target knowledge field, to verify whether the accuracy of the neural network model in the target field is better than that of traditional statistics-based machine learning: 5,000 records from the Taiwan water company were used as training data and 1,000 water-related questions for testing, and the accuracy rates of RasaNLU and this study were 86.4% and 87.1%, respectively. The third group tested RasaNLU and this study on question answering outside the target field, to verify whether the generality of the neural network model is better than that of traditional statistics-based machine learning: 5,000 entries from popular websites, including mainland Chinese platforms (Weibo, Tieba, Douban) and Taiwan's PTT Gossiping board, together with eight open-source chat corpora, were used as training data, and 1,000 questions from the same corpora but different from the training data were used for testing; the accuracy rates of RasaNLU and this study were 46.3% and 83.2%, respectively. The last group compared this study with Xiaomi's Xiao Ai, Google Assistant, Siri, and Samsung Bixby. The first group of experiments shows that combining Seq2Seq with the Attention Model and bi-directional LSTM does give better accuracy than the other neural networks. In the second and third groups, this study is similar to traditional machine learning in accuracy within the target knowledge field, but its accuracy in the general field is greatly improved, and in either case this study offers better subsequent scalability. The fourth group, compared with the voice assistants on the market, shows that although this study cannot provide an accurate answer to fuzzy questions, it has better accuracy in professional fields.
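To make the Seq2Seq + Attention + bi-directional LSTM combination concrete, here is a compact Keras sketch of such an encoder-decoder; vocabulary and layer sizes are hypothetical, and the thesis's actual architecture and training setup are not reproduced.

from tensorflow.keras.layers import (Input, Embedding, LSTM, Bidirectional,
                                     Dense, Concatenate, Attention)
from tensorflow.keras.models import Model

VOCAB, EMB, UNITS = 8000, 128, 256   # hypothetical sizes

# Encoder: bi-directional LSTM over the question tokens
enc_in = Input(shape=(None,), name="question")
enc_emb = Embedding(VOCAB, EMB, mask_zero=True)(enc_in)
enc_seq, fh, fc, bh, bc = Bidirectional(
    LSTM(UNITS, return_sequences=True, return_state=True))(enc_emb)
state_h = Concatenate()([fh, bh])
state_c = Concatenate()([fc, bc])

# Decoder: LSTM initialised with the encoder state, plus dot-product attention
dec_in = Input(shape=(None,), name="answer_so_far")
dec_emb = Embedding(VOCAB, EMB, mask_zero=True)(dec_in)
dec_seq, _, _ = LSTM(2 * UNITS, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])
context = Attention()([dec_seq, enc_seq])          # query = decoder, value = encoder
out = Dense(VOCAB, activation="softmax")(Concatenate()([dec_seq, context]))

model = Model([enc_in, dec_in], out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()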
APA, Harvard, Vancouver, ISO, and other styles
40

HUANG, WEI-LIN, and 黃威霖. "Constructing Data Visualization Query Systems with Web Crawler." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/wpbt86.

Full text
Abstract:
Master's thesis, National Kaohsiung Marine University, Institute of Maritime Information Technology, academic year 106.
Many stock investment methods have been proposed that rely on only one or a few stock indices. To counter this problem and to construct a simple investing strategy, this study combined news, fundamental, technical, and chip analysis into a Standard Operating Procedure (SOP). The data were collected with an R web crawler from the Taipei Exchange and the Taiwan Stock Exchange. The proposed method uses technical and chip analysis to filter investing stocks. The results are recorded and the return of the proposed method is compared with that of TAIEX; the return of the proposed method beats that of TAIEX by approximately 500%. Finally, this study uses Power BI to visualize the results and construct a simple investment system, which investors can use to get an investing recommendation for the next trading day.
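The thesis performs the crawling and screening in R with Power BI for visualization; as a hedged, language-swapped illustration of the screening step only, the Python sketch below applies a simple moving-average crossover filter to a hypothetical CSV of daily prices. The file name, columns, and the crossover rule are assumptions standing in for the study's actual technical and chip rules.

import pandas as pd

# Hypothetical CSV of daily prices per stock: columns date, ticker, close
prices = pd.read_csv("daily_prices.csv", parse_dates=["date"])

def sma_crossover(df, short=5, long=20):
    # True if the short moving average just crossed above the long one
    df = df.sort_values("date").copy()
    df["sma_s"] = df["close"].rolling(short).mean()
    df["sma_l"] = df["close"].rolling(long).mean()
    prev, last = df.iloc[-2], df.iloc[-1]
    return prev["sma_s"] <= prev["sma_l"] and last["sma_s"] > last["sma_l"]

picks = [t for t, g in prices.groupby("ticker") if len(g) > 20 and sma_crossover(g)]
print("candidates for the next trading day:", picks)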
APA, Harvard, Vancouver, ISO, and other styles
41

Jao, Jui-Chien, and 饒瑞謙. "VulCrawl: Adaptive Entry Point Crawler for Web Vulnerability Scanner." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/vt26t7.

Full text
APA, Harvard, Vancouver, ISO, and other styles
42

Chang, Hao, and 張皓. "GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/3vtz8y.

Full text
Abstract:
Master's thesis, National Central University, Department of Civil Engineering, academic year 104.
With the advance of World-Wide Web (WWW) technology, people can easily share content on the Web, including geospatial data and web services. As geospatial resources are published at an ever-increasing speed, "big geospatial data management" issues are attracting attention. Among these issues, this research focuses on discovering distributed geospatial resources. Because resources are scattered across the globally distributed WWW, users have difficulty finding the resources they need. While the WWW has Web search engines addressing web resource discovery, we envision that the geospatial Web (GeoWeb) also requires GeoWeb search engines for users to find GeoWeb resources efficiently. One of the first steps in realizing a GeoWeb search engine is to proactively discover GeoWeb resources on the WWW. Hence, in this study, we propose the GeoWeb Crawler, an extensible web crawling framework that can find various types of GeoWeb resources, such as Open Geospatial Consortium (OGC) web services, Keyhole Markup Language (KML) files, and ESRI Shapefiles. In addition, to improve performance, we apply distributed computing so that the framework scales horizontally with ease; with 8 machines we obtained a 13-fold improvement in the crawling process. Furthermore, while regular web crawlers are ideal for discovering resources with hyperlinks, the GeoWeb Crawler uses customized connectors to find resources hidden behind open or proprietary web services. The results show that for 10 targeted open-standard-based resource types and 3 non-open-standard-based resource types, the GeoWeb Crawler discovered 7,351 geospatial services and 194,003 datasets, which is 3.8 to 47.5 times more than users can find with existing approaches. The crawling-level distribution of discovered resources indicates that Google search provides good seeds for discovering resources efficiently; however, the deeper we crawl, the more unnecessary effort we spend. Based on the proposed solution, we built the GeoWeb search engine prototype, GeoHub. According to the experimental results, the proposed GeoWeb Crawler framework is extensible and scalable enough to provide a comprehensive index of the GeoWeb.
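The extensibility claim rests on pluggable "connectors" that each recognise one resource type; a minimal Python sketch of that idea follows. The probes here are deliberately naive (substring and file-extension checks on a content prefix) and the URL is hypothetical; real connectors would, for example, issue proper OGC GetCapabilities requests.

import requests

# Each connector maps a resource-type name to a probe over the URL and a content snippet
CONNECTORS = {
    "OGC service": lambda url, text: "WMS_Capabilities" in text or "WFS_Capabilities" in text,
    "KML":         lambda url, text: url.lower().endswith((".kml", ".kmz")) or "<kml" in text,
    "Shapefile":   lambda url, text: url.lower().endswith(".shp"),
}

def classify(url):
    # Fetch a small prefix of the resource and ask every registered connector about it
    try:
        text = requests.get(url, timeout=20).text[:4096]
    except requests.RequestException:
        return []
    return [name for name, probe in CONNECTORS.items() if probe(url, text)]

print(classify("https://example.org/geoserver/wms?request=GetCapabilities"))  # hypothetical URL

Adding support for a new resource type then amounts to registering one more entry in CONNECTORS, which is the extensibility property the framework is after.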
APA, Harvard, Vancouver, ISO, and other styles
43

Santos, Nuno Gonçalo Mateus. "Dark Web Module Data Collection." Master's thesis, 2018. http://hdl.handle.net/10316/83543.

Full text
Abstract:
Master's dissertation in Informatics Engineering presented to the Faculty of Sciences and Technology.
This document is the resulting artefact of an internship proposed by Dognaedis, Lda to the University of Coimbra. Dognaedis is a cyber security company that uses the information gathered by the tools at its disposal to protect its clients. There was, however, a void that needed to be filled among the sources of information being monitored: the dark web. This internship was created to fill that void. Its goal is to specify, and implement, a dark web intelligence module for one of the company's products, Portolan. The module crawls websites "hidden" in anonymity networks and extracts intelligence from them, in order to extend the platform's sources of information. In this document the reader will find the research work, comprising the state of the art in web crawlers and information extractors, which allowed the identification of useful techniques and technologies. The specification of the solution is also presented, including requirement analysis and architectural design: the proposed functionalities, the final architecture, and the reasons behind the decisions that were made. The reader will also find a description of the development methodology that was followed and of the implementation itself, exposing the functionalities of the module and how they were achieved. Finally, the validation process is explained, which was conducted to ensure that the final product matched the specification.
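Crawling hidden services generally means routing requests through an anonymity network; as a hedged illustration (not the module described in the dissertation), the Python sketch below fetches a page through a local Tor SOCKS proxy, assuming Tor is listening on 127.0.0.1:9050 and the requests[socks] extra is installed.

import requests
from bs4 import BeautifulSoup

# Assumption: a local Tor client exposes a SOCKS proxy on 127.0.0.1:9050
TOR_PROXY = {"http": "socks5h://127.0.0.1:9050",
             "https": "socks5h://127.0.0.1:9050"}

def fetch_hidden_service(onion_url):
    # socks5h makes DNS resolution happen inside Tor, so .onion hosts resolve
    resp = requests.get(onion_url, proxies=TOR_PROXY, timeout=60)
    soup = BeautifulSoup(resp.text, "html.parser")
    title = soup.title.string if soup.title and soup.title.string else ""
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return title, links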
APA, Harvard, Vancouver, ISO, and other styles
44

Chen, Feng-Kai, and 陳楓凱. "The Design, Development, And Validation Of A Supervised Adaptable Web Crawler." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/47316108869710822073.

Full text
Abstract:
Master's thesis, National Taipei University, Graduate Institute of Information Management, academic year 100.
The web crawling function is an essential component of any automatic information extraction system that needs to trawl web sites for up-to-date information. Researchers have tried different ways to develop a flexible and adaptable web crawler capable of parsing web pages following a set of pre-defined web syntax rules, where the rules may be learned and derived from the target web sites. A universal solution is elusive, since the markup used by web sites is often loose and syntactically incomplete. This research designed, developed, and validated a supervised adaptable web crawler that can derive extraction rules from a web page segment selected by the user. The derived rules are then used by the web crawler to extract the desired information from the website. This supervised rule learning and application scenario makes the information component easier to maintain when the syntax of web pages from a target web site changes. A working web page syntax rule extraction and crawling system written in Java was implemented and tested against two popular citation data web sites. A syntax rule is extracted by highlighting a portion of a web page that the user is interested in; the system generates XML-based web syntax rules, which the crawler uses to extract the desired citation information from the target web sites. If the syntax of the web pages in the target web site changes, the system can detect the change and re-generate most of the correct rules for the crawler to use.
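The thesis's system is written in Java and emits XML-based rules; purely to illustrate the "learn a rule from a highlighted example, then reapply it" idea, here is a small Python sketch that derives a generalised XPath from a user-selected text snippet and reuses it. The sample page and the index-dropping generalisation are assumptions for illustration.

from lxml import html

def rule_from_example(page_html, example_text):
    # Find the element containing the highlighted text and record a generalised
    # XPath (positional indices dropped) as the extraction rule
    tree = html.fromstring(page_html)
    for el in tree.iter():
        if el.text and example_text in el.text:
            path = tree.getroottree().getpath(el)
            return "/".join(step.split("[")[0] for step in path.split("/"))
    return None

def apply_rule(page_html, rule):
    # Reapply the learned rule to a (possibly different) page of the same site
    tree = html.fromstring(page_html)
    return [el.text_content().strip() for el in tree.xpath(rule)]

page = ("<html><body><div class='ref'>"
        "<span>Smith 2010</span><span>Jones 2011</span></div></body></html>")
rule = rule_from_example(page, "Smith 2010")   # e.g. /html/body/div/span
print(rule, apply_rule(page, rule))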
APA, Harvard, Vancouver, ISO, and other styles
45

Lee, Yuan-Chih, and 李元智. "Apply Web Crawler Technology to the Rainfall Prediction of Meteorological Station." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/f257k9.

Full text
Abstract:
Master's thesis, Huafan University, Master's Program in Information Management, academic year 105.
In recent years, extreme rainfall events have occurred frequently in Taiwan, and rainfall characteristics and intensity have changed: area precipitation has intensified, rainfall duration has grown, and accumulative precipitation has increased, so heavy rainfall deserves long-term attention. With the rapid development of information technology and the internet, more and more government agencies release their raw data online in non-proprietary formats for the public to access freely, making the information much more convenient to obtain. As data analysis and data mining technologies improve, big data is developing dramatically in many fields. This thesis uses the big-data analysis platform Spark and the R language to build rainfall models with Decision Tree and Random Forest. The meteorological data come from the Pinglin station of the Central Weather Bureau of the Ministry of Transportation and Communications in Taiwan; the station's rainfall and meteorological factors such as temperature and humidity are collected with a web crawler written in R. The dataset was then preprocessed with data mining techniques, and relevant rules were established to investigate the analysis results of the observed rainfall and other relevant information, in order to support prediction and decision-making for regional rainfall and demonstrate climate information application services. The meteorological data were preprocessed, analysed, and fed to the Random Forest algorithm on R Studio and on the Spark platform. With Random Forest on R Studio, the root mean square errors of the training and test data are 7.585893 and 13.07361, respectively; on the Spark platform they are 7.843388 and 11.35844. From the simulation results, R Studio performs better on the training data and the Spark platform performs better on the test data.
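The modelling step itself is a standard random forest regression with RMSE reported on training and test splits; the thesis does this in R and on Spark, but the same pipeline can be sketched in Python as below, with a hypothetical file of Pinglin observations standing in for the crawled data.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical table of observations: meteorological factors plus observed rainfall
obs = pd.read_csv("pinglin_station.csv")
X, y = obs.drop(columns=["rainfall"]), obs["rainfall"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

model = RandomForestRegressor(n_estimators=500, random_state=1).fit(X_tr, y_tr)
for name, Xs, ys in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
    rmse = np.sqrt(mean_squared_error(ys, model.predict(Xs)))
    print(f"{name} RMSE: {rmse:.3f}")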
APA, Harvard, Vancouver, ISO, and other styles
46

XU, BO-EN, and 徐柏恩. "Automatic Broadcast News System by Web Crawler Based on Raspberry Pi3." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/867n46.

Full text
Abstract:
Master's thesis, Ming Chi University of Technology, Master's Program in Electronic Engineering, academic year 106.
Reading news is a daily habit for many people. With the arrival of the Internet, online news platforms have gradually replaced newspapers as the medium for reading news. Most jobs today rely on computers, and eye strain increases as people become busier at work. In order not to add more eye strain from reading large amounts of news text, an automated news reader system was developed in this study. The system retrieves news from online news platforms, reads it aloud, and helps users select the news content that interests them, so that users do not need to spend effort selecting news. The development platform is the Raspberry Pi 3. The web crawler automated news reader system was developed in Python; it retrieves news provided by online news platforms and categorizes it in an SQLite database. The system records usage and analyzes the results each time, so that the news content provided next time better meets the user's needs. The system also provides a GUI (Graphical User Interface). A user can mount the host on a wall, a refrigerator, a headboard, and so on, to match daily habits. This study aims to help people access the news and information they are interested in, reduce eye strain, and improve the quality of daily life.
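As a rough sketch of the fetch-store-speak pipeline (not the thesis's code), the Python snippet below scrapes headlines from a hypothetical news index page, stores them in SQLite, and reads them aloud with the pyttsx3 text-to-speech library; the URL, CSS selector, and schema are assumptions.

import sqlite3
import requests
from bs4 import BeautifulSoup
import pyttsx3

NEWS_URL = "https://example-news.tw/latest"     # hypothetical news index page

def fetch_headlines():
    soup = BeautifulSoup(requests.get(NEWS_URL, timeout=20).text, "html.parser")
    return [h.get_text(strip=True) for h in soup.select("h2 a")][:10]

def store(headlines):
    # Keep a record of what was read so later runs can rank categories by usage
    conn = sqlite3.connect("news.db")
    conn.execute("CREATE TABLE IF NOT EXISTS news (title TEXT UNIQUE)")
    for title in headlines:
        conn.execute("INSERT OR IGNORE INTO news VALUES (?)", (title,))
    conn.commit()
    conn.close()

def read_aloud(headlines):
    engine = pyttsx3.init()
    for title in headlines:
        engine.say(title)
    engine.runAndWait()

if __name__ == "__main__":
    items = fetch_headlines()
    store(items)
    read_aloud(items)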
APA, Harvard, Vancouver, ISO, and other styles
47

Chang, Yi Min, and 張毅民. "Design and Implementation of a Web Crawler Based on Service Oriented Architecture." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/83890012599656974069.

Full text
Abstract:
Master's thesis, Chang Gung University, Department of Computer Science and Information Engineering, academic year 100.
Since the concept of the World Wide Web was proposed, its content has grown rapidly at an alarming rate. Business models, people's reading habits, and even daily routines have gradually been affected by this large and rich information platform, and search engines have dramatically changed the flow of information; because of the Web's dynamic characteristics, search engines are what make this information effectively usable. Modern search engines are crawler-based, and the quality of a search engine depends largely on the quality of its data collection. The Web Crawler System is responsible for this work, so it is no exaggeration to say that the quality of the Web Crawler System decides the quality of the search engine. Web Crawler Systems fall into two architectures: the Centralized Distributed architecture and the Non-Centralized Distributed architecture. Most modern crawler-based search engines follow the first design, in which most of the work (such as DNS Lookup and URL Filtering) is handled by the Control Center. When the number of downloaded Web Pages grows too large, the Control Center becomes a bottleneck (for example, with URL overlapping), so the other machines in the Web Crawler System are left without assigned work while the Control Center struggles, leaving machines idle and wasting resources. I therefore designed a Web Crawler System based on a Service Oriented Architecture (SOA). The aim is to split the work that the Control Center of a large-scale Web Crawler System would handle into several service modules, lowering the risk that the sub-servers (Slave Servers) sit idle and making effective use of resources; Chapter III of this thesis gives a more detailed description. Chapter IV presents the performance of the implemented Web Crawler Based on Service Oriented Architecture, including statistics on the number of Web Pages the system can retrieve in one day, and tests of the URL Filter Module, which is designed to quickly filter out duplicate URLs and is described in detail in Section 3.3.
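To illustrate the idea of carving the Control Center's duties into standalone service modules, here is a minimal Python sketch of a URL-filter service that slave crawlers could call over HTTP; the endpoint name, in-memory set, and normalisation rule are illustrative assumptions, not the thesis's design.

import hashlib
from urllib.parse import urlsplit, urlunsplit
from flask import Flask, request, jsonify

app = Flask(__name__)
_seen = set()   # a production module would persist this

def normalise(url):
    # Lower-case scheme and host and drop fragments so trivially different URLs collide
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, ""))

@app.route("/filter", methods=["POST"])
def filter_urls():
    # Slave crawlers POST a JSON list of URLs; only the not-yet-seen ones come back
    fresh = []
    for url in request.get_json():
        digest = hashlib.sha1(normalise(url).encode("utf-8")).hexdigest()
        if digest not in _seen:
            _seen.add(digest)
            fresh.append(url)
    return jsonify(fresh)

if __name__ == "__main__":
    app.run(port=5001)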
APA, Harvard, Vancouver, ISO, and other styles
48

Hsu, Sheng-Ming, and 許陞銘. "Utilizing Web Crawler and Artificial Intelligence to Build Automatic Web-based System for Predicting Household Electricity Consumption." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/92vb74.

Full text
Abstract:
Master's thesis, National Taiwan University of Science and Technology, Department of Civil and Construction Engineering, academic year 107.
The development and use of electrical energy give people a convenient and comfortable life. However, people consume a large amount of unnecessary energy to increase comfort, contributing to an energy crisis and global warming and damaging ecosystems, and the world is actively promoting energy saving and carbon reduction to alleviate this problem. Residential electricity accounts for about 20% of Taiwan's total electricity consumption and has greater demand elasticity than industrial and commercial electricity, representing high energy-saving potential. This study aims to assist the government in formulating energy conservation policies. Additionally, the Taiwan Power Company and the green energy industry, both operated by the government, need the smart grid to understand the state of electricity consumption in order to facilitate distribution, and the public can use this platform to supervise the implementation of energy conservation plans. Accordingly, this investigation establishes an automated web-based platform providing information on residential electricity consumption in each county and city. After a literature review, data were collected from 20 counties and cities for each month over a period of 72 months. The data included 17 influence factors, with monthly residential electricity consumption as the dependent variable. Data mining technology was employed to forecast future residential electricity demand. The forecasting models adopted were (1) linear regression, (2) classification and regression trees, (3) support vector machines/regression, (4) artificial neural networks, (5) a voting ensemble, and (6) a bagging ensemble. Bagging-ANNs achieved the best performance among the tested models. A nature-inspired optimization method, PSO, was then applied to enhance the accuracy and stability of the Bagging-ANNs, yielding a hybrid ensemble model, PSO-Bagging-ANNs. The correlation coefficient between predicted and actual values was 0.99; the mean absolute error was 2,059,993 kWh; the root mean square error was 5,311,887 kWh; and the mean absolute percentage error was 1.17%. The average monthly residential electricity consumption in the data is about 200,000,000 kWh, so the MAE corresponds to an error of roughly 1%. The evaluation indicators show that the proposed model is accurate and provides effective reference information. An automatic web-based system based on this model, combined with a web crawler and scheduled to run automatically, provides information on monthly residential electricity consumption in each county and city.
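As a simplified stand-in for the Bagging-ANNs component (the PSO tuning step is omitted), the Python sketch below bags multi-layer perceptrons with scikit-learn and reports the same three error metrics; the file name, column names, and layer sizes are hypothetical.

import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical table: one row per county/city per month, 17 influence factors
# plus the monthly residential electricity consumption as the target.
data = pd.read_csv("residential_electricity.csv")
X, y = data.drop(columns=["consumption_kwh"]), data["consumption_kwh"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Bagged multi-layer perceptrons, a simplified stand-in for the thesis's Bagging-ANNs
model = BaggingRegressor(MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000),
                         n_estimators=10, random_state=0).fit(X_tr, y_tr)

pred = model.predict(X_te)
print("MAE :", mean_absolute_error(y_te, pred))
print("RMSE:", np.sqrt(mean_squared_error(y_te, pred)))
print("MAPE: %.2f%%" % (100 * np.mean(np.abs((y_te - pred) / y_te))))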
APA, Harvard, Vancouver, ISO, and other styles
49

Lin, Meng-chun, and 林盟鈞. "Information Retrieval System Based on Topic Web Crawler to Improve Retrieval Performance in CLIR." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/84369464259544160664.

Full text
Abstract:
Master's thesis, Chaoyang University of Technology, Master's Program in Computer Science and Information Engineering, academic year 99.
The paper describes how to build an efficient topic web crawler and use it to improve the performance of cross-language information retrieval (CLIR). A topic web crawler extracts web pages related to a certain topic; it is built by combining a standard crawler with a relevance classifier. Given some seed URLs, the crawler fetches web pages from the World Wide Web and the relevance classifier judges which pages are relevant; the URLs in the relevant pages are treated as seeds for further retrieval. In this paper, we adopt the topic web crawler as a form of query expansion for CLIR: the crawler extracts candidate query terms from the web pages it collects. We conduct experiments comparing this method to previous work that extracts candidate query terms from Wikipedia to assist CLIR. We also combine these resources for query expansion, i.e., combining the topic web crawler, Wikipedia, and the Okapi BM25 algorithm, to improve the performance of our information retrieval system. We evaluate the CLIR system on the NTCIR-8 IR4QA data set. The experimental results show that query expansion combining resources gives better performance than query expansion from a single resource.
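The crawler-plus-classifier loop described above can be sketched compactly in Python; the term-counting "classifier" below is only a placeholder for whatever relevance model the thesis actually trains, and all thresholds are assumptions.

import requests
from collections import deque
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def relevant(text, topic_terms, threshold=3):
    # Placeholder relevance classifier: count occurrences of topic terms
    text = text.lower()
    return sum(text.count(t) for t in topic_terms) >= threshold

def topic_crawl(seeds, topic_terms, max_pages=50):
    frontier, seen, relevant_pages = deque(seeds), set(seeds), []
    while frontier and len(relevant_pages) < max_pages:
        url = frontier.popleft()
        try:
            html = requests.get(url, timeout=15).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        if relevant(soup.get_text(" "), topic_terms):
            relevant_pages.append(url)
            # only links found on relevant pages become new seeds
            for a in soup.find_all("a", href=True):
                nxt = urljoin(url, a["href"])
                if nxt.startswith("http") and nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return relevant_pages

The terms appearing most often in the returned pages would then serve as candidate expansion terms for the CLIR query.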
APA, Harvard, Vancouver, ISO, and other styles
50

Lee, Yi-Ting, and 李懿庭. "Design and Implementation of Foreclosure Information System based on Web Crawler and Neural Network." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/2qja3g.

Full text
Abstract:
Master's thesis, National Taipei University of Technology, Department of Electronic Engineering, academic year 106.
With the advent of the big data era, more and more AI-related technologies are being applied in daily life. Neural networks can be combined with applications at various levels, such as semantic analysis and image recognition. In this thesis, we apply neural networks to the prediction of foreclosure house prices, which makes price analysis easier. The proposed platform consists of four parts: data collection, neural networks, back end, and front end. The data are collected by controlling the Chrome browser with Selenium WebDriver. For the neural network part, the Keras library is used to set up and train the network. For the back end, the ASP.NET Web API framework handles the connection between the front end and the database. Finally, ReactJS is used to develop the front end. The proposed foreclosure information platform collects data on foreclosure houses and auctions with the web crawler; after rearranging the data, the system feeds the collected information to the neural network for training and price prediction. The information includes the city and district where a house is located, the size of the house, land rights, the date of auction, whether a final walk-through has been done, the maximum and minimum house prices, and the type of land weight. The final results are presented on the website and serve as an index for analyzing house prices.
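A hedged sketch of the two technical pieces named above, Selenium-driven collection and a small Keras regressor, is shown below; the listing URL, CSS selector, feature count, and the placeholder training data are assumptions, and the thesis's actual network and parsing logic are not reproduced.

import numpy as np
from selenium import webdriver
from selenium.webdriver.common.by import By
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

LIST_URL = "https://example-foreclosure.tw/auctions"   # hypothetical listing page

def collect_rows():
    # Drive Chrome to the listing page and pull one text row per auction entry
    driver = webdriver.Chrome()
    driver.get(LIST_URL)
    rows = [r.text for r in driver.find_elements(By.CSS_SELECTOR, "table tr")]
    driver.quit()
    return rows

def build_price_model(n_features):
    # Small fully connected regressor for the auction price, standing in for the
    # thesis's Keras network
    model = Sequential([Dense(64, activation="relu", input_shape=(n_features,)),
                        Dense(32, activation="relu"),
                        Dense(1)])
    model.compile(optimizer="adam", loss="mae")
    return model

# In the real platform, X and y come from parsing collect_rows() into numeric
# features and prices; random placeholders keep this sketch self-contained.
X, y = np.random.rand(100, 8), np.random.rand(100)
build_price_model(8).fit(X, y, epochs=5, verbose=0)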
APA, Harvard, Vancouver, ISO, and other styles
