Dissertations / Theses on the topic 'Web Crawler'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 dissertations / theses for your research on the topic 'Web Crawler.'
Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.
Browse dissertations / theses on a wide variety of disciplines and organise your bibliography correctly.
Paes, Vinicius de Carvalho. "Crawler de Faces na Web." Repositório Institucional da UNIFEI, 2012. http://repositorio.unifei.edu.br/xmlui/handle/123456789/1099.
The primary focus of this project is to define the basic structure needed for the development and practical application of a face search engine, so as to guarantee searches with appropriate quality parameters.
Nguyen, Qui V. "Enhancing a Web Crawler with Arabic Search." Thesis, Monterey, California: Naval Postgraduate School, 2012.
Ali, Halil. "Effective web crawlers." RMIT University, CS&IT, 2008. http://adt.lib.rmit.edu.au/adt/public/adt-VIT20081127.164414.
Kayisoglu, Altug. "Lokman: A Medical Ontology Based Topical Web Crawler." Master's thesis, METU, 2005. http://etd.lib.metu.edu.tr/upload/2/12606468/index.pdf.
… the "search-on-the-net" problem. An ontology-based web information retrieval system requires a topical web crawler to construct a high-quality document collection. This thesis focuses on implementing a topical web crawler with a medical-domain ontology in order to investigate the advantages of ontological information in web crawling. The crawler is implemented with a Best-First search algorithm, and its design is optimized for the UMLS ontology. The crawler is evaluated with the Harvest Rate and Target Recall metrics and compared to a non-ontology-based Best-First crawler. The test results showed that using the ontology in the crawler's URL selection algorithm improved crawler performance by 76%.
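To make the crawling strategy concrete, the following is a minimal sketch of a Best-First topical crawler of the kind this abstract describes: URLs wait in a priority queue ordered by a relevance score computed against a set of domain terms. The scoring heuristic, the term list and the seed URL are illustrative assumptions, not the thesis's UMLS-based implementation.

```python
import heapq
import re
import urllib.parse
import requests
from bs4 import BeautifulSoup

# Illustrative stand-in for an ontology: a few medical terms with weights.
ONTOLOGY_TERMS = {"disease": 1.0, "therapy": 0.8, "diagnosis": 0.8, "symptom": 0.6}

def relevance(text):
    """Score a page by the weighted frequency of ontology terms (toy heuristic)."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(ONTOLOGY_TERMS.get(w, 0.0) for w in words) / (len(words) + 1)

def best_first_crawl(seed, max_pages=50):
    frontier = [(-1.0, seed)]          # max-heap via negated scores
    seen, harvest = {seed}, []
    while frontier and len(harvest) < max_pages:
        _, url = heapq.heappop(frontier)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        score = relevance(soup.get_text(" "))
        harvest.append((url, score))
        for a in soup.find_all("a", href=True):
            link = urllib.parse.urljoin(url, a["href"])
            if link.startswith("http") and link not in seen:
                seen.add(link)
                # Children inherit the parent's score until they are fetched.
                heapq.heappush(frontier, (-score, link))
    return harvest

if __name__ == "__main__":
    for url, score in best_first_crawl("https://www.example.org/"):
        print(f"{score:.4f}  {url}")
```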
Pandya, Milan. "A Domain Based Approach to Crawl the Hidden Web." Digital Archive @ GSU, 2006. http://digitalarchive.gsu.edu/cs_theses/32.
Koron, Ronald Dean. "Developing a Semantic Web Crawler to Locate OWL Documents." Wright State University / OhioLINK, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=wright1347937844.
Full textStivala, Giada Martina. "Perceptual Web Crawlers." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019.
Choudhary, Suryakant. "M-crawler: Crawling Rich Internet Applications Using Menu Meta-model." Thèse, Université d'Ottawa / University of Ottawa, 2012. http://hdl.handle.net/10393/23118.
Lee, Hsin-Tsang. "IRLbot: design and performance analysis of a large-scale web crawler." Texas A&M University, 2008. http://hdl.handle.net/1969.1/85914.
Karki, Rabin. "Fresh Analysis of Streaming Media Stored on the Web." Digital WPI, 2011. https://digitalcommons.wpi.edu/etd-theses/81.
Englund, Malin, Christian Gullberg, and Jesper Wiklund. "A web crawler to effectively find web shops built with a specific e-commerce plug-in." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-325788.
Nowadays online shopping has become very common: being able to buy things online and have them delivered to the door is something many people find convenient and appealing. With the demand comes the market, and web shops have therefore become a popular channel for companies to sell their items. Companies that want to sell their products to web shops can have a hard time finding potential customers efficiently. This project is an attempt to solve that problem by finding a large number of web shops built with a specific e-commerce plug-in, in this case WooCommerce, in a short amount of time. The solution was to create a web crawler that searches the Internet for web shops. The results of the search are stored in a database from which the user can retrieve information about the web shops found, such as revenue and company name. The approach succeeded in terms of efficiency, but leaves room for improvement in robustness and accuracy.
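As an illustration of how such a plug-in check might look, the sketch below fetches a candidate site and looks for common WooCommerce fingerprints in its HTML; the marker strings and the example domains are assumptions for demonstration, not the project's actual detection rules.

```python
import requests

# Typical fingerprints left in pages by the WooCommerce WordPress plug-in.
WOO_MARKERS = (
    "wp-content/plugins/woocommerce",  # asset paths
    "woocommerce-page",                # body CSS classes
    "wc_add_to_cart_params",           # inlined JS settings object
)

def is_woocommerce_shop(url, timeout=10):
    """Return True if the page shows signs of the WooCommerce plug-in."""
    try:
        html = requests.get(url, timeout=timeout,
                            headers={"User-Agent": "shop-finder/0.1"}).text
    except requests.RequestException:
        return False
    return any(marker in html for marker in WOO_MARKERS)

if __name__ == "__main__":
    for candidate in ["https://example-shop.se", "https://example.org"]:
        print(candidate, is_woocommerce_shop(candidate))
```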
Anttila, Pontus. "Mot effektiv identifiering och insamling avbrutna länkar med hjälp av en spindel." Thesis, Karlstads universitet, Institutionen för matematik och datavetenskap (from 2013), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kau:diva-67842.
Today, the customer has no automated method for finding and collecting broken links on their website; this is done manually or not at all. This project has resulted in a practical product that can be applied to the customer's website. The aim of the product is to ease the work of collecting and maintaining broken links by gathering all broken links efficiently and placing them in a separate list that an administrator can export at will and then fix. The quality of the customer's website will be higher, as broken links become easier to find and remove, which ultimately gives visitors a better experience.
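A minimal sketch of the core idea, collecting links on a site that answer with an HTTP error status, is shown below; the start URL and the simplified handling of redirects are assumptions rather than the product's actual behaviour.

```python
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def find_broken_links(start_url, max_pages=100):
    """Crawl pages on one host and report links that return HTTP >= 400."""
    host = urlparse(start_url).netloc
    queue, visited, broken = [start_url], set(), []
    while queue and len(visited) < max_pages:
        page = queue.pop(0)
        if page in visited:
            continue
        visited.add(page)
        try:
            html = requests.get(page, timeout=10).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(page, a["href"])
            try:
                status = requests.head(link, timeout=10,
                                       allow_redirects=True).status_code
            except requests.RequestException:
                status = None
            if status is None or status >= 400:
                broken.append((page, link, status))
            elif urlparse(link).netloc == host:
                queue.append(link)      # keep crawling inside the same site
    return broken

if __name__ == "__main__":
    for page, link, status in find_broken_links("https://example.org/"):
        print(f"{status}  {link}  (found on {page})")
```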
Desai, Lovekeshkumar. "A Distributed Approach to Crawl Domain Specific Hidden Web." Digital Archive @ GSU, 2007. http://digitalarchive.gsu.edu/cs_theses/47.
Zemlin, Toralf. "Entwurf eines konfigurierbaren Web-Crawler-Frameworks zur weiteren Verwendung für Single-Hosted Media Retrieval." Master's thesis, Universitätsbibliothek Chemnitz, 2008. http://nbn-resolving.de/urn:nbn:de:bsz:ch1-200801338.
Full textZemlin, Toralf Eibl Maximilian. "Entwurf eines konfigurierbaren Web-Crawler-Frameworks zur weiteren Verwendung fur Single-Hosted Media Retrieval." [S.l. : s.n.], 2008.
Moravec, Petr. "Monitoring internetu a jeho přínosy pro podnikání nástroji firmy SAS Institute." Master's thesis, Vysoká škola ekonomická v Praze, 2011. http://www.nusl.cz/ntk/nusl-165263.
Činčera, Jaroslav. "Pokročilý robot na procházení webu." Master's thesis, Vysoké učení technické v Brně, Fakulta informačních technologií, 2010. http://www.nusl.cz/ntk/nusl-237201.
Lloyd, Oskar, and Christoffer Nilsson. "How to Build a Web Scraper for Social Media." Thesis, Malmö universitet, Fakulteten för teknik och samhälle (TS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20594.
Rodrigues, Thiago Gomes. "ARAPONGA: Uma Ferramenta de Apoio a Recuperação de Informação na Web voltado a Segurança de Redes e Sistemas." Universidade Federal de Pernambuco, 2012. https://repositorio.ufpe.br/handle/123456789/11367.
Network and systems security is currently one of the biggest concerns in computing. As the number of computer users increases, so does the number of security incidents. The lack of security-oriented behaviour regarding hardware use, e-mail, or program configuration makes it easier for malicious code to be deployed, and the impact of exploited vulnerabilities and software flaws has grown steadily, causing enormous losses around the world. Disclosing these vulnerabilities together with security best practices has been one of the answers to this problem, because it lets network and system administrators acquire the information they need to mitigate the impact of malicious activity. Noting that publishing security information is one way to fight malicious activity and to reduce the impact of a successful exploit, several organizations have decided to publish this kind of content. These knowledge bases are scattered across different websites, however, so network and system administration teams spend a long time searching for the information needed to solve their problems; moreover, simply exposing the content is not enough to solve them. Based on this scenario, this master's work proposes a Web information retrieval support system focused on network and systems security.
Yu, Liyang. "An Indexation and Discovery Architecture for Semantic Web Services and its Application in Bioinformatics." Digital Archive @ GSU, 2006. http://digitalarchive.gsu.edu/cs_theses/20.
Lat, Radek. "Nástroj pro automatické kategorizování webových stránek." Master's thesis, Vysoké učení technické v Brně, Fakulta informačních technologií, 2014. http://www.nusl.cz/ntk/nusl-236054.
Mfenyana, Sinesihle Ignetious. "Implementation of a facebook crawler for opinion monitoring and trend analysis purposes: a case study of government service delivery in Dwesa." Thesis, University of Fort Hare, 2014. http://hdl.handle.net/10353/d1016067.
Toufik, Bennouas. "Modélisation de parcours du Web et calcul de communautés par émergence." PhD thesis, Université Montpellier II - Sciences et Techniques du Languedoc, 2005. http://tel.archives-ouvertes.fr/tel-00137084.
The first part analyses large interaction networks and introduces a new model of Web crawls. It begins by defining the properties common to interaction networks, then presents some random graph models that generate graphs similar to interaction networks. Finally, it proposes a new model of random crawls.
The second part proposes two models for computing communities by emergence in the Web graph. After a reminder of importance measures, PageRank and HITS, the gravitational model is presented, in which the nodes of a network are mobile and interact with one another through the links between them; communities emerge quickly after a few iterations. The second model is an improvement of the first: each node of the network is given an objective, namely to reach its community.
Matulionis, Paulius. "Veiksmų ontologijos formavimas panaudojant internetinį tekstyną." Master's thesis, Lithuanian Academic Libraries Network (LABT), 2012. http://vddb.laba.lt/obj/LT-eLABa-0001:E.02~2012~D_20120620_113255-46777.
The goal of this master's thesis is to investigate the problem of automated action-ontology design using a corpus harvested from the internet. A software package including tools for internet corpus harvesting, network service access, markup, and ontology design and representation was developed and tested in the experiment carried out. A process management system was realized covering both the front-end and back-end design levels, and detailed system and component models are presented, reflecting all the operations of the system. The thesis presents the results of experiments on building ontologies for several selected action verbs. The ontology building process is described, problems in recognizing separate elements of the action environment are analysed, and suggestions for additional rules leading to more accurate results are presented. The rules have been summarized and integrated into the designed software package.
McLearn, Greg. "Autonomous Cooperating Web Crawlers." Thesis, University of Waterloo, 2002. http://hdl.handle.net/10012/1080.
Silva, Carlos Jesús Hernández da. "Geração automática de conteúdo audiovisual informativo para seniores." Master's thesis, Universidade de Aveiro, 2017. http://hdl.handle.net/10773/22543.
Globally, modern societies are getting older, and their needs and challenges, particularly informational ones, are not being fully met. One of the most important goals to pursue in parallel with social, political, economic and technological evolution is to define strategies for active ageing, at the individual and community level, that enable continuous and meaningful civic participation. The investigation described here concerns the development of an interactive television application, within the +TV4E project, as a vehicle for distributing information about social support services for seniors. The aim is to design and develop a technological solution capable of automatically creating content that meets seniors' information needs about, for instance, social services, economic matters or meteorological data, taking the specific characteristics of the Portuguese senior audience into account. The iTV solution, delivered on a set-top box, is based on an application that enriches the television broadcast with informative content adequate to each set-top box's profile and preferences, such as its location or viewing behaviour. During a television broadcast, and after prior notice, audio-visual informative content about social and public services is shown, composed following a fixed structure and generated automatically from content gathered from different online web services.
Wan, Shengye. "Protecting Web Contents Against Persistent Crawlers." W&M ScholarWorks, 2016. https://scholarworks.wm.edu/etd/1477068008.
Wara, Ummul. "A Framework for Fashion Data Gathering, Hierarchical-Annotation and Analysis for Social Media and Online Shop: Toolkit for Detailed Style Annotations for Enhanced Fashion Recommendation." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-234285.
Given the trend in recommender-systems research, where more and more recommender systems are becoming hybrid and designed for multiple domains, there is a need to produce a social-media dataset that contains detailed information about clothing categories, clothing attributes and user interactions. Current fashion-oriented datasets lack either a hierarchical category structure or user-interaction information from social networks. This project aims to produce two datasets: one collected from the photo-sharing platform Instagram, containing photos, text and user interactions from fashionistas, and one collected from the clothing catalogue offered by the online shop Zalando. We present the design of a web crawler adapted to retrieve data from these domains and optimized for fashion and clothing attributes. We also present an efficient web application designed and implemented to enable annotation of large amounts of Instagram data with very detailed clothing information; by including user interactions in the application, it can provide user-adapted annotation of the data. The web application was evaluated by the developers and through the Amazon Mechanical Turk service, and the data annotated by different users demonstrates its usability. In addition to data collection and the development of the web-based annotation system, the data distributions in the two fashion domains, Instagram and Zalando, were analysed by clothing category to provide data insights. Research in this area can benefit from our results and datasets; in particular, the datasets can be used in domains that require detailed clothing information and user interactions.
Castillejo, Sierra Miguel. "Redes temáticas en la web: estudio de caso de la red temática de la transparencia en Chile." Doctoral thesis, Universitat Pompeu Fabra, 2016. http://hdl.handle.net/10803/378362.
The object of study of this research is Issue Networks: the issue networks that are active on the Internet and their potential for extracting objective data from the opinion flows generated around a topic of discussion or social controversy. The research is founded on four objectives: the characterization of the components of issue networks; the identification, description and evaluation of existing tools for the analysis of issue networks on the Internet; the creation of an Analysis System of Issue Networks on the Internet; and, lastly, the application of the Analysis System to the case study of the Issue Network for Transparency in Chile. In conclusion, we introduce the characteristics of the components of an Issue Network on the Internet (hyperlinks, actors and issue networks); we present the results of the evaluation of the tools we consider most suitable for the analysis of Issue Networks on the Internet (IssueCrawler, SocSciBot, Webometric Analyst and VOSON); we build an analysis system divided into three parts (network analysis of hyperlinks, stakeholder analysis and issue analysis); and finally we discuss the results of the analysis of the Issue Network for Transparency in Chile and possible future developments of the investigation.
Josefsson, Ågren Fredrik, and Oscar Järpehult. "Characterizing the Third-Party Authentication Landscape : A Longitudinal Study of how Identity Providers are Used in Modern Websites." Thesis, Linköpings universitet, Institutionen för datavetenskap, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-178035.
Rude, Howard Nathan. "Intelligent Caching to Mitigate the Impact of Web Robots on Web Servers." Wright State University / OhioLINK, 2016. http://rave.ohiolink.edu/etdc/view?acc_num=wright1482416834896541.
Yan, Hui. "Data analytics and crawl from hidden web databases." Thesis, University of Macau, 2015. http://umaclib3.umac.mo/record=b3335862.
Full textMir, Taheri Seyed Mohammad. "Distributed Crawling of Rich Internet Applications." Thesis, Université d'Ottawa / University of Ottawa, 2015. http://hdl.handle.net/10393/32089.
Full textFerreira, Juliana Sabino. "Uma abordagem para captura automatizada de dados abertos governamentais." Universidade Federal de São Carlos, 2017. https://repositorio.ufscar.br/handle/ufscar/9246.
Open government data currently play an important role in public transparency, besides being required by law. However, most of these data are published in non-standard, isolated and independent formats, which makes them very hard to reuse by third-party system providers. This work proposes an approach for capturing open government data in an automated way, allowing their use in various applications. To that end, a Web Crawler was built to capture and store the open government data, together with an API that exposes the data in JSON format so that developers can easily use them in their applications. We also evaluated the API with developers of different levels of experience.
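To illustrate the overall shape of such a pipeline, the sketch below stores crawled records in SQLite and exposes them as JSON through a small Flask endpoint; the table layout, field names and route are hypothetical placeholders, not the actual DAG Prefeituras API described in the dissertation.

```python
import sqlite3
from flask import Flask, jsonify

DB = "open_gov_data.db"
app = Flask(__name__)

def init_db():
    with sqlite3.connect(DB) as con:
        con.execute("""CREATE TABLE IF NOT EXISTS datasets
                       (city TEXT, title TEXT, url TEXT)""")

def store(records):
    """records: iterable of (city, title, url) tuples produced by the crawler."""
    with sqlite3.connect(DB) as con:
        con.executemany("INSERT INTO datasets VALUES (?, ?, ?)", records)

@app.route("/datasets/<city>")
def datasets(city):
    """Serve the captured open data for one city as JSON."""
    with sqlite3.connect(DB) as con:
        rows = con.execute("SELECT title, url FROM datasets WHERE city = ?",
                           (city,)).fetchall()
    return jsonify([{"title": t, "url": u} for t, u in rows])

if __name__ == "__main__":
    init_db()
    store([("sorocaba", "Budget 2017", "https://example.gov.br/budget.csv")])
    app.run()
```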
Romandini, Nicolò. "Evaluation and implementation of reinforcement learning and pattern recognition algorithms for task automation on web interfaces." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021.
Kemmer, Julian. "Der Sandmann: von E.T.A. Hoffmann bis Freddy Krüger." Thesis, Högskolan Dalarna, Tyska, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:du-31682.
Full textTsai, Jing-Ru, and 蔡京儒. "Combining BDI Agent with Web Crawler." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/81206412170578655278.
National Central University, Department of Computer Science and Information Engineering, ROC academic year 104.
How can we deal with the huge amount of information in this big data era and pick out what is useful for a user? We take the approach of combining a web crawler with a multi-agent system, which is regarded as a suitable way to develop intelligent software. This research uses the Java Agent Development Framework (JADE) as the underlying platform, on top of which the belief-desire-intention (BDI) model is added to give agents the ability to reason. A web crawler is then used to crawl the web page information that a particular agent needs; combining the BDI agent with the web crawler forms our model. The advantage of this approach is that the page search strategy of the BDI agent+Crawler can adjust dynamically to changes in the environment, letting an agent browse web page information much as a real person would.
Lai, Jui-fu, and 賴睿甫. "NUBot, a Client Based Web Crawler." Thesis, 2009. http://ndltd.ncl.edu.tw/handle/41028393316471275818.
National Chung Cheng University, Graduate Institute of Computer Science and Information Engineering, ROC academic year 97.
As the internet grows, a huge number of web pages is created every day, and it becomes a tough challenge to build a search engine that can scale with the growth of the web. In this thesis, we propose a new crawler architecture that lets the search engine crawl the vast space of web pages more efficiently. Under this architecture, we implemented a crawler prototype that enlarges the crawling domain and performs the crawling more effectively. We achieve this goal by distributing the crawlers around the world so that each one crawls web pages close to it, minimizing the access overhead between the crawler and the crawled pages. Each crawler compresses the data and then transfers it to the master servers to reduce bandwidth overhead. The master uses a SeenDB to filter out already-crawled URLs and a DN2IP (domain name to IP) process to resolve the IP for each URL. New URLs are then placed in a URL pool implemented with multiple queues, waiting for their turn to be crawled.
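The sketch below illustrates two of the master-side steps the abstract mentions, a SeenDB-style duplicate filter and DN2IP resolution feeding per-IP queues; the data structures are deliberately simple stand-ins (an in-memory set and dict), not the storage NUBot actually implements.

```python
import socket
import zlib
from collections import defaultdict, deque

seen_urls = set()                      # SeenDB stand-in: already-crawled URLs
url_pool = defaultdict(deque)          # one FIFO queue per resolved IP

def dn2ip(url):
    """DN2IP step: resolve the URL's host name to an IP address."""
    host = url.split("/")[2]
    return socket.gethostbyname(host)

def master_receive(compressed_payload, discovered_urls):
    """Handle one upload from a remote crawler: decompress data, enqueue new URLs."""
    page_data = zlib.decompress(compressed_payload)   # crawlers send compressed pages
    for url in discovered_urls:
        if url in seen_urls:
            continue                                   # SeenDB filter
        seen_urls.add(url)
        try:
            url_pool[dn2ip(url)].append(url)           # group work by target IP
        except socket.gaierror:
            pass                                       # unresolvable host, skip it
    return page_data

if __name__ == "__main__":
    payload = zlib.compress(b"<html>example page</html>")
    master_receive(payload, ["http://example.org/a", "http://example.org/b"])
    print({ip: list(q) for ip, q in url_pool.items()})
```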
Yang, Jian-Xin, and 楊健鑫. "Intelligent Customer Service based on Web Crawler." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/f8dv9d.
Tatung University, Department of Computer Science and Engineering, ROC academic year 107.
Over the past one or two years, smart customer service has gradually become popular: companies, banks, government agencies and others have all introduced smart customer service systems. Together with the increasing adoption of IPv6, 5G is expected to be rolled out within the next decade; by then the Internet will carry an enormous flow of information and become the world's largest encyclopedia. If this encyclopedia can be used as the knowledge base of a customer service system, customer service can be made much smarter. One focus of this research is therefore to reorganize websites in various fields into dialogue trees by means of web crawlers, to serve as domain knowledge bases. Most current smart customer service systems are oriented towards a target knowledge field, such as after-sales service, financial question answering or disease inquiry, so special training is needed for each target area. This study applies Seq2Seq to smart customer service and adds well-established techniques such as the Attention Model and bi-directional LSTM. It aims to build a general-purpose smart customer service system in which the computer learns the grammar of the language directly, rather than the knowledge base of one specific field, so that all sentences in the language can be understood regardless of domain, making it easier for smart customer service to enter people's lives. The Seq2Seq model removes the RNN/LSTM restriction that input and output must have the same length; the Attention Model is applied on top of Seq2Seq, and by adding and improving the context vector the drop in accuracy as sentence length increases is addressed. Finally, bi-directional LSTM is added to improve accuracy on sentences with polysemy. The experiments were divided into four groups. The first group was designed to show that the neural network model used in this study (Seq2Seq + Attention Model + bi-directional LSTM) is superior to other neural networks for natural language processing: the models (LSTM, Seq2Seq, Seq2Seq + Attention Model, and Seq2Seq + Attention Model + bi-directional LSTM) were trained on 5,000 questions from the open-source Chinese corpus [25] and tested on 1,000 questions different from the training material, an output being counted as correct if the intent and entities meet expectations, and the accuracy rates were 63.4%, 69.2%, 76.1% and 82.1%, respectively. The second group compared RasaNLU with this study on target-field knowledge, to verify whether the accuracy of the neural network model in the target field is better than that of a traditional statistics-based machine learning method: 5,000 records from the Taiwan water company were used for training and 1,000 water-related questions for testing, and the correct rates of RasaNLU and this study were 86.4% and 87.1%, respectively. The third group evaluated RasaNLU and this study on question answering beyond the target field, to verify whether the neural network model generalizes better than the traditional statistics-based method.
For this group, 5,000 sentences from eight open-source chat corpora, including mainland Chinese Weibo, Baidu Tieba and Douban as well as Taiwan's PTT Gossiping board, were used as training material, and 1,000 questions drawn from the same corpora but different from the training material were used for testing; the correct rates of RasaNLU and this study were 46.3% and 83.2%, respectively. The last group compared this study with Xiaomi's Xiao Ai, Google Assistant, Siri and Samsung Bixby. The first group of experiments shows that combining Seq2Seq with the Attention Model and bi-directional LSTM does give better accuracy than the other neural networks. The second and third groups show that this study matches traditional machine learning accuracy in the target knowledge field but greatly improves accuracy in the general field, with better scalability in both settings. The fourth group, the comparison with voice assistants on the market, shows that although this study cannot give an accurate answer to fuzzy questions, it has better accuracy in the professional field.
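For readers who want to see the model family in code, below is a minimal Keras sketch of a Seq2Seq network with a bi-directional LSTM encoder and additive attention over the encoder states. The vocabulary size, layer dimensions and the use of Keras itself are illustrative assumptions; this is not the thesis's exact architecture or training setup.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size, embed_dim, units, max_len = 8000, 128, 256  # illustrative sizes
max_len = 30

# Encoder: bi-directional LSTM over the question tokens.
enc_in = layers.Input(shape=(max_len,), name="question_tokens")
enc_emb = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(enc_in)
enc_seq = layers.Bidirectional(layers.LSTM(units, return_sequences=True))(enc_emb)

# Decoder: LSTM over the (teacher-forced) answer tokens.
dec_in = layers.Input(shape=(max_len,), name="answer_tokens")
dec_emb = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(dec_in)
dec_seq = layers.LSTM(2 * units, return_sequences=True)(dec_emb)

# Additive (Bahdanau-style) attention: decoder steps attend over encoder states.
context = layers.AdditiveAttention()([dec_seq, enc_seq])
merged = layers.Concatenate()([dec_seq, context])
probs = layers.TimeDistributed(layers.Dense(vocab_size, activation="softmax"))(merged)

model = Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```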
Huang, Wei-Lin, and 黃威霖. "Constructing Data Visualization Query Systems with Web Crawler." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/wpbt86.
National Kaohsiung Marine University, Graduate Institute of Maritime Information Technology, ROC academic year 106.
Many stock-investment strategies are proposed on the basis of only one or a few stock indices. To address this problem and construct a simple investing strategy, this study combined news, fundamental, technical and chip analysis into a Standard Operating Procedure (SOP). The data were collected with a web crawler written in R from the Taipei Exchange and the Taiwan Stock Exchange, and the proposed method uses technical and chip analysis to filter the stocks to invest in. The results were then recorded in order to compare the return of the proposed method with that of the TAIEX; the return of the proposed method beats that of the TAIEX by approximately 500%. Finally, the study uses Power BI to visualize the results and build a simple investment query system from which investors can obtain an investing recommendation for the next trading day.
Jao, Jui-Chien, and 饒瑞謙. "VulCrawl: Adaptive Entry Point Crawler for Web Vulnerability Scanner." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/vt26t7.
Full textChang, Hao, and 張皓. "GeoWeb Crawler: An Extensible and Scalable Web Crawling Framework for Discovering Geospatial Web Resources." Thesis, 2016. http://ndltd.ncl.edu.tw/handle/3vtz8y.
National Central University, Department of Civil Engineering, ROC academic year 104.
With the advance of World-Wide Web (WWW) technology, people can easily share content on the Web, including geospatial data and web services. As geospatial resources are published at an ever-increasing speed, "big geospatial data management" issues are attracting attention. Among these issues, this research focuses on discovering distributed geospatial resources. Because resources are scattered across the globally distributed WWW, users face difficulties finding the resources they need. While the WWW has Web search engines addressing general resource discovery, we envision that the geospatial Web (GeoWeb) likewise requires GeoWeb search engines so that users can efficiently find GeoWeb resources. To realize a GeoWeb search engine, one of the first steps is to proactively discover GeoWeb resources on the WWW. Hence, in this study we propose the GeoWeb Crawler, an extensible Web crawling framework that can find various types of GeoWeb resources, such as Open Geospatial Consortium (OGC) web services, Keyhole Markup Language (KML) files and ESRI Shapefiles. In addition, to improve the performance of the GeoWeb Crawler, we apply distributed computing in the framework so that it scales horizontally with ease; using 8 machines, we obtained a 13-fold performance improvement in the crawling process. Furthermore, while regular web crawlers are ideal for discovering resources reachable through hyperlinks, the GeoWeb Crawler adds customized connectors to find resources hidden behind open or proprietary web services. The results show that, for 10 targeted open-standard-based resource types and 3 non-open-standard-based resource types, the GeoWeb Crawler discovered 7,351 geospatial services and 194,003 datasets, 3.8 to 47.5 times more than users can find with existing approaches. The crawling-level distribution of discovered resources indicates that Google search provides good seeds for discovering resources efficiently; however, the deeper the levels we crawl, the more unnecessary effort we spend. Based on the proposed solution, we built a GeoWeb search engine prototype, GeoHub. According to the experimental results, the proposed GeoWeb Crawler framework is extensible and scalable enough to provide a comprehensive index of the GeoWeb.
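One concrete piece of such a framework is the connector that checks whether a URL hides an OGC web service. The sketch below probes a candidate endpoint with standard GetCapabilities requests for a few OGC service types; the response check is simplified and the example URL is an assumption, so this is not the GeoWeb Crawler's actual connector code.

```python
import requests

# Standard OGC service types and a protocol version to ask for.
OGC_SERVICES = {"WMS": "1.3.0", "WFS": "2.0.0", "WCS": "2.0.1", "CSW": "2.0.2"}

def probe_ogc_service(base_url, timeout=15):
    """Return the OGC service types that answer a GetCapabilities request."""
    found = []
    for service, version in OGC_SERVICES.items():
        params = {"service": service, "request": "GetCapabilities",
                  "version": version}
        try:
            resp = requests.get(base_url, params=params, timeout=timeout)
        except requests.RequestException:
            continue
        body = resp.text[:2000].lower()
        # A capabilities document names the service in its root element.
        if resp.ok and "capabilities" in body and service.lower() in body:
            found.append(service)
    return found

if __name__ == "__main__":
    print(probe_ogc_service("https://example.org/geoserver/ows"))
```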
Santos, Nuno Gonçalo Mateus. "Dark Web Module Data Collection." Master's thesis, 2018. http://hdl.handle.net/10316/83543.
This document is the resulting artefact of an internship proposed by Dognaedis, Lda to the University of Coimbra. Dognaedis is a cyber security company that uses the information gathered by the tools at its disposal to protect its clients. There was, however, a void that needed to be filled among the sources of information that were monitored: the dark web. This internship was created to fill that void. Its goal is to specify, and implement, a solution for a dark web intelligence module for one of the company's products, Portolan. The goal of this module is to crawl websites "hidden" in anonymity networks and extract intelligence from them, in order to extend the sources of information of the platform. In this document the reader will find the research work, which comprises the state of the art regarding web crawlers and information extractors and allowed the identification of useful techniques and technologies. The specification of the solution is also presented, including requirement analysis and architectural design: the proposed functionalities, the final architecture and the reasons behind the decisions that were made. The reader will also find a description of the development methodology that was followed and of the implementation itself, exposing the functionalities of the module and how they were achieved. Finally, the validation process is explained, which was conducted to ensure that the final product matched the specification.
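A common building block for this kind of module is fetching hidden-service pages through a local Tor SOCKS proxy; the sketch below shows that pattern with the requests library. The proxy port and the onion address are assumptions, and the snippet is not part of the Portolan module itself.

```python
import requests  # needs the SOCKS extra: pip install requests[socks]

# Route HTTP traffic through a locally running Tor client (default SOCKS port 9050).
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h: resolve .onion names via Tor
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_onion(url, timeout=60):
    """Fetch a hidden-service page through Tor and return its HTML, or None."""
    try:
        resp = requests.get(url, proxies=TOR_PROXIES, timeout=timeout)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return None

if __name__ == "__main__":
    html = fetch_onion("http://exampleonionaddress.onion/")  # placeholder address
    print("fetched" if html else "unreachable")
```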
Chen, Feng-Kai, and 陳楓凱. "The Design, Development, And Validation Of A Supervised Adaptable Web Crawler." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/47316108869710822073.
National Taipei University, Graduate Institute of Information Management, ROC academic year 100.
The web crawling function is an essential component of any automatic information extraction system that needs to trawl web sites for up-to-date information. Researchers have tried different ways to develop a flexible, adaptable web crawler capable of parsing web pages according to a set of pre-defined syntax rules, and these rules may be learned and derived from the target web sites. A universal solution is elusive, since the markup used by web sites is often loose and syntactically incomplete. This research designed, developed, and validated a supervised adaptable web crawler that can derive extraction rules from a web page segment selected by the user; the derived rules are then used by the crawler to extract the desired information from the website. This supervised rule learning and application scenario makes the extraction component easier to maintain when the page syntax of a target web site changes. A working syntax-rule extraction and crawling system written in Java was implemented and tested against two popular citation-data web sites. A syntax rule is extracted by highlighting a portion of a web page that the user is interested in; the system generates XML-based syntax rules, which the crawler then uses to extract the desired citation information from the target sites. When the syntax of the target site's pages changes, the system can detect the change and regenerate most of the correct rules for the crawler to use.
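The following sketch shows, in Python rather than the thesis's Java, the general idea of deriving a reusable extraction rule from one user-selected example: the path of tags and classes leading to the selected node is recorded and later replayed as a CSS selector on other pages. The rule format is a simplified stand-in for the thesis's XML-based syntax rules.

```python
from bs4 import BeautifulSoup

def derive_rule(html, selected_text):
    """Build a CSS-selector rule from the element containing the user's selection."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find(string=lambda s: s and selected_text in s)
    if node is None:
        return None
    parts = []
    for parent in node.parents:
        if parent.name in (None, "[document]"):
            break
        cls = "." + ".".join(parent.get("class", [])) if parent.get("class") else ""
        parts.append(parent.name + cls)
    return " > ".join(reversed(parts))     # e.g. "div.record > span.title"

def apply_rule(html, rule):
    """Re-apply a derived rule to a (possibly different) page."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select(rule)]

if __name__ == "__main__":
    page = '<div class="record"><span class="title">An example citation</span></div>'
    rule = derive_rule(page, "An example citation")
    print(rule, "->", apply_rule(page, rule))
```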
Lee, Yuan-Chih, and 李元智. "Apply Web Crawler Technology to the Rainfall Prediction of Meteorological Station." Thesis, 2017. http://ndltd.ncl.edu.tw/handle/f257k9.
Huafan University, Master's Program in Information Management, ROC academic year 105.
In recent years, extreme rainfall events have occurred frequently in Taiwan, and rainfall characteristics and intensity have changed: areal precipitation has intensified, rainfall durations have grown, and accumulated precipitation has increased, so heavy rainfall requires long-term attention. With the rapid development of information technology and the internet, more and more government agencies are releasing the raw data they hold online in non-proprietary formats for free public access, which makes such information much easier to obtain; as data analysis and data mining technology improve, big data applications are developing rapidly in many fields. This thesis uses the big data analysis platform Spark and the R language to build rainfall models with Decision Trees and Random Forests. The meteorological data, such as temperature and humidity, come from the Pinglin station of the Central Weather Bureau of the Ministry of Transportation and Communications in Taiwan and were collected with a web crawler written in R. The dataset was then preprocessed with data mining techniques, and rules were established to relate the observed rainfall to the other factors, providing predictions to support decision-making on regional rainfall. Pre-processing, analysis and the Random Forest computation were run in RStudio and on the Spark platform. For Random Forest in RStudio, the root mean square errors on the training and test data are 7.585893 and 13.07361; on the Spark platform they are 7.843388 and 11.35844, respectively. The results show that RStudio performs better on the training data while Spark performs better on the test data.
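As a compact illustration of the modelling step (shown here with Python's scikit-learn rather than the R/Spark stack the thesis uses), the sketch below fits a random forest to station weather factors and reports the RMSE on training and held-out data; the column names and the CSV file are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical export of crawled Pinglin-station observations.
df = pd.read_csv("pinglin_station.csv")
features = ["temperature", "humidity", "pressure", "wind_speed"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["rainfall"], test_size=0.25, random_state=42)

model = RandomForestRegressor(n_estimators=500, random_state=42)
model.fit(X_train, y_train)

for name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
    rmse = np.sqrt(mean_squared_error(y, model.predict(X)))
    print(f"{name} RMSE: {rmse:.3f}")
```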
Xu, Bo-En, and 徐柏恩. "Automatic Broadcast News System by Web Crawler Based on Raspberry Pi3." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/867n46.
Ming Chi University of Technology, Master's Program in Electronic Engineering, ROC academic year 106.
Reading news is a daily habit for many people, and with the arrival of the Internet, online news platforms have gradually replaced newspapers as the medium through which people read it. Most jobs today rely on computers, and eye strain increases as people become busier at work. To avoid adding to that strain by reading large amounts of news text, this study developed an automated news reader system that retrieves news from online news platforms, reads it aloud, and helps users select the content that interests them so that they do not need to spend effort choosing news themselves. The development platform used in this study is the Raspberry Pi 3. The web crawler based automated news reader was developed in Python; it retrieves news provided by online news platforms and categorizes it in an SQLite database. The system records each session and analyzes the results so that the content provided next time better meets the user's needs, and it offers a GUI (graphical user interface). A user can mount the device on a wall, a refrigerator, a headboard, and so on to match daily habits. The study aims to help people access the news and information they are interested in, reduce eye strain, and improve the quality of daily life.
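A toy version of the pipeline, fetching headlines with a crawler, keeping them in SQLite and handing the unread ones to a text-to-speech engine, might look like the sketch below. The RSS feed URL and the use of feedparser and pyttsx3 are assumptions, not the components the thesis actually used on the Raspberry Pi 3.

```python
import sqlite3
import feedparser   # RSS parsing (assumed source format)
import pyttsx3      # offline text-to-speech engine

DB = "news.db"

def fetch_headlines(feed_url):
    """Crawl a feed and store the headlines in SQLite."""
    with sqlite3.connect(DB) as con:
        con.execute("CREATE TABLE IF NOT EXISTS news "
                    "(title TEXT, link TEXT, read INTEGER DEFAULT 0)")
        for entry in feedparser.parse(feed_url).entries:
            con.execute("INSERT INTO news (title, link) VALUES (?, ?)",
                        (entry.title, entry.link))

def read_unread_aloud(limit=5):
    """Speak the titles of unread items and mark them as read."""
    engine = pyttsx3.init()
    with sqlite3.connect(DB) as con:
        rows = con.execute("SELECT rowid, title FROM news WHERE read = 0 LIMIT ?",
                           (limit,)).fetchall()
        for rowid, title in rows:
            engine.say(title)
            con.execute("UPDATE news SET read = 1 WHERE rowid = ?", (rowid,))
    engine.runAndWait()

if __name__ == "__main__":
    fetch_headlines("https://example-news.tw/rss")   # hypothetical feed
    read_unread_aloud()
```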
Chang, Yi Min, and 張毅民. "Design and Implementation of a Web Crawler Based on Service Oriented Architecture." Thesis, 2012. http://ndltd.ncl.edu.tw/handle/83890012599656974069.
Chang Gung University, Department of Computer Science and Information Engineering, ROC academic year 100.
Since the concept of the World Wide Web was proposed, the information on the network has grown at an alarming rate; business models, reading habits and even daily routines have gradually been shaped by this large and rich information platform, and the flow of information keeps changing dramatically. Because of the Web's dynamic nature, it is search engines that make this information effectively usable. Modern search engines are crawler-based, and the quality of a search engine is largely decided by the quality of its data collection; since the web crawler system is responsible for that work, it is no exaggeration to say that the quality of the crawler system decides the quality of the search engine. Web crawler architectures can be divided into two kinds, centralized distributed and non-centralized distributed, and modern crawler-based search engines are mostly built on the first. In such a design most of the work (such as DNS lookup and URL filtering) is handled by the control center; when the number of downloaded web pages becomes too large, the control center runs into bottlenecks such as overlapping URLs, so the other machines in the crawler system receive no assignments while the control center is stuck, leaving machines idle and wasting resources. I therefore designed a web crawler system based on a service-oriented architecture (SOA). The aim is to simplify a large-scale crawler system by splitting the features the control center is involved in into several separate service modules, lowering the risk that the slave servers sit idle and making resource use effective. Chapter III of this thesis describes the design in more detail. Chapter IV presents the performance of the implemented Web Crawler Based on Service Oriented Architecture, with statistics on how many web pages the system can retrieve in one day; in addition, the URL Filter Module, designed to filter out duplicate URLs quickly, is tested and is described in detail in Section 3.3.
Hsu, Sheng-Ming, and 許陞銘. "Utilizing Web Crawler and Artificial Intelligence to Build Automatic Web-based System for Predicting Household Electricity Consumption." Thesis, 2019. http://ndltd.ncl.edu.tw/handle/92vb74.
National Taiwan University of Science and Technology, Department of Civil and Construction Engineering, ROC academic year 107.
The development and use of electrical energy give people a convenient and comfortable life. However, people consume a large amount of unnecessary energy to increase comfort, contributing to the energy crisis and global warming and damaging ecosystems, and the world is actively promoting energy saving and carbon reduction to alleviate this problem. Residential electricity makes up about 20% of Taiwan's total electricity consumption and has greater elasticity than electricity for industrial and business uses, so it represents high energy-saving potential. This study aims to assist the government in setting the direction of energy conservation policies. In addition, the Taiwan Power Company and the green energy industry, both operated by the government, need the smart grid to understand the state of electricity consumption and facilitate distribution, and the public can use such a platform to supervise the implementation of energy conservation plans. Accordingly, this investigation establishes an automated web-based platform providing information on residential electricity consumption in each county and city. After a literature review, data were collected for 20 counties and cities each month over a period of 72 months; the data include 17 influence factors, with monthly residential electricity consumption as the dependent variable, and data mining was employed to forecast future residential electricity demand. The forecasting methods adopted were (1) linear regression, (2) classification and regression trees, (3) support vector machines/regression, (4) artificial neural networks, (5) the Voting method and (6) the Bagging method, with Bagging-ANNs achieving the best performance among the tested models. A nature-inspired optimization method, PSO, was then applied to enhance the accuracy and stability of Bagging-ANNs, producing a hybrid ensemble model, PSO-Bagging-ANNs. The correlation coefficient between predicted and actual values was 0.99, the mean absolute error was 2,059,993 kWh, the root mean square error was 5,311,887 kWh, and the mean absolute percentage error was 1.17%. Relative to the average monthly consumption of about 200,000,000 kWh, an MAE of roughly two million kWh corresponds to an error of about 1%, so the evaluation indicators show that the proposed model is accurate and provides useful information for reference. Finally, an automatic web-based system was built on this model, combined with a web crawler and scheduled to run automatically, to provide information on monthly residential electricity consumption in each county and city.
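To show the core ensemble idea in code (leaving the PSO hyper-parameter search aside), the sketch below bags several small neural-network regressors with scikit-learn and reports MAE and MAPE; the feature file and column names are placeholders for the 17 influence factors described in the abstract, and scikit-learn is an assumption rather than the thesis's toolset.

```python
import pandas as pd
from sklearn.ensemble import BaggingRegressor          # scikit-learn >= 1.2
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Hypothetical monthly records: 17 influence factors plus the consumption target.
df = pd.read_csv("residential_electricity.csv")
X = df.drop(columns=["consumption_kwh"])
y = df["consumption_kwh"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1)

# Bagging-ANNs: an ensemble of small multilayer perceptrons.
model = BaggingRegressor(
    estimator=MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000),
    n_estimators=10, random_state=1)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("MAE :", mean_absolute_error(y_test, pred))
print("MAPE:", mean_absolute_percentage_error(y_test, pred))
```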
Lin, Meng-chun, and 林盟鈞. "Information Retrieval System Based on Topic Web Crawler to Improve Retrieval Performance in CLIR." Thesis, 2011. http://ndltd.ncl.edu.tw/handle/84369464259544160664.
Chaoyang University of Technology, Master's Program in Computer Science and Information Engineering, ROC academic year 99.
This thesis describes how to build an efficient topic web crawler and use it to improve the performance of cross-language information retrieval (CLIR). A topic web crawler extracts web pages related to a certain topic; it is built by combining a standard crawler with a relevance classifier. Given some seed URLs, the crawler fetches web pages from the World Wide Web and the relevance classifier judges which pages are relevant; the URLs in the relevant pages are then treated as seeds for further retrieval. In this work we adopt the topic web crawler as a source of query expansion for CLIR: the crawler extracts candidate query terms from the crawled web pages. We conduct experiments comparing the method to previous work that extracts candidate query terms from Wikipedia to assist CLIR, and we also combine these resources, that is, the topic web crawler, Wikipedia and the Okapi BM25 algorithm, to improve retrieval performance. We evaluate our CLIR system on the NTCIR-8 IR4QA data set. The experimental results show that query expansion from the combined resources gives better performance than query expansion from a single resource.
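The query-expansion step can be illustrated with a small sketch that ranks candidate terms from pages returned by the topical crawler by frequency and appends the top ones to the original query; the tokenizer and stop-word list are simplifications and not the thesis's actual term-selection method.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def expansion_terms(crawled_pages, query, k=5):
    """Pick the k most frequent non-stopword terms from topically relevant pages."""
    counts = Counter()
    for text in crawled_pages:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOPWORDS and token not in query.lower():
                counts[token] += 1
    return [term for term, _ in counts.most_common(k)]

def expand_query(query, crawled_pages):
    """Append the extracted candidate terms to the original query."""
    return query + " " + " ".join(expansion_terms(crawled_pages, query))

if __name__ == "__main__":
    pages = ["Solar power plants convert sunlight into electricity ...",
             "Photovoltaic panels and solar energy storage systems ..."]
    print(expand_query("solar power", pages))
```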
Lee, Yi-Ting, and 李懿庭. "Design and Implementation of Foreclosure Information System based on Web Crawler and Neural Network." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/2qja3g.
National Taipei University of Technology, Department of Electronic Engineering, ROC academic year 106.
With the advent of the big data era, AI-related technologies are used in more and more areas of daily life, and neural networks can be combined with applications at various levels, such as semantic analysis and image recognition. In this thesis, we apply neural networks to the prediction of foreclosure house prices, which makes the prices easier to analyze. The proposed platform consists of four parts: data collection, neural networks, back end and front end. The data are collected by controlling the Chrome browser with Selenium WebDriver; the Keras library is used to set up and train the network; the ASP.NET Web API framework handles the connection between the front end and the database; and ReactJS is used to build the front end. The proposed foreclosure information platform collects data on foreclosure houses and auctions with the web crawler, rearranges them, and feeds the collected information to the neural network for training and price prediction. The information includes the city and district where a house is located, its size, land rights, the auction date, whether a final walk-through has taken place, the maximum and minimum house prices, and the type of land weight. The final results are presented on the website and serve as an index for analyzing house prices.
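As a sketch of the data-collection part, the snippet below drives Chrome with Selenium WebDriver and pulls text out of listing rows; the target URL and the CSS selectors are hypothetical placeholders, since the auction site's real markup is not reproduced here.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

def collect_foreclosure_rows(url):
    """Open the listing page in Chrome and return the cell text of each result row."""
    driver = webdriver.Chrome()          # requires chromedriver on PATH
    rows = []
    try:
        driver.get(url)
        for row in driver.find_elements(By.CSS_SELECTOR, "table.listing tr"):
            cells = [c.text for c in row.find_elements(By.TAG_NAME, "td")]
            if cells:
                rows.append(cells)
    finally:
        driver.quit()
    return rows

if __name__ == "__main__":
    for record in collect_foreclosure_rows("https://example-foreclosures.tw/list"):
        print(record)
```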