Dissertations on the topic "Web usage data mining techniques"
Format your source according to APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 dissertations for your research on the topic "Web usage data mining techniques."
Next to every work in the list of references there is an "Add to bibliography" button. Click it, and we will automatically generate a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the scholarly publication as a .pdf file and read its abstract online, whenever these are available in the metadata.
Browse dissertations from a wide variety of disciplines and organize your bibliography correctly.
Khalil, Faten. "Combining web data mining techniques for web page access prediction." University of Southern Queensland, Faculty of Sciences, 2008. http://eprints.usq.edu.au/archive/00004341/.
Nagi, Mohamad. "Integrating Network Analysis and Data Mining Techniques into Effective Framework for Web Mining and Recommendation. A Framework for Web Mining and Recommendation." Thesis, University of Bradford, 2015. http://hdl.handle.net/10454/14200.
Khasawneh, Natheer Yousef. "Toward Better Website Usage: Leveraging Data Mining Techniques and Rough Set Learning to Construct Better-to-use Websites." Akron, OH : University of Akron, 2005. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=akron1120534472.
Повний текст джерела"August, 2005." Title from electronic dissertation title page (viewed 01/14/2006) Advisor, John Durkin; Committee members, John Welch, James Grover, Yueh-Jaw Lin, Yingcai Xiao, Chien-Chung Chan; Department Chair, Alex Jose De Abreu-Garcia; Dean of the College, George Haritos; Dean of the Graduate School, George R. Newkome. Includes bibliographical references.
Ammari, Ahmad N. "Transforming user data into user value by novel mining techniques for extraction of web content, structure and usage patterns : the development and evaluation of new Web mining methods that enhance information retrieval and improve the understanding of users' Web behavior in websites and social blogs." Thesis, University of Bradford, 2010. http://hdl.handle.net/10454/5269.
Norguet, Jean-Pierre. "Semantic analysis in web usage mining." Doctoral thesis, Universite Libre de Bruxelles, 2006. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/210890.
Повний текст джерелаIndeed, according to organizations theory, the higher levels in the organizations need summarized and conceptual information to take fast, high-level, and effective decisions. For Web sites, these levels include the organization managers and the Web site chief editors. At these levels, the results produced by Web analytics tools are mostly useless. Indeed, most of these results target Web designers and Web developers. Summary reports like the number of visitors and the number of page views can be of some interest to the organization manager but these results are poor. Finally, page-group and directory hits give the Web site chief editor conceptual results, but these are limited by several problems like page synonymy (several pages contain the same topic), page polysemy (a page contains several topics), page temporality, and page volatility.
Web usage mining research projects on their part have mostly left aside Web analytics and its limitations and have focused on other research paths. Examples of these paths are usage pattern analysis, personalization, system improvement, site structure modification, marketing business intelligence, and usage characterization. A potential contribution to Web analytics can be found in research about reverse clustering analysis, a technique based on self-organizing feature maps. This technique integrates Web usage mining and Web content mining in order to rank the Web site pages according to an original popularity score. However, the algorithm is not scalable and does not answer the page-polysemy, page-synonymy, page-temporality, and page-volatility problems. As a consequence, these approaches fail at delivering summarized and conceptual results.
An interesting attempt to obtain such results has been the Information Scent algorithm, which produces a list of term vectors representing the visitors' needs. These vectors provide a semantic representation of the visitors' needs and can be easily interpreted. Unfortunately, the results suffer from term polysemy and term synonymy, are visit-centric rather than site-centric, and are not scalable to produce. Finally, according to a recent survey, no Web usage mining research project has proposed a satisfying solution to provide site-wide summarized and conceptual audience metrics.
In this dissertation, we present our solution to the need for summarized and conceptual audience metrics in Web analytics. We first describe several methods for mining the Web pages output by Web servers. These methods include content journaling, script parsing, server monitoring, network monitoring, and client-side mining. These techniques can be used alone or in combination to mine the Web pages output by any Web site. The occurrences of taxonomy terms in these pages can then be aggregated to provide concept-based audience metrics. To evaluate the results, we implement a prototype and run a number of test cases with real Web sites.
According to the first experiments with our prototype and SQL Server OLAP Analysis Service, concept-based metrics prove to be far more summarized and intuitive than page-based metrics. As a consequence, concept-based metrics can be exploited at higher levels in the organization. For example, organization managers can redefine the organization strategy according to the visitors' interests. Concept-based metrics also give an intuitive view of the messages delivered through the Web site and make it possible to align the Web site communication with the organization objectives. The Web site chief editor, for their part, can interpret the metrics to redefine the publishing orders and the sub-editors' writing tasks. As decisions at higher levels in the organization should be more effective, concept-based metrics should contribute significantly to Web usage mining and Web analytics.
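The aggregation step Norguet describes, counting taxonomy-term occurrences in mined pages and rolling them up into concept-level metrics, can be sketched as follows. The dictionary structures and the view-count weighting are illustrative assumptions, not the dissertation's actual implementation:

```python
from collections import Counter

def concept_metrics(page_texts, page_views, taxonomy):
    """Aggregate taxonomy-term occurrences into concept-level audience metrics,
    weighting each page's term counts by its number of views (illustrative)."""
    scores = Counter()
    for page, text in page_texts.items():
        words = text.lower().split()
        views = page_views.get(page, 0)
        for concept, terms in taxonomy.items():
            hits = sum(words.count(term) for term in terms)
            scores[concept] += hits * views  # weight occurrences by audience size
    return dict(scores)
```

A manager-level report then ranks concepts instead of individual URLs, which is exactly the summarization the abstract argues for.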
Doctorate in applied sciences
Sobolewska, Katarzyna-Ewa. "Web links utility assessment using data mining techniques." Thesis, Blekinge Tekniska Högskola, Avdelningen för programvarusystem, 2006. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-2936.
Bayir, Murat Ali. "A New Reactive Method For Processing Web Usage Data." Master's thesis, METU, 2007. http://etd.lib.metu.edu.tr/upload/12607323/index.pdf.
In this thesis, a new reactive session reconstruction method called 'Smart-SRA' is introduced. Web usage mining is a type of web mining which exploits data mining techniques to discover valuable information from the navigations of Web users. As in classical data mining, data processing and pattern discovery are the main issues in web usage mining. The first phase of web usage mining is the data processing phase, which includes session reconstruction. Session reconstruction is the most important task of web usage mining since it directly and significantly affects the quality of the frequent patterns extracted in the final step. Session reconstruction methods can be classified into two categories, 'reactive' and 'proactive', with respect to the data source and the data processing time. If the user requests are processed after the server handles them, the technique is called 'reactive', while in 'proactive' strategies this processing occurs during the user's interactive browsing of the web site. Smart-SRA is a reactive session reconstruction technique which uses web log data and the site topology. In order to compare Smart-SRA with previous reactive methods, a web agent simulator has been developed. Our agent simulator models the behavior of web users and generates web user navigations as well as the log data kept by the web server. In this way, the actual user sessions are known and the success of different techniques can be compared. In this thesis, it is shown that the sessions generated by Smart-SRA are more accurate than the sessions constructed by previous heuristics.
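The session reconstruction idea can be illustrated with a toy reactive heuristic that combines an idle-time threshold with a site-topology check. This is a simplified sketch of the general approach, not the actual Smart-SRA algorithm:

```python
def reconstruct_sessions(requests, links, gap=1800):
    """Split one user's request stream into sessions (simplified reactive sketch).

    requests: [(timestamp, page)] sorted by time; links: {page: set of successors}.
    A request starts a new session when the idle gap is exceeded or the page is
    not reachable from any page already in the current session (topology check).
    """
    sessions = []  # each element: (timestamp of last request, list of pages)
    for ts, page in requests:
        if sessions:
            last_ts, pages = sessions[-1]
            reachable = any(page in links.get(p, set()) for p in pages)
            if ts - last_ts <= gap and reachable:
                pages.append(page)
                sessions[-1] = (ts, pages)
                continue
        sessions.append((ts, [page]))
    return [pages for _, pages in sessions]
```

Real reconstruction heuristics also consult referrer fields and handle caching effects; the sketch only shows why topology information helps separate interleaved sessions.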
Wu, Hao-cun, and 吳浩存. "A multidimensional data model for monitoring web usage and optimizing website topology." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2004. http://hub.hku.hk/bib/B29528215.
Özakar, Belgin Püskülcü Halis. "Finding And Evaluating Patterns In Wes Repository Using Database Technology And Data Mining Algorithms/." [s.l.]: [s.n.], 2002. http://library.iyte.edu.tr/tezler/master/bilgisayaryazilimi/T000130.pdf.
Karlsson, Sophie. "Datainsamling med Web Usage Mining : Lagringsstrategier för loggning av serverdata." Thesis, Högskolan i Skövde, Institutionen för informationsteknologi, 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:his:diva-9467.
The complexity of web applications and the number of advanced services keep increasing. Logging can increase the understanding of users' behavior and needs, but it is often used excessively, without yielding relevant information. More advanced systems bring increased performance requirements, and logging becomes even more demanding for them. There is a need for smarter systems, for development of performance-improvement techniques, and for better data collection techniques. This work investigates how response times are affected when logging server data, as in the data collection phase of web usage mining, depending on the storage strategy used. The hypothesis is that logging may degrade response times even further. An experiment was conducted in which four different storage strategies, with different table and database structures, were used to store server data, to see which strategy affects response times the least. ANOVA shows a statistically significant difference between the storage strategies. Storage strategy 4 has the best effect on average response time, while storage strategy 2 has the most negative effect. Future work would be valuable for strengthening these results.
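The experimental setup, timing the same stream of log inserts under different storage strategies, can be sketched with SQLite. The two schemas below are hypothetical stand-ins, not the four strategies from the thesis:

```python
import sqlite3
import time

def time_strategy(setup_sql, insert_fn, n=2000):
    """Return mean per-request logging time for one storage strategy (illustrative)."""
    con = sqlite3.connect(":memory:")
    con.executescript(setup_sql)
    t0 = time.perf_counter()
    for i in range(n):
        insert_fn(con, i)
    con.commit()
    return (time.perf_counter() - t0) / n

# Strategy A: one denormalized log table (hypothetical schema).
wide = "CREATE TABLE log(ts REAL, url TEXT, agent TEXT);"
def insert_wide(con, i):
    con.execute("INSERT INTO log VALUES (?, ?, ?)", (time.time(), f"/p/{i}", "UA"))

# Strategy B: normalized tables, each URL stored once (hypothetical schema).
norm = ("CREATE TABLE urls(id INTEGER PRIMARY KEY, url TEXT UNIQUE);"
        "CREATE TABLE hits(ts REAL, url_id INTEGER);")
def insert_norm(con, i):
    url = f"/p/{i % 50}"
    con.execute("INSERT OR IGNORE INTO urls(url) VALUES (?)", (url,))
    (url_id,) = con.execute("SELECT id FROM urls WHERE url=?", (url,)).fetchone()
    con.execute("INSERT INTO hits VALUES (?, ?)", (time.time(), url_id))
```

Running `time_strategy(wide, insert_wide)` against `time_strategy(norm, insert_norm)` over many repetitions yields the per-strategy samples that an ANOVA, as in the thesis, would then compare.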
Shun, Yeuk Kiu. "Web mining from client side user activity log /." View Abstract or Full-Text, 2002. http://library.ust.hk/cgi/db/thesis.pl?COMP%202002%20SHUN.
Includes bibliographical references (leaves 85-90). Also available in electronic version. Access restricted to campus users.
Wang, Hui. "Mining novel Web user behavior models for access prediction /." View Abstract or Full-Text, 2003. http://library.ust.hk/cgi/db/thesis.pl?COMP%202003%20WANG.
Includes bibliographical references (leaves 83-91). Also available in electronic version. Access restricted to campus users.
Zhao, Hongkun. "Automatic wrapper generation for the extraction of search result records from search engines." Diss., Online access via UMI:, 2007.
Agarwal, Khushbu. "A partition based approach to approximate tree mining a memory hierarchy perspective /." Columbus, Ohio : Ohio State University, 2008. http://rave.ohiolink.edu/etdc/view?acc%5Fnum=osu1196284256.
Färholt, Fredric. "Less Detectable Web Scraping Techniques." Thesis, Linnéuniversitetet, Institutionen för datavetenskap och medieteknik (DM), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:lnu:diva-104887.
Web scraping is an efficient way to collect data, and it has become an activity that is easy to carry out with a high chance of success. Users no longer need to be technology enthusiasts to scrape data; today there are plenty of easy-to-use platform services. This study runs experiments to see how one can scrape undetected using a popular and intelligent JavaScript library (Puppeteer). Three web-scraping algorithms, two of which use movement patterns from real web users, demonstrate how information can be collected. The algorithms were run against a website set up for the experiment with perceptible security, a honeypot, and activity logging, which made it possible to collect and evaluate data from both the algorithms and the website. The results show that it may be possible to scrape undetected using Puppeteer. One of the algorithms also reveals the possibility of controlling performance by using Puppeteer's built-in methods.
Vollino, Bruno Winiemko. "Descoberta de perfis de uso de web services." reponame:Biblioteca Digital de Teses e Dissertações da UFRGS, 2013. http://hdl.handle.net/10183/83669.
During the life cycle of a web service, several changes are made to its interface, which are possibly incompatible with current usage and may break client applications. Providers must make decisions about changes to their services, most often without insight into the effect these changes will have on their customers. Existing research and tools fail to provide providers with proper knowledge about the actual usage of the service interface's features by distinct types of customers, making it impossible to assess the actual impact of changes. This work presents a framework for the discovery of web service usage profiles, which constitute a descriptive model of the usage patterns found in distinct groups of clients with respect to the service interface features they use. The framework supports a user in the process of knowledge discovery over service usage data through semi-automatic and configurable tasks, which assist the preparation and analysis of usage data with the minimum user intervention possible. The framework monitors web service interactions, loads pre-processed usage data into a unified database, and supports the generation of usage profiles. Data mining techniques are used to group clients according to their usage patterns of features, and these groups are used to build service usage profiles. The entire process is configured via parameters, which allow the user to determine the level of detail of the usage information included in the profiles and the criteria for evaluating the similarity between client applications. The proposal is validated through experiments with synthetic data, simulated according to features expected in the use of a real service. The experimental results demonstrate that the proposed framework allows the discovery of useful service usage profiles and provides evidence about the proper parameterization of the framework.
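The grouping step, clustering client applications by how often they use each interface feature, can be sketched with a tiny k-means over usage vectors. This is a generic stand-in for the framework's mining step, not its actual code:

```python
import random
from math import dist

def kmeans(points, k, iters=20, seed=0):
    """Tiny k-means: group clients represented as feature-usage vectors.

    points: list of equal-length tuples (e.g. calls per operation per client).
    Returns k lists of points. Illustrative only; real frameworks would use a
    mature implementation with better initialization and convergence checks.
    """
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # recompute each center as the mean of its cluster (keep old if empty)
        centers = [tuple(sum(d) / len(cl) for d in zip(*cl)) if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return clusters
```

Each resulting cluster would then be summarized (for example, by its centroid) into one usage profile describing a distinct type of client.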
Pabarškaitė, Židrina. "Enhancements of pre-processing, analysis and presentation techniques in web log mining." Doctoral thesis, Lithuanian Academic Libraries Network (LABT), 2009. http://vddb.library.lt/obj/LT-eLABa-0001:E.02~2009~D_20090713_142203-05841.
As the Internet penetrates our lives, increasing attention is paid to the quality of information delivery and to how information is presented. The research area of this dissertation is the mining of data accumulated by web servers and ways of improving how this data is presented to the end user. The required knowledge is extracted from web server log records, which register information about the web pages sent to users. The research object is web log mining, together with related topics: improving the web data preparation steps, web text analysis, and data analysis algorithms for prediction and classification tasks. The main goal of the dissertation is to understand the behavior of website users by studying web logs and to improve the methodologies of the preparation, analysis, and result-interpretation stages. The research revealed new possibilities for web data analysis. It was found that insufficient attention had been paid to cleaning web log data. It is shown that reducing the number of insignificant records makes the data analysis process more efficient. A new method was therefore developed whose application makes the presented knowledge correspond to the users' actual navigation routes. The study also found that users' browsing histories have different lengths, so after specific data preparation, forming fixed-length vectors, it is appropriate to apply decision tree algorithms not previously used for this in practice... [see the full text]
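The cleaning step the dissertation emphasizes, dropping log records that carry no navigational meaning before analysis, can be sketched as a simple filter. The extension list and bot markers below are illustrative, not the thesis's actual rules:

```python
import re

IRRELEVANT = re.compile(r"\.(gif|jpg|jpeg|png|css|js|ico)(\?|$)", re.I)
BOT_AGENTS = ("bot", "crawler", "spider")  # illustrative marker substrings

def clean_log(entries):
    """Keep only requests that reflect deliberate user navigation.

    entries: dicts with 'url', 'status', 'agent' keys (assumed log schema).
    """
    kept = []
    for e in entries:
        if IRRELEVANT.search(e["url"]):
            continue  # embedded resources fetched automatically by the browser
        if e["status"] >= 400:
            continue  # failed requests do not reflect user intent
        if any(b in e["agent"].lower() for b in BOT_AGENTS):
            continue  # search-engine robots
        kept.append(e)
    return kept
```

Shrinking the log this way is what makes the later pattern-discovery phase both faster and closer to the users' real routes.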
Villar, Escobar Osvaldo Pablo. "Minería y Personalización de un Sitio Web para Celulares." Tesis, Universidad de Chile, 2007. http://www.repositorio.uchile.cl/handle/2250/104823.
Kliegr, Tomáš. "Clickstream Analysis." Master's thesis, Vysoká škola ekonomická v Praze, 2007. http://www.nusl.cz/ntk/nusl-2065.
Nenadić, Oleg. "An implementation of correspondence analysis in R and its application in the analysis of web usage /." Göttingen : Cuvillier, 2007. http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&doc_number=016229974&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA.
Persson, Pontus. "Identifying Early Usage Patterns That Increase User Retention Rates In A Mobile Web Browser." Thesis, Linköpings universitet, Databas och informationsteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-137793.
Gomes, João Fernando dos Anjos. "Recomendação de navegação em portais da internet como um serviço suportado em ferramentas Web Analytics." Master's thesis, Instituto Politécnico de Setúbal. Escola Superior de Ciências Empresariais, 2016. http://hdl.handle.net/10400.26/17292.
As the Internet usage keeps increasing, the number of web sites and hence the number of web pages also keeps increasing, so there is a need to align the user experience with the overall websites purposes. Toward this requirement, the proposed recommendation systems suggest the user pages that might be of its interest based on past navigation profiles of overall site usage. Most of existing recommendation systems are based on association rules or based on keywords (when content is considered). However, on usage data shortage or sparse data and if sequential order is to be considered such traditional approaches may become unsuitable. Conversely, the Web Analytics arena, assuming other paradigm, has experienced a considerable growth through mature tools that allow the collection and analysis of internet data in order to understand and optimize website efficiency and efficacy. This work proposes the development of a recommendation system based on the Google Analytics tool. The prototype is constituted by two main components which are: 1) a service responsible for the construction and associated logic that underlies recommendations generation; 2) an embeddable library on any website that will furnish website with a configurable recommendation widget. Preliminary evaluations had showed that the implementation follows the logic of the proposed model.
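A minimal version of the underlying idea, recommending pages from past navigation profiles, can be sketched with first-order transition counts. This ignores the Google Analytics integration and the widget component, and is only a conceptual stand-in:

```python
from collections import Counter, defaultdict

def build_model(sessions):
    """Count page-to-page transitions observed in past navigation sessions."""
    model = defaultdict(Counter)
    for s in sessions:
        for cur, nxt in zip(s, s[1:]):
            model[cur][nxt] += 1
    return model

def recommend(model, page, k=2):
    """Suggest the k pages most often visited right after `page`."""
    return [p for p, _ in model[page].most_common(k)]
```

A recommendation widget would call `recommend` with the visitor's current page; sequence-aware approaches, as the abstract notes, become preferable when data is sparse or order matters beyond single transitions.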
Kilic, Sefa. "Clustering Frequent Navigation Patterns From Website Logs Using Ontology And Temporal Information." Master's thesis, METU, 2012. http://etd.lib.metu.edu.tr/upload/12613979/index.pdf.
Vlk, Vladimír. "Získávání znalostí z webových logů." Master's thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2013. http://www.nusl.cz/ntk/nusl-236196.
Mair, Patrick, and Marcus Hudec. "Session Clustering Using Mixtures of Proportional Hazards Models." Department of Statistics and Mathematics, WU Vienna University of Economics and Business, 2008. http://epub.wu.ac.at/598/1/document.pdf.
Series: Research Report Series / Department of Statistics and Mathematics
Suleiman, Iyad. "Integrating data mining and social network techniques into the development of a Web-based adaptive play-based assessment tool for school readiness." Thesis, University of Bradford, 2013. http://hdl.handle.net/10454/7293.
Chen, Xiaowei. "Measurement, analysis and improvement of BitTorrent Darknets." HKBU Institutional Repository, 2013. http://repository.hkbu.edu.hk/etd_ra/1545.
Calderón-Benavides, Liliana. "Unsupervised Identification of the User’s Query Intent in Web Search." Doctoral thesis, Universitat Pompeu Fabra, 2011. http://hdl.handle.net/10803/51299.
This doctoral work focuses on identifying and understanding the intentions that motivate users to search the Web, by applying machine learning methods that require no data beyond the users' own information needs as expressed in their queries. Knowing and interpreting this invaluable information can help Web search systems find particularly relevant resources and thus improve user satisfaction. Using unsupervised learning techniques, selected according to the context of each problem and whose results have proved effective for each of the problems addressed, this work shows not only that users' intents can be identified, but that this can be done automatically. The research developed in this thesis has been an evolutionary process. It starts from the analysis of the manual classification of several sets of queries that real users submitted to a search engine; it moves through the proposal of a new classification of user query intents and the use of different unsupervised learning techniques to identify those intents; and it arrives at establishing that this is not a one-dimensional problem, but should be considered a multi-dimensional one, where each dimension, or facet, helps to clarify and establish the user's intent. Building on this last work, we have created a model to identify user intent in an online scenario.
Song, Ge. "Méthodes parallèles pour le traitement des flux de données continus." Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLC059/document.
We live in a world where a vast amount of data is continuously generated, and it arrives in a variety of ways: every time we search on Google, purchase something on Amazon, click 'like' on Facebook, upload an image on Instagram, or a sensor is activated, new data is produced. Data is more than simple numerical information; it now comes in a variety of forms. Isolated data is of little value, but when this huge amount of data is connected, it becomes very valuable for finding new insights. At the same time, data is time-sensitive: the most accurate and effective way to describe it is as a data stream, and if the latest data is not promptly processed, the opportunity to obtain the most useful results is missed. A parallel and distributed system for processing large amounts of data streams in real time therefore has important research value and good application prospects. This thesis focuses on the study of parallel and continuous data stream joins. We divide this problem into two categories: the first is Data Driven Parallel and Continuous Join, and the second is Query Driven Parallel and Continuous Join.
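A continuous equi-join over two streams can be sketched, single-threaded, with bounded time windows. The parallel and distributed aspects studied in the thesis are out of scope for this illustration:

```python
from collections import deque

class WindowJoin:
    """Symmetric join over two bounded time windows (single-threaded sketch).

    Tuples older than `window` are expired; each arriving tuple is matched by
    key against the opposite window. Real systems partition this state across
    machines, which is the problem the thesis addresses.
    """
    def __init__(self, window):
        self.window = window
        self.left, self.right = deque(), deque()

    def _expire(self, buf, now):
        while buf and now - buf[0][0] > self.window:
            buf.popleft()  # drop tuples that fell out of the time window

    def insert(self, side, ts, key, value):
        own, other = (self.left, self.right) if side == "L" else (self.right, self.left)
        self._expire(own, ts)
        self._expire(other, ts)
        own.append((ts, key, value))
        return [(value, v) for (t, k, v) in other if k == key]
```

Emitting results on every insert is what makes the join continuous: output is produced as the streams flow, never by scanning a finished dataset.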
Van, der Westhuizen Frederick Jacques. "Lifetime value modelling / Frederick Jacques van der Westhuizen." Thesis, North-West University, 2009. http://hdl.handle.net/10394/2521.
Thesis (M.Sc. (Computer Science))--North-West University, Vaal Triangle Campus, 2009.
Castellanos-Paez, Sandra. "Apprentissage de routines pour la prise de décision séquentielle." Thesis, Université Grenoble Alpes (ComUE), 2019. http://www.theses.fr/2019GREAM043.
Intuitively, a system capable of exploiting its past experiences should be able to achieve better performance. One way to build on past experiences is to learn macros (i.e. routines). They can then be used to improve the performance of the solving process of new problems. In automated planning, the challenge remains on developing powerful planning techniques capable of effectively exploring the search space that grows exponentially. Learning macros from previously acquired knowledge has proven to be beneficial for improving a planner's performance. This thesis contributes mainly to the field of automated planning, and it is more specifically related to learning macros for classical planning. We focused on developing a domain-independent learning framework that identifies sequences of actions (even non-adjacent) from past solution plans and selects the most useful routines (i.e. macros), based on a priori evaluation, to enhance the planning domain. First, we studied the possibility of using sequential pattern mining for extracting frequent sequences of actions from past solution plans, and the link between the frequency of a macro and its utility. We found out that the frequency alone may not provide a consistent selection of useful macro-actions (i.e. sequences of actions with constant objects). Second, we discussed the problem of learning macro-operators (i.e. sequences of actions with variable objects) by using classic pattern mining algorithms in planning. Despite the efforts, we find ourselves in a dead-end with the selection process because the pattern mining filtering structures are not adapted to planning. Finally, we provided a novel approach called METEOR, which ensures to find the frequent sequences of operators from a set of plans without a loss of information about their characteristics.
This framework was conceived for mining macro-operators from past solution plans and for selecting the optimal set of macro-operators that maximises the node gain. It has proven able to mine macro-operators of different lengths for four different benchmark domains and, thanks to the selection phase, to deliver a positive impact on the search time without drastically decreasing the quality of the plans.
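The mining step, finding frequent, possibly non-adjacent, ordered pairs of actions across solution plans, can be sketched as follows. Real sequential pattern mining handles longer sequences, and METEOR itself is considerably more involved:

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(plans, min_support):
    """Toy stand-in for sequential pattern mining over solution plans.

    Counts ordered action pairs (gaps allowed, order preserved) and keeps those
    occurring in at least `min_support` plans.
    """
    counts = Counter()
    for plan in plans:
        # combinations() preserves plan order and allows non-adjacent pairs;
        # the set ensures each pair is counted once per plan (plan-level support)
        counts.update(set(combinations(plan, 2)))
    return {pair: c for pair, c in counts.items() if c >= min_support}
```

A surviving pair such as `("load", "unload")` would be the seed of a candidate macro; the thesis's point is that frequency alone is not enough, so a utility-based selection phase must follow.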
Malherbe, Emmanuel. "Standardization of textual data for comprehensive job market analysis." Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLC058/document.
With so many job adverts and candidate profiles available online, e-recruitment constitutes a rich object of study. All this information is however textual data, which from a computational point of view is unstructured. The large number and heterogeneity of recruitment websites also means that there is a lot of vocabularies and nomenclatures. One of the difficulties when dealing with this type of raw textual data is being able to grasp the concepts contained in it, which is the problem of standardization that is tackled in this thesis. The aim of standardization is to create a unified process providing values in a nomenclature. A nomenclature is by definition a finite set of meaningful concepts, which means that the attributes resulting from standardization are a structured representation of the information. Several questions are however raised: Are the websites' structured data usable for a unified standardization? What structure of nomenclature is the best suited for standardization, and how to leverage it? Is it possible to automatically build such a nomenclature from scratch, or to manage the standardization process without one? To illustrate the various obstacles of standardization, the examples we are going to study include the inference of the skills or the category of a job advert, or the level of training of a candidate profile. One of the challenges of e-recruitment is that the concepts are continuously evolving, which means that the standardization must be up-to-date with job market trends. In light of this, we will propose a set of machine learning models that require minimal supervision and can easily adapt to the evolution of the nomenclatures. The questions raised found partial answers using Case Based Reasoning, semi-supervised Learning-to-Rank, latent variable models, and leveraging the evolving sources of the semantic web and social media.
The different models proposed have been tested on real-world data before being implemented in an industrial environment. The resulting standardization is at the core of SmartSearch, a project which provides a comprehensive analysis of the job market.
Klinczak, Marjori Naiele Mocelin. "Identificação e propagação de temas em redes sociais." Universidade Tecnológica Federal do Paraná, 2016. http://repositorio.utfpr.edu.br/jspui/handle/1/2304.
Recent years have been marked by the emergence of various social media, from Orkut to Facebook, Twitter, Youtube, Google+ and many others, each offering new features as a way to attract more users. These social media generate a large amount of data which, if processed properly, can be used to identify trends, patterns and changes. The objective of this work is the discovery of the key topics in a social network, characterized as groupings of relevant terms restricted to a particular context, and the study of their evolution over time. For that, procedures based on data mining and text processing are used. First, text preprocessing techniques are applied in order to identify the most relevant terms that appear in the text messages from the social network. Next, classical clustering algorithms (k-means, k-medoids, DBSCAN) and the more recent NMF (Non-negative Matrix Factorization) are used to identify the main themes of these messages, characterized as groupings of relevant terms. The proposal was evaluated on the Twitter network, using tweet datasets from different contexts. The results show the feasibility of the proposal and its applicability to the identification of relevant topics in this social network.
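A crude stand-in for the theme-discovery step (the thesis uses k-means, k-medoids, DBSCAN and NMF) can be sketched with frequent-term labeling. The stop list and the assignment rule are illustrative only:

```python
from collections import Counter, defaultdict

STOP = {"the", "a", "of", "and", "to", "in", "is"}  # illustrative stop list

def dominant_topics(messages, n_terms=2):
    """Pick the most frequent content words as topic labels, then assign each
    message to the first label it mentions. A toy substitute for clustering."""
    words = Counter(w for m in messages for w in m.lower().split() if w not in STOP)
    labels = [w for w, _ in words.most_common(n_terms)]
    groups = defaultdict(list)
    for m in messages:
        for lab in labels:
            if lab in m.lower().split():
                groups[lab].append(m)
                break
    return dict(groups)
```

Proper clustering or NMF would instead group messages by overlapping term distributions, so that a theme is a weighted set of terms rather than a single keyword.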
Nguyen, Hoang Viet Tuan. "Prise en compte de la qualité des données lors de l’extraction et de la sélection d’évolutions dans les séries temporelles de champs de déplacements en imagerie satellitaire." Thesis, Université Grenoble Alpes (ComUE), 2018. http://www.theses.fr/2018GREAA011.
This PhD thesis deals with knowledge discovery from Displacement Field Time Series (DFTS) obtained by satellite imagery. Such series now occupy a central place in the study and monitoring of natural phenomena such as earthquakes, volcanic eruptions and glacier displacements. These series are rich in both spatial and temporal information and can now be produced regularly at a lower cost thanks to spatial programs such as the European Copernicus program and its famous Sentinel satellites. Our proposals are based on the extraction of grouped frequent sequential patterns. These patterns, originally defined for the extraction of knowledge from Satellite Image Time Series (SITS), have shown their potential in early work on analyzing a DFTS. Nevertheless, they cannot use the confidence indices that come along with DFTS, and the swap method used to select the most promising patterns does not take into account their spatiotemporal complementarities, each pattern being evaluated individually. Our contribution is thus twofold. A first proposal aims to associate a measure of reliability with each pattern by using the confidence indices. This measure allows the selection of patterns whose occurrences in the data are on average sufficiently reliable. We propose a corresponding constraint-based extraction algorithm. It relies on an efficient search for the most reliable occurrences by dynamic programming and on a pruning of the search space provided by a partial push strategy. This new method has been implemented on the basis of the existing prototype SITS-P2miner, developed by the LISTIC and LIRIS laboratories to extract and rank grouped frequent sequential patterns. A second contribution concerns the selection of the most promising patterns. Based on an informational criterion, it makes it possible to take into account both the confidence indices and the way the patterns complement each other spatially and temporally.
For this aim, the confidence indices are interpreted as probabilities, and the DFTS are seen as probabilistic databases whose distributions are only partial. The informational gain associated with a pattern is then defined according to the ability of its occurrences to complete or refine the distributions characterizing the data. On this basis, a heuristic is proposed to select informative and complementary patterns. This method provides a set of weakly redundant patterns that is therefore easier to interpret than those provided by swap randomization. It has been implemented in a dedicated prototype. Both proposals are evaluated quantitatively and qualitatively using a reference DFTS covering Greenland glaciers, constructed from Landsat optical data. Another DFTS that we built from TerraSAR-X radar data covering the Mont-Blanc massif is also used. In addition to being constructed from different data and remote sensing techniques, these series differ drastically in terms of confidence indices, the series covering the Mont-Blanc massif having very low levels of confidence. In both cases, the proposed methods operate under standard conditions of resource consumption (time, space), and experts' knowledge of the studied areas is confirmed and completed.
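The reliability measure of the first proposal can be caricatured as follows: a pattern is kept only if its occurrences are on average sufficiently reliable according to the confidence indices. The data, the threshold and the function names are invented for the demo; the real algorithm searches for the most reliable occurrences by dynamic programming inside a constraint-based extraction.

```python
# Hedged sketch: each pattern occurrence spans measurements that carry
# confidence indices; a pattern is kept if its occurrences are on average
# sufficiently reliable. Toy data and threshold, not from the thesis.

def occurrence_reliability(confidences):
    """Reliability of one occurrence = mean confidence of its measurements."""
    return sum(confidences) / len(confidences)

def pattern_reliability(occurrences):
    """Average reliability over all occurrences of the pattern."""
    return sum(occurrence_reliability(o) for o in occurrences) / len(occurrences)

def select_reliable(patterns, threshold=0.6):
    return [name for name, occs in patterns.items()
            if pattern_reliability(occs) >= threshold]

patterns = {
    "speedup-then-slowdown": [[0.9, 0.8], [0.7, 0.6]],   # mean 0.75
    "noisy-drift":           [[0.3, 0.2], [0.5, 0.4]],   # mean 0.35
}
print(select_reliable(patterns))  # -> ['speedup-then-slowdown']
```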
Aleksandrova, Marharyta. "Factorisation de matrices et analyse de contraste pour la recommandation." Thesis, Université de Lorraine, 2017. http://www.theses.fr/2017LORR0080/document.
In many application areas, data elements can be high-dimensional, which raises the problem of dimensionality reduction. Dimensionality reduction techniques can be classified by their aim (dimensionality reduction for optimal data representation versus dimensionality reduction for classification) as well as by the adopted strategy (feature selection versus feature extraction). The set of features resulting from feature extraction methods is usually uninterpretable. The first scientific problem of the thesis is therefore: how can interpretable latent features be extracted? Dimensionality reduction for classification aims to enhance the classification power of the selected subset of features. We view the task of classification as the task of trigger factor identification, that is, identification of those factors that can influence the transfer of data elements from one class to another. The second scientific problem of this thesis is: how can these trigger factors be identified automatically? We address both problems within the recommender systems application domain. We propose to interpret the latent features of matrix factorization-based recommender systems as real users. We design an algorithm for automatic identification of trigger factors based on the concepts of contrast analysis. Through experimental results, we show that the defined patterns can indeed be considered as trigger factors.
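The "latent features as real users" idea can be sketched minimally: after a factorization R ≈ WH, pick for each latent feature the real user with the highest loading as its interpretable representative. The toy loading matrix below is assumed, not the output of an actual factorization, and the selection rule is a simplification of the thesis's approach.

```python
# Sketch: map each latent feature of a factorized rating matrix to the real
# user whose loading on that feature is highest, making the feature readable.

W = {                         # user -> loadings on 2 latent features (toy)
    "alice": [0.9, 0.1],
    "bob":   [0.2, 0.8],
    "carol": [0.4, 0.5],
}

def representative_users(loadings, n_features):
    reps = []
    for f in range(n_features):
        # the user with the strongest affinity to latent feature f
        reps.append(max(loadings, key=lambda u: loadings[u][f]))
    return reps

print(representative_users(W, 2))  # -> ['alice', 'bob']
```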
Braik, William. "Détection d'évènements complexes dans les flux d'évènements massifs." Thesis, Bordeaux, 2017. http://www.theses.fr/2017BORD0596/document.
Pattern detection over streams of events is gaining more and more attention, especially in the field of eCommerce. Our industrial partner Cdiscount, one of the largest eCommerce companies in France, aims to use pattern detection for real-time customer behavior analysis. The main challenges to consider are efficiency and scalability, as the detection of customer behaviors must be achieved within a few seconds, while millions of unique customers visit the website every day, producing a large event stream. In this thesis, we present Auros, a system for large-scale and efficient pattern detection for eCommerce. It relies on a domain-specific language to define behavior patterns. Patterns are then compiled into deterministic finite automata, which are run on a Big Data streaming platform. Our evaluation shows that our approach is efficient and scalable, and fits the requirements of Cdiscount.
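A toy version of this pipeline, a behavior pattern compiled into a small automaton that is advanced once per incoming event, might look as follows. The event names and the "ordered events with anything in between" semantics are assumptions for the demo; the real system defines patterns in a domain-specific language and runs the automata on a streaming platform.

```python
# Minimal sketch: compile an ordered event pattern into a stateful matcher
# and feed it one event at a time, as a stream processor would.

def compile_pattern(events):
    """Return a matcher: feed(event) -> True when the pattern completes."""
    state = {"pos": 0}
    def feed(event):
        if event == events[state["pos"]]:
            state["pos"] += 1
            if state["pos"] == len(events):
                state["pos"] = 0          # reset to detect further matches
                return True
        return False
    return feed

detect_purchase = compile_pattern(["view_product", "add_to_cart", "checkout"])
stream = ["view_product", "search", "add_to_cart", "view_product", "checkout"]
hits = [e for e in stream if detect_purchase(e)]
print(hits)  # -> ['checkout']: the pattern completes at the checkout event
```

Real deterministic finite automata would precompute a transition table per state and symbol; the closure above keeps only the current position for brevity.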
Ng, Kwun-Keung. "An empirical study of web usage mining techniques." Thesis, 2002. http://spectrum.library.concordia.ca/1805/1/MQ72942.pdf.
"Web mining techniques for query log analysis and expertise retrieval." Thesis, 2009. http://library.cuhk.edu.hk/record=b6075418.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2009.
Includes bibliographical references (leaves 156-175).
Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Abstract also in Chinese.
Xu, Guandong. "Web mining techniques for recommendation and personalization." Thesis, 2008. https://vuir.vu.edu.au/1422/.
Xu, Guandong. "Web mining techniques for recommendation and personalization." 2008. http://eprints.vu.edu.au/1422/1/xu.pdf.
Cavalcanti, Fábio Torres. "Incremental mining techniques." Master's thesis, 2005. http://hdl.handle.net/1822/3965.
The increasing need for organizational data exploration and analysis, seeking new knowledge that may be implicit in operational systems, has given a huge impulse to the study of data mining techniques. This impulse is clearly noticeable in the e-commerce domain, where the analysis of a client's past behaviour is extremely valuable and may bring up important working instruments for determining his future behaviour. It is therefore possible to predict what a Web site visitor might be looking for, and thus restructure the Web site to meet his needs. The visitor then keeps navigating in the Web site longer, which increases the probability of his being attracted by some product, leading to its purchase. To achieve this goal, Web site adaptation has to be fast enough to change while the visitor navigates, and it also has to ensure that the adaptation follows the most recent visitors' navigation behaviour patterns, which requires a mining algorithm with a response time good enough to update the patterns frequently. Typical databases change continuously over time, which can invalidate some patterns or introduce new ones. Conventional data mining techniques have thus proved inefficient, as they need to be re-executed to update the mining results after the latest database changes. Incremental mining techniques emerged to avoid algorithm re-execution and to update mining results when incremental data are added or old data are removed, ensuring a better performance of the data mining processes. In this work, we analyze some existing incremental mining strategies and models, with particular emphasis on their application to Web sites, in order to develop models that discover Web user behaviour patterns and automatically generate recommendations to restructure sites in useful time.
To accomplish this task, we designed and implemented Spottrigger, a system responsible for the whole data life cycle in a Web site restructuring task. This life cycle includes tasks specially oriented to extracting the raw data stored in Web servers, passing these data through intermediate phases of cleansing and preparation, performing an incremental data mining technique to extract users' navigation patterns, and finally suggesting new locations for spots on the Web site according to the patterns found and the profile of the visitor. We applied Spottrigger to our case study, which was based on data gathered from a real online newspaper. Our main goal was to collect, in useful time, information about the users consulting the site at a given moment and thus restructure the Web site in the short term, delivering the scheduled advertisements activated according to the user's profile. Basically, our idea is to have advertisements classified in levels and to restructure the Web site so that higher-level advertisements appear on the pages the visitor will most probably access. To do that, we construct a page ranking for the visitor based on the results obtained through the incremental mining technique. Since visitors' navigation behaviour may change over time, the incremental mining algorithm is responsible for catching these behaviour changes and quickly updating the patterns. Using Spottrigger as a decision support system for advertising, a newspaper company may significantly improve the merchandising of its publicity spots, guaranteeing that a given advertisement reaches a higher number of visitors even if they change their behaviour and visit pages that were usually not visited.
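The incremental principle underlying such a mining step can be shown in miniature: keep the support counts from the previous run and update them with only the newly arrived sessions, instead of re-scanning the whole log. The sessions are toy data and only item pairs are counted; the thesis studies full incremental algorithms.

```python
# Sketch of incremental support counting: old counts are reused and only the
# increment (new sessions) is scanned.

from itertools import combinations
from collections import Counter

def count_pairs(sessions):
    c = Counter()
    for s in sessions:
        for pair in combinations(sorted(set(s)), 2):
            c[pair] += 1
    return c

def incremental_update(old_counts, new_sessions):
    updated = old_counts.copy()
    updated.update(count_pairs(new_sessions))   # touch only the increment
    return updated

old = count_pairs([["home", "sports"], ["home", "news"]])
new = incremental_update(old, [["home", "sports", "news"]])
print(new[("home", "sports")])  # -> 2
```

The same idea extends to deletions (subtracting counts of expired sessions), which is how incremental miners keep patterns aligned with recent behaviour.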
LIU, HUI-YU, and 劉慧瑜. "Application of Data Mining Techniques to a Web-Based Virtual Store." Thesis, 2000. http://ndltd.ncl.edu.tw/handle/10108545208292454970.
Huang, Hsing-Feng, and 黃星峯. "Using Data Mining Techniques to Build Adaptive E-Learning Web Site." Thesis, 2004. http://ndltd.ncl.edu.tw/handle/61829979286470308750.
大同大學
資訊經營學系(所)
92
The majority of e-learning Web sites have predefined course frameworks. No matter who enters the Web site, almost the same link types are offered, and course materials accumulate as time goes by. As a result, learners can get lost in the intricate links between teaching materials. Conklin indicated that 'disorientation' and 'cognitive overhead' are the two prime issues in hypermedia documents [1]. Thus, this thesis focuses on the development of an adaptive model that administers a pre-learning test before enrollment in the course materials, records browsing behavior on the course materials, and administers a post-learning exam at the end of the course. These data are assembled into a data warehouse. We use the data mining techniques of classification and association to analyze the collected data and build a group navigation model and a personal navigation model. Finally, we use these models to predict a learner's personal navigation pattern and give the learner adaptive guidance.
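A group navigation model of the kind this abstract describes can be hedged into a few lines: from past learners' page sequences, estimate which page most often follows the current one and recommend it as guidance. The sessions and page names below are invented, and the real thesis combines this with classification and per-learner personalization.

```python
# Toy group navigation model: count page transitions across learners and
# recommend the most frequent follow-up page.

from collections import Counter, defaultdict

SESSIONS = [
    ["intro", "loops", "quiz"],
    ["intro", "loops", "arrays"],
    ["intro", "loops", "quiz"],
]

def build_model(sessions):
    nxt = defaultdict(Counter)
    for s in sessions:
        for a, b in zip(s, s[1:]):
            nxt[a][b] += 1
    return nxt

def guide(model, page):
    """Recommend the most frequent follow-up page."""
    return model[page].most_common(1)[0][0]

model = build_model(SESSIONS)
print(guide(model, "loops"))  # -> quiz (2 of 3 learners went there next)
```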
Huang, Hsing-Feng, and 黃星峰. "Using Data Mining Techniques to Build Adaptive E-Learning Web Site." Thesis, 2004. http://ndltd.ncl.edu.tw/handle/54807931352614411674.
大同大學
資訊經營研究所
92
The majority of e-learning Web sites have predefined course frameworks. No matter who enters the Web site, almost the same link types are offered, and course materials accumulate as time goes by. As a result, learners can get lost in the intricate links between teaching materials. Conklin indicated that 'disorientation' and 'cognitive overhead' are the two prime issues in hypermedia documents [1]. Thus, this thesis focuses on the development of an adaptive model that administers a pre-learning test before enrollment in the course materials, records browsing behavior on the course materials, and administers a post-learning exam at the end of the course. These data are assembled into a data warehouse. We use the data mining techniques of classification and association to analyze the collected data and build a group navigation model and a personal navigation model. Finally, we use these models to predict a learner's personal navigation pattern and give the learner adaptive guidance.
Hsiao, Ming-Chuan, and 蕭明傳. "Applying Data Mining Techniques in a Web-Based Programming Languages Learning Environment." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/49026280027196878324.
雲林科技大學
資訊管理系碩士班
96
Web-based learning systems accumulate a vast amount of information which is very valuable for analyzing students' behavior; these systems record their interactions with students and the students' study status. We use association rule mining to analyze a web-based programming learning environment and identify patterns between the concepts of the course as reflected in students' exercise behavior. Programming learning focuses on implementation practice, so we classify students' behavior on each exercise into four statuses (fast, slow, finished after class, and failed to finish) and find the associations among all questions in different chapters. We then use a clustering method to separate the questions into clusters according to two attributes: in-class finish rate and average compile count. Finally, we inspect the association rules in each cluster to find patterns of interest, for example: in the high finish-rate cluster, which questions with bad status hinder students' study of another chapter? In the low finish-rate cluster, which questions with good status help students understand other questions? Because much web-based teaching material is designed by tutors, we aim to make the associations hidden between concepts explicit. Analyzing students' behavior in this way helps us understand their learning status and helps tutors improve their tuition.
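The association step can be sketched under assumed toy data: each record lists a student's exercise outcomes, and candidate rules such as "slow on Q2 implies failed on Q5" are scored by support and confidence, as in classic association rule mining. The records and thresholds are invented for illustration.

```python
# Toy support/confidence scoring over exercise-outcome records.

RECORDS = [
    {"Q2:slow", "Q5:failed"},
    {"Q2:slow", "Q5:failed", "Q1:fast"},
    {"Q2:slow", "Q5:finish_after_class"},
    {"Q1:fast", "Q5:failed"},
]

def support(itemset, records):
    """Fraction of records containing every item of the itemset."""
    return sum(itemset <= r for r in records) / len(records)

def confidence(lhs, rhs, records):
    """Of the records matching lhs, the fraction that also match rhs."""
    return support(lhs | rhs, records) / support(lhs, records)

print(support({"Q2:slow"}, RECORDS))                     # -> 0.75
print(confidence({"Q2:slow"}, {"Q5:failed"}, RECORDS))   # 2/3 of them failed Q5
```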
黃釗田. "The Study in TVE Course Querying Web Site Managed by Data Mining Techniques." Thesis, 2000. http://ndltd.ncl.edu.tw/handle/46004054249962008600.
國立臺灣師範大學
工業教育研究所
88
Data Mining focuses on mining large amounts of information and analyzing it with the help of technology, so as to find useful data that is otherwise unknown and hidden. Data Mining is therefore grounded in the database field, and before it is carried out, items such as design and type have to be managed well first. This thesis uses the database management methods of Data Mining to set up a study framework for the management of a TVE course querying web site, through information analysis and the web site system. Furthermore, Data Mining is applied to TVE course information, and the final analysis results can serve as a reference for the management and construction of the TVE course querying web site. By now, almost all TVE schools have set up their own web sites, but integration among the schools' course databases is weak. People often need to spend much time analyzing and designing the course web site system, and the results cannot fully meet users' requests when searching for course information. This study takes as its starting point meeting users' requests for querying TVE courses and constructing an educational data warehouse in the future. It matches users to the notion of shopping customers and uses the database management ideas of Data Mining to analyze the "TVE course querying web site", building analysis models that match users to customers, TVE courses to goods, and schools (teachers) to goods names. Finally, it uses prediction and analysis methods to estimate the number of users of the TVE course querying web site, so as to provide references for system designers in managing and building the site, so that its management can meet every user's request.
Hsieh, Yu-Chun, and 謝宇俊. "Using Data Mining Techniques in Analyzing Patient’s Properties for the High Usage of Medical Resources." Thesis, 2008. http://ndltd.ncl.edu.tw/handle/52668644629159599434.
國立台北護理學院
資訊管理研究所
96
Taiwan's National Health Insurance system was founded in 1995. Nowadays this system faces problems that have also occurred in Germany and Canada, for example a shortage of funding and the abuse of medical resources; the abuse of medical resources is the most serious of them. According to a recent report by the Bureau of National Health Insurance, Taiwanese patients visit a hospital about 15 times per year, whereas in other countries the frequency is 4 to 7 times per year. This situation clearly reveals the problem of wasted medical resources. This study tries to build proper models for analyzing patients' hospital-visiting behavior and to find the corresponding characteristics of patients, making it possible to understand the reasons for the abuse of medical resources. To verify the proposed data mining models, this study used a sampling database provided by the NHIRD (National Health Insurance Research Database) as the experimental data set. Using data warehousing, clustering, neural network and association rule techniques, we propose a conceptual data model that can analyze the visiting behavior of patients and find the profiles of each kind of visiting behavior. The major ideas of the proposed method are threefold. First, we used a data warehouse to build a summarized dataset from the NHIRD based on three dimensions: year, disease, and outpatient/inpatient. Next, we used a self-organizing map (SOM) to classify patients into affinity clusters; the key variables used to classify patients are the "number of hospital visits flag", the "season III behavior variable" and "the amount of money spent". Finally, we used association rules to filter out the profiles of each patient group with high usage of medical resources.
We hope the research results are useful for understanding frequent hospital-visiting behavior and for preventing the abuse of medical resources.
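The SOM step can be sketched with a bare-bones two-unit map over invented patient vectors (annual visits, spend). The deterministic initialization and the data are assumptions for the demo; it mirrors only the idea of clustering patients into affinity groups before mining each group's profile.

```python
# Minimal 1-D self-organizing map with two units over toy patient vectors.

def train_som(data, epochs=50, lr=0.5):
    # two units, deterministically initialized at the first and last record
    units = [list(data[0]), list(data[-1])]
    for _ in range(epochs):
        for x in data:
            best = min(range(len(units)),
                       key=lambda i: sum((u - v) ** 2 for u, v in zip(units[i], x)))
            # move the winning unit toward the sample
            units[best] = [u + lr * (v - u) for u, v in zip(units[best], x)]
    return units

def assign(units, x):
    return min(range(len(units)),
               key=lambda i: sum((u - v) ** 2 for u, v in zip(units[i], x)))

patients = [[40, 9.0], [38, 8.5], [5, 1.0], [6, 1.2]]   # [visits/year, spend]
som = train_som(patients)
labels = [assign(som, p) for p in patients]
print(labels)  # -> [0, 0, 1, 1]: heavy users vs light users
```

A real SOM would use a neighborhood function and a decaying learning rate; association rules would then be mined within each resulting group, as the abstract describes.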
"Improving opinion mining with feature-opinion association and human computation." 2009. http://library.cuhk.edu.hk/record=b5894009.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2009.
Includes bibliographical references (leaves [101]-113).
Abstracts in English and Chinese.
Abstract
Acknowledgement
1 Introduction
1.1 Major Topic
1.1.1 Opinion Mining
1.1.2 Human Computation
1.2 Major Work and Contributions
1.3 Thesis Outline
2 Literature Review
2.1 Opinion Mining
2.1.1 Feature Extraction
2.1.2 Sentiment Analysis
2.2 Social Computing
2.2.1 Social Bookmarking
2.2.2 Social Games
3 Feature-Opinion Association for Sentiment Analysis
3.1 Motivation
3.2 Problem Definition
3.2.1 Definitions
3.3 A Closer Look at the Problem
3.3.1 Discussion
3.4 Proposed Approach
3.4.1 Nearest Opinion Word (DIST)
3.4.2 Co-Occurrence Frequency (COF)
3.4.3 Co-Occurrence Ratio (COR)
3.4.4 Likelihood-Ratio Test (LHR)
3.4.5 Combined Method
3.4.6 Feature-Opinion Association Algorithm
3.4.7 Sentiment Lexicon Expansion
3.5 Evaluation
3.5.1 Corpus Data Set
3.5.2 Test Data Set
3.5.3 Feature-Opinion Association Accuracy
3.6 Summary
4 Social Game for Opinion Mining
4.1 Motivation
4.2 Social Game Model
4.2.1 Definitions
4.2.2 Social Game Problem
4.2.3 Social Game Flow
4.2.4 Answer Extraction Procedure
4.3 Social Game Properties
4.3.1 Type of Information
4.3.2 Game Structure
4.3.3 Verification Method
4.3.4 Game Mechanism
4.3.5 Player Requirement
4.4 Design Guideline
4.5 Opinion Mining Game Design
4.5.1 OpinionMatch
4.5.2 FeatureGuess
4.6 Summary
5 Tag Sentiment Analysis for Social Bookmark Recommendation System
5.1 Motivation
5.2 Problem Statement
5.2.1 Social Bookmarking Model
5.2.2 Social Bookmark Recommendation (SBR) Problem
5.3 Proposed Approach
5.3.1 Social Bookmark Recommendation Framework
5.3.2 Subjective Tag Detection (STD)
5.3.3 Similarity Matrices
5.3.4 User-Website Matrix
5.3.5 User-Tag Matrix
5.3.6 Website-Tag Matrix
5.4 Pearson Correlation Coefficient
5.5 Social Network-based User Similarity
5.6 User-oriented Website Ranking
5.7 Evaluation
5.7.1 Bookmark Data
5.7.2 Social Network
5.7.3 Subjective Tag List
5.7.4 Subjective Tag Detection
5.7.5 Bookmark Recommendation Quality
5.7.6 System Evaluation
5.8 Summary
6 Conclusion and Future Work
A List of Symbols and Notations
B List of Publications
Bibliography
吳保珠. "Using Data Mining Techniques to Discover the Most Adaptive Web Paths on Learning Website." Thesis, 2004. http://ndltd.ncl.edu.tw/handle/42952731174483963154.
南台科技大學
資訊管理系
92
As Internet technology matures and network infrastructure is built rapidly, the World Wide Web has accumulated a huge amount of information and driven the transformation to an information-based society. Nowadays, online learning Web sites are popular and widespread, and more and more students browse or use search facilities to learn new knowledge. With the explosive widespread use of the World Wide Web, information on Web sites has been growing at an amazing speed, making it increasingly difficult to search. The World Wide Web does not guarantee an efficient learning environment, and searching for information on Web sites becomes inefficient. There is thus a strong incentive for Web planners to analyze how to let users access information efficiently. At present, most e-learning Web sites log the elapsed browsing time and the visited pages, but unfortunately no further analysis of user behavior is provided. This thesis uses data mining technology to analyze this information and to excavate the most adaptive learning paths for students. We use two mining methods to discover the most adaptive web paths of students on a learning website. First, we efficiently mine association rules from the students' web data and find the most adaptive web paths according to these rules. Moreover, we develop an efficient sequential pattern algorithm to mine maximal large sequences from the students' web sequences, and find the most adaptive web paths according to these maximal large sequences.
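The "maximal large sequences" idea can be illustrated on toy sessions: mine frequent contiguous page paths and keep only the maximal ones, i.e. frequent paths not contained in a longer frequent path. The sessions, support threshold and contiguity assumption are simplifications; the thesis's sequential pattern algorithm is more general and far more efficient.

```python
# Toy maximal frequent-path mining over clickstream sessions.

from collections import Counter

SESSIONS = [
    ["home", "course", "quiz"],
    ["home", "course", "quiz"],
    ["home", "course", "forum"],
]

def frequent_paths(sessions, min_support=2):
    counts = Counter()
    for s in sessions:
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                counts[tuple(s[i:j])] += 1
    return {p for p, c in counts.items() if c >= min_support}

def is_subpath(short, long_):
    return any(long_[i:i + len(short)] == short
               for i in range(len(long_) - len(short) + 1))

def maximal(paths):
    return {p for p in paths
            if not any(p != q and is_subpath(p, q) for q in paths)}

freq = frequent_paths(SESSIONS)
print(sorted(maximal(freq)))  # -> [('home', 'course', 'quiz')]
```

The surviving maximal path is exactly the kind of "most adaptive web path" the abstract proposes to surface to learners.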