Dissertations / Theses: 'Web document clustering (WDC)'

1

Coquet, Jean. "Étude exhaustive de voies de signalisation de grande taille par clustering des trajectoires et caractérisation par analyse sémantique." Thesis, Rennes 1, 2017. http://www.theses.fr/2017REN1S073/document.

Abstract:

Signaling pathways describe how a cell responds to external stimuli. They are essential to biological processes such as differentiation, proliferation, and apoptosis. Systems biology aims to study these pathways exhaustively using static or dynamic models. In large models, the number of solutions that explain a biological phenomenon (for example, a cell's reaction to a stimulus) can be very high. This thesis first proposes several strategies for grouping these solutions using clustering methods and Formal Concept Analysis, then characterizes the resulting groups with Semantic Web methods. Applied to the signaling network of TGF-beta, an extracellular stimulus that plays an important role in cancer development, these strategies identified five large groups of trajectories, each involved in different biological processes. The thesis then addresses the problem of converting heterogeneous data from different databases into a single formalism, so that the preceding study can be generalized. It proposes a strategy for merging the signaling networks of a database into a single model, which makes it possible to compute all the signaling trajectories of a stimulus.

2

Roussinov, Dmitri G., and Hsinchun Chen. "Document clustering for electronic meetings: an experimental comparison of two techniques." Elsevier, 1999. http://hdl.handle.net/10150/105091.

Abstract:

Artificial Intelligence Lab, Department of MIS, University of Arizona
In this article, we report our implementation and comparison of two text clustering techniques. One is based on Ward's clustering and the other on Kohonen's Self-organizing Maps. We have evaluated how closely clusters produced by a computer resemble those created by human experts. We have also measured the time that it takes for an expert to "clean up" the automatically produced clusters. The technique based on Ward's clustering was found to be more precise. Both techniques have worked equally well in detecting associations between text documents. We used text messages obtained from group brainstorming meetings.
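As an illustration of the first technique named in this abstract, the following is a minimal sketch of Ward's agglomerative clustering applied to short text messages. It is not the authors' implementation: the bag-of-words representation, the function names, and the naive pairwise search are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def bag_of_words(text, vocabulary):
    """Term-count vector of `text` over a fixed vocabulary (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]


def ward_cost(cluster_a, cluster_b):
    """Increase in within-cluster sum of squares if the two clusters merge."""
    (size_a, centroid_a), (size_b, centroid_b) = cluster_a, cluster_b
    squared_distance = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
    return size_a * size_b / (size_a + size_b) * squared_distance


def ward_clustering(vectors, n_clusters):
    """Greedy agglomerative clustering with Ward's merge criterion.

    Returns a sorted list of clusters, each a list of input indices.
    """
    # Map each cluster (tuple of member indices) to (size, centroid).
    clusters = {(i,): (1, list(v)) for i, v in enumerate(vectors)}
    while len(clusters) > n_clusters:
        # Pick the pair whose merge increases the total variance the least.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        (size_a, cen_a), (size_b, cen_b) = clusters.pop(a), clusters.pop(b)
        size = size_a + size_b
        centroid = [(size_a * x + size_b * y) / size for x, y in zip(cen_a, cen_b)]
        clusters[tuple(sorted(a + b))] = (size, centroid)
    return sorted(list(members) for members in clusters)
```

Called on a handful of brainstorming-style messages vectorized with `bag_of_words`, `ward_clustering(vectors, 2)` returns two groups of message indices. The exhaustive pairwise search is only practical for small collections.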

3

Kellou-Menouer, Kenza. "Découverte de schéma pour les données du Web sémantique." Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLV047/document.

Abstract:

An increasing number of linked data sources are published on the Web. However, their schema may be incomplete or missing, and the data do not necessarily conform to the declared schema. This flexibility eases the evolution of the data but makes them harder to exploit. In this thesis, we propose an automatic and incremental approach for discovering the schema of a source from the implicit structure of its data. To complete the description of the discovered types, we also propose an approach for finding the structural patterns (possible versions) of each type. It proceeds online, without downloading or browsing the whole source, which can be expensive or even impossible because sources are queried remotely and may impose access limits on query execution time or on the number of queries. We have also addressed the problem of annotating the types in a schema, which consists in finding a set of labels that capture their meaning. We propose annotation algorithms that derive meaningful labels from reference data sources (external knowledge bases). This approach can be used both to find relevant names for discovered types and to enrich the description of existing types. Finally, we characterize the conformity between the data of a source and the schema that describes them. To this end, we propose a set of quality factors with associated metrics, as well as a schema extension that reflects the heterogeneity among instances of the same type. Both the factors and the schema extension are used to analyze and improve the conformity between a schema and the instances it describes.

4

Zanghi, Hugo. "Approches modèles pour la structuration du web vu comme un graphe." Thesis, Evry-Val d'Essonne, 2010. http://www.theses.fr/2010EVRY0041/document.

Abstract:

The statistical analysis of complex networks is a challenging task, since appropriate statistical models and efficient computational procedures are required to learn the underlying structures. These models assume that the edge values follow a parametric distribution, conditionally on a latent structure that is used to detect connectivity patterns. However, these methods suffer from relatively slow estimation procedures because the dependencies are complex. In this thesis we adapt online (incremental) estimation strategies, originally developed for the EM algorithm, to graph models. In addition to the network data used by these methods, vertex content is sometimes available. We therefore propose clustering algorithms for data sets that can be modeled as a graph whose vertices carry features. Finally, an online Web application based on the Exalead search engine showcases certain aspects of this thesis.

5

Qumsiyeh, Rani Majed. "Easy to Find: Creating Query-Based Multi-Document Summaries to Enhance Web Search." BYU ScholarsArchive, 2011. https://scholarsarchive.byu.edu/etd/2713.

Abstract:

Current web search engines, such as Google, Yahoo!, and Bing, rank the set of documents S retrieved in response to a user query Q and display each document with a title and a snippet, which serves as an abstract of the corresponding document in S. Snippets are meant to help users quickly identify results of interest, if any exist, without browsing through the documents in S. In practice they are less useful than intended, since they (i) often include very similar information and (ii) do not capture the main content of the corresponding documents. Moreover, when the information need expressed in a search query is ambiguous, it is difficult, if not impossible, for a search engine to identify precisely the set of documents that satisfy the user's intended request. Furthermore, a document title retrieved by a web search engine is not always a good indicator of the content of the corresponding document, since it is not always informative. All of these design problems can be addressed by our proposed query-based web informative summarization engine, denoted Q-WISE. Q-WISE clusters the documents in S, which lets users view separate document collections organized by topic, and generates a concise, comprehensive summary for each cluster of documents. Q-WISE also includes a query suggestion module that guides users in formulating a keyword query, which facilitates web search and improves the precision and recall of the search results. Experimental results show that Q-WISE is highly effective and efficient in generating a high-quality summary for each cluster of documents on a specific topic, retrieved in response to a Q-WISE user's query.
The empirical study also shows that Q-WISE's clustering algorithm is highly accurate, labels generated for the clusters are useful and often reflect the topic of the corresponding clustered documents, and the performance of the query suggestion module of Q-WISE is comparable to commercial web search engines.

6

Saoud, Zohra. "Approche robuste pour l’évaluation de la confiance des ressources sur le Web." Thesis, Lyon, 2016. http://www.theses.fr/2016LYSE1331/document.

Abstract:

This computer science thesis belongs to the field of trust management, and more specifically recommender systems. These systems are usually based on users' qualitative or quantitative feedback from interacting with Web resources (e.g., movies, videos, and Web services). Recommender systems face three types of uncertainty, arising from users' ratings, from users' identities, and from variations in the performance of Web resources over time. We propose a robust approach for trust assessment under these uncertainties. The first type of uncertainty concerns users' ratings. It stems from the vulnerability of the system to malicious users who provide biased ratings. To tackle it, we propose a fuzzy model of users' credibility. This model uses a fuzzy clustering technique to distinguish malicious users from strict users, who are usually excluded in existing approaches. The second type of uncertainty concerns user identity: a malicious user can create virtual identities to submit multiple false ratings. To counter this type of attack, known as a Sybil attack, we propose a ratings filtering model based on users' credibility and the trust graph to which they belong. We propose two mechanisms, one for assigning capacities to users and one for selecting the users whose ratings are retained when evaluating trust. The first mechanism reduces the attack capacity of Sybil users. The second chooses paths in the trust graph that contain users with maximum capacities. Both mechanisms use users' credibility as a heuristic. To deal with the uncertainty over the ability of a Web resource to satisfy users' requests, we propose two approaches to Web resource trust assessment, one deterministic and one probabilistic. The first consolidates the collected ratings, taking the raters' credibility into account. The second relies on the theory of probabilistic databases and possible-worlds semantics. Probabilistic databases offer a better representation of the uncertainty underlying users' credibility and, through queries, permit an uncertain assessment of a resource's trust. Finally, we develop the WRTrust (Web Resource Trust) system, which implements our trust assessment approach. We carried out several experiments to evaluate the performance and robustness of our system. The results show that trust quality is significantly improved, as is the system's robustness to false-rating attacks and Sybil attacks.

7

Ghenname, Mérième. "Le web social et le web sémantique pour la recommandation de ressources pédagogiques." Thesis, Saint-Etienne, 2015. http://www.theses.fr/2015STET4015/document.

Abstract:

This research was carried out under joint supervision between two universities: in France, Université Jean Monnet de Saint-Etienne, Hubert Curien laboratory, supervised by Frédérique Laforest, Christophe Gravier, and Julien Subercaze; and in Morocco, Université Mohamed V de Rabat, LeRMA team (ENSIAS), supervised by Rachida Ajhoun and Mounia Abik. Knowledge, education, and learning are major concerns in today's society. Technologies for human learning aim to promote, stimulate, support, and validate the learning process. Our approach explores the opportunities raised by combining the Social Web and the Semantic Web for e-learning. More precisely, we work on enriching learners' profiles from their activities on the Social Web. The Social Web can be an important source of information, as it involves users in the information world and lets them participate in the construction and dissemination of knowledge. We focus on tracking the different types of contributions, activities, and conversations in learners' spontaneous collaborative activities on social networks. The learner profile is based not only on knowledge extracted from the learner's activities on the e-learning system but also on their many activities on social networks. In particular, we propose a methodology for exploiting the hashtags contained in users' writings to automatically generate learners' interests and so enrich their profiles. Hashtags, however, require some processing before they can serve as a source of knowledge about user interests. We define a method to identify the semantics of hashtags and the semantic relationships between the meanings of different hashtags. In addition, we define the concept of a Folksionary: a dictionary of hashtags that, for each hashtag, clusters its definitions into units of meaning. Semantically enriched hashtags are then used to feed the learner's profile so as to personalize recommendations of learning material. The goal is to build a semantic representation of learners' activities and interests on social networks in order to enrich their profiles. We also present our general approach to multidimensional recommendation in an e-learning environment, based on three types of filtering: personalized filtering based on the learner's profile, social filtering based on the learner's activities on social networks, and local filtering based on statistics of the learner's interaction with the system. Our implementation focuses on personalized recommendation of pedagogical resources to learners according to their expectations and profile.

8

Luu, Vinh Trung. "Using event sequence alignment to automatically segment web users for prediction and recommendation." Thesis, Mulhouse, 2016. http://www.theses.fr/2016MULH0098/document.

Abstract:

Website managers collect a large volume of data every day about the visitors who use their services. The aim is to better understand usage and to learn about visitor behavior, so that managers can modify their site or offer visitors personalized content. However, the volume of data collected and the complexity of representing the interactions between visitor and website call for new data mining tools. This thesis explores the application of sequence alignment methods to web usage mining, including user clustering and web page prediction and recommendation. The topic was chosen because online business has developed rapidly and produces a volume of data too large to process manually, while the use of sequence alignment in the field is still limited. In this context, models that rely on sequence alignment methods need to be built and their relevance to user behavior mining assessed empirically. The thesis presents a novel methodological point of view in the area and shows applicable approaches that aim to improve on previous related work. Web usage behavior analysis has been central to a large number of investigations aimed at maintaining the relation between users and web services. Web content providers extract useful information to understand users' needs so that their content can be adapted accordingly. One promising approach is pattern discovery using clustering, which groups users who show similar behavioral characteristics. Our research goal is to cluster users in real time, based on the similarity of their sessions.
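To make the idea of session similarity by sequence alignment concrete, here is a minimal sketch that scores two browsing sessions with Needleman-Wunsch global alignment. The abstract does not say which alignment algorithm or scoring scheme the thesis uses, so the algorithm choice, the `match`, `mismatch`, and `gap` values, and the function names are assumptions for illustration only.

```python
def alignment_score(session_a, session_b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score of two page sequences."""
    # previous[j] holds the best score for aligning the pages of session_a
    # seen so far with the first j pages of session_b.
    previous = [j * gap for j in range(len(session_b) + 1)]
    for i, page_a in enumerate(session_a, start=1):
        current = [i * gap]
        for j, page_b in enumerate(session_b, start=1):
            substitution = previous[j - 1] + (match if page_a == page_b else mismatch)
            deletion = previous[j] + gap
            insertion = current[j - 1] + gap
            current.append(max(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def session_similarity(session_a, session_b):
    """Alignment score divided by the longer session length.

    With the default scores the result lies in [-1, 1]; identical
    sessions score 1.0.
    """
    longest = max(len(session_a), len(session_b))
    if longest == 0:
        return 1.0
    return alignment_score(session_a, session_b) / longest
```

A pairwise matrix of `session_similarity` values could then feed any standard clustering method to group users with similar navigation behavior.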

9

Anderson, James D. "Interactive Visualization of Search Results of Large Document Sets." Wright State University / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=wright1547048073451373.

10

Attiaoui, Dorra. "Belief detection and temporal analysis of experts in question answering communities : case strudy on stack overflow." Thesis, Rennes 1, 2017. http://www.theses.fr/2017REN1S085/document.

Abstract:

The emergence of Web 2.0 has changed the way people seek and obtain information online. Between question answering communities, specialized websites, and social networks, the Web has become one of the most widespread platforms for information exchange and retrieval, and users face a large quantity of information. Question answering communities provide an easy and quick way to get answers on any topic: the user only has to post a question and wait for other members of the community to respond. Anyone posting a question expects accurate and helpful answers. Within these platforms, we want to find experts, the key users who share their knowledge with the other members of the community. Expert detection has become an important task because it helps guarantee the quality of the answers posted. In this thesis, we propose a general measure of expertise based on the theory of belief functions. Also called the mathematical theory of evidence, it is one of the best-known approaches for reasoning under uncertainty, and it lets us handle the uncertainty present in real-world data. To identify experts among the other users of the community, we first identify the features that best describe each individual's behavior. We then develop a model founded on the theory of belief functions to estimate the general expertise of contributors. This measure allows us to classify users and detect the most knowledgeable among them. Once this measure is defined, we study how users' behavior evolves over several months of activity in the community, and we describe how users evolve during their time on the platform. We are also interested in detecting potential experts during the first months after they register. The effectiveness of these approaches is evaluated on real data from Stack Overflow.
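For readers unfamiliar with belief functions, the sketch below shows Dempster's rule of combination, the classical way to fuse two sources of evidence in that theory. The abstract does not state which combination rule or which user features the thesis relies on, so the rule choice, the two-hypothesis frame ("expert" versus "novice"), and the function names are assumptions for illustration.

```python
from itertools import product


def combine(mass_1, mass_2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses to its mass,
    and the masses of each function sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (set_1, m1), (set_2, m2) in product(mass_1.items(), mass_2.items()):
        intersection = set_1 & set_2
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + m1 * m2
        else:
            # Mass assigned to contradictory hypotheses is treated as conflict.
            conflict += m1 * m2
    if not combined:
        raise ValueError("the two sources are in total conflict")
    # Renormalize so the remaining masses sum to 1.
    return {s: m / (1.0 - conflict) for s, m in combined.items()}


def belief(mass, hypothesis):
    """Total mass committed to subsets of `hypothesis`."""
    return sum(m for s, m in mass.items() if s <= hypothesis)
```

For instance, one mass function might come from a user's reputation score and another from their share of accepted answers (both hypothetical features here); `combine` merges them and `belief` reads off how strongly the fused evidence supports the hypothesis that the user is an expert.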

11

Zelený, Jan. "Segmentace webových stránek s využitím shlukovacích technik." Doctoral thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2017. http://www.nusl.cz/ntk/nusl-412590.

Abstract:

Information retrieval and other techniques for mining data from web pages are gaining importance as web technologies develop and as the amount of information stored on the web, the sole carrier of this information, grows. Along with this amount of information, however, the amount of content that is in no way important in the context of the presented information also grows. This is one of the reasons why it is important to devote intensive attention to preprocessing the information stored on the web. Segmentation algorithms are one possible means of preprocessing. This thesis deals with the use of clustering techniques to make existing algorithms more efficient and to find entirely new algorithms usable for web page segmentation.

12

Leclerc, Tom. "Contributions for Advanced Service Discovery in Ad hoc Networks." Thesis, Nancy 1, 2011. http://www.theses.fr/2011NAN10133/document.

Abstract:

In the last decade, the number of wireless-capable devices has increased drastically, along with their popularity. Devices have also become more powerful and affordable, attracting more users to mobile networks. In this thesis we consider service discovery in Mobile Ad hoc NETworks (MANETs), collections of devices that communicate with each other spontaneously whenever they are within wireless transmission range, without any preexisting infrastructure. MANETs are characterized by highly dynamic nodes (because users move around), volatile wireless transmissions, and varied user behavior, services, and usage. This thesis proposes a complete solution for service discovery in ad hoc networks, from the underlying network layer up to service discovery itself. The first contribution is the Stable Linked Structure Flooding (SLSF) protocol, which builds a structure based on stable clusters and thereby provides scalable and efficient message dissemination. The second contribution is the Stable Linked Structure Routing (SLSR) protocol, which uses the SLSF dissemination structure to enable routing across the network. Building on these protocols, we propose to improve service discovery by additionally taking context awareness and adaptation into account. We also contributed to network simulation by coupling simulators and models from different domains, which together can model and simulate the rich and varied usage scenarios of MANETs and their human characteristics. This thesis was carried out within the ANR SARAH project, which aimed to deploy multimedia services in a hybrid ad hoc architecture.

13

Drushku, Krista. "User intent based recommendation for modern BI systems." Thesis, Tours, 2019. http://www.theses.fr/2019TOUR4001/document.

Abstract:

Storing large amounts of data complicates user interactions with Business Intelligence (BI) systems, since reaching the expected answer may require a long series of queries. Recommender systems appear to be a natural solution to help users complete their analysis. They try to discover user behaviors from past logs and suggest personalized actions by predicting lists of similarity scores, which may lead to redundant recommendations. Diversity is becoming essential to improving user satisfaction, so special interest is devoted to complementary recommendations. We studied two concrete data exploration problems in BI and propose discovering and leveraging user intents to provide two query recommenders. The first, an original reactive collaborative intent-based recommender, recommends sequences of queries for the user to pursue her analysis. The second proactively proposes a bundle of queries to complete a user's BI report, based on the user's context and intents.

14

Teboul, Bruno. "Le développement du neuromarketing aux Etats-Unis et en France. Acteurs-réseaux, traces et controverses." Thesis, Paris Sciences et Lettres (ComUE), 2016. http://www.theses.fr/2016PSLED036/document.

Abstract:

Our research compares the development of neuromarketing in the United States and in France. We start by analyzing the literature on neuromarketing. As our theoretical and methodological framework we use Actor-Network Theory (ANT), in the wake of the work of Bruno Latour and Michel Callon. We show how "human and non-human" entities ("actants"), namely actor-networks, traces (publications) and controversies, form the pillars of a new discipline such as neuromarketing. Our hybrid "qualitative-quantitative" approach allows us to build an applied methodology of ANT: bibliometric analysis (Publish or Perish), text mining, clustering and semantic analysis of the scientific and web literature on neuromarketing. From these results we build maps in the form of network graphs (Gephi) that reveal the interrelations and associations between actors, traces and controversies around neuromarketing.

15

Guruge, Deepani B. "Effective document clustering system for search engines." Thesis, 2008. https://figshare.com/articles/thesis/Effective_document_clustering_system_for_search_engines/21433218.

Abstract:

People use web search engines to fill a wide variety of navigational, informational and transactional needs. However, current major search engines on the web retrieve a large number of documents, of which only a small fraction are relevant to the user query. The user then has to search manually for relevant documents by traversing a topic hierarchy into which a collection is categorised. As more information becomes available, searching for the required relevant information becomes a time-consuming task.

This research develops an effective tool, the web document clustering (WDC) system, to cluster, and then rank, the output data obtained from queries submitted to a search engine into three pre-defined fuzzy clusters: closely related, related and not related. Documents in the closely related and related clusters are ranked based on their context.

The WDC output has been compared against document clustering results from the Google, Vivisimo and Dogpile systems, as these were considered the best at the fourth Search Engine Awards [24]. Test data came from standard document sets, such as the TREC-8 [118] data files and the Iris database [38], or from three test text retrieval tasks: "Latex", "Genetic Algorithms" and "Evolutionary Algorithms". Our proposed system produced results as good as, or better than, those obtained by these other systems. We have shown that the proposed system can effectively and efficiently locate closely related, related and not related documents among the retrieved document set for queries submitted to a search engine.

We developed a methodology to supply the user with a list of keywords filtered from the initial search result set to further refine the search. Again we tested our clustering results against the Google, Vivisimo and Dogpile systems. In all cases we found that our WDC system performs as well as, or better than, these systems.

The contributions of this research are:

A post-retrieval fuzzy document clustering algorithm that groups documents into closely related, related and not related clusters. This algorithm uses a modified fuzzy c-means (FCM) algorithm to cluster documents into predefined intelligent fuzzy clusters, an approach that has not been used before.
The fuzzy WDC system satisfies the user's information need as far as possible by allowing the user to reformulate the initial query. The system prepares an initial word list by selecting a few characteristic terms of high frequency from the first twenty documents in the initial search engine output. The user is then able to use these terms to input a secondary query. The WDC system then creates a second word list, or the context of the user query (COQ), from the closely related documents to provide training data to refine the search. Documents containing words with high frequency from the training list, based on a pre-defined threshold value, are then presented to the user to refine the search by reformulating the query. In this way the context of the user query is built, enabling the user to learn from the keyword list. This approach is not available in current search engine technology.
A number of modifications were made to the FCM algorithm to improve its performance in web document clustering. A factor sw_kq is introduced into the membership function as a measure of the amount of overlap between the components of the feature vector and the cluster prototype. As the FCM algorithm is greatly affected by the values used to initialise the components of cluster prototypes, a machine learning approach, using an Evolutionary Algorithm, was used to resolve the initialisation problem.
Experimental results indicate that the WDC system outperformed Google, Dogpile and the Vivisimo search engines. The post-retrieval fuzzy web document clustering algorithm designed in this research improves the precision of web searches and it also contributes to the knowledge of document retrieval using fuzzy logic.
A relational data model was used to automatically store data output from the search engine off-line. This moves the processing of Internet data off-line, saving resources and making better use of the local CPU.
This algorithm uses Latent Semantic Indexing (LSI) to rank documents in the closely related and related clusters. Using LSI to rank documents is well known; however, we are the first to apply it in the context of ranking closely related documents by using COQ to form the term-by-document matrix in LSI, to obtain better ranking results.
Adjustments based on document size are proposed for dealing with problems associated with varying document size in the retrieved documents and the effect this has on cluster analysis.
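
For orientation, the following is a minimal sketch of the standard fuzzy c-means iteration that the contributions above build on. It is not the thesis's modified algorithm: the sw_kq overlap factor and the Evolutionary Algorithm initialisation are not specified in the abstract and are not reproduced here. The function names, the explicit `init_centers` argument (a stand-in for the initialisation step) and the toy 2-D feature vectors are all assumptions made for illustration.

```python
import math

def memberships(point, centers, m=2.0):
    """Standard FCM membership of one point in every cluster (row sums to 1)."""
    # Clamp distances so a point sitting exactly on a prototype does not divide by zero.
    d = [max(math.dist(point, c), 1e-12) for c in centers]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((dk / dj) ** exp for dj in d) for dk in d]

def fuzzy_c_means(points, init_centers, m=2.0, iters=100, tol=1e-9):
    """Plain fuzzy c-means; returns (centers, membership matrix).

    init_centers is supplied by the caller because FCM is sensitive to
    initialisation (hypothetical stand-in for the thesis's evolutionary step).
    """
    dim = len(points[0])
    centers = [list(c) for c in init_centers]
    u = [memberships(p, centers, m) for p in points]
    for _ in range(iters):
        new_centers = []
        for k in range(len(centers)):
            # Each prototype is the mean of all points weighted by u_ik^m.
            w = [row[k] ** m for row in u]
            total = sum(w)
            new_centers.append(
                [sum(wi * p[d] for wi, p in zip(w, points)) / total for d in range(dim)]
            )
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        u = [memberships(p, centers, m) for p in points]
        if moved < tol:
            break
    return centers, u

# Toy feature vectors; the mapping of cluster index to label is an assumption.
LABELS = ("closely related", "related", "not related")
docs = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0), (9.9, 0.2)]
centers, u = fuzzy_c_means(docs, [(0, 0), (5, 5), (10, 0)])
for doc, row in zip(docs, u):
    print(doc, LABELS[max(range(len(row)), key=row.__getitem__)])
```

Because every document keeps a graded membership in all three clusters, a ranking step could order documents within a cluster by membership value; how the WDC system actually ranks them (LSI over the COQ word list) is described only at the level of the abstract above.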

16

"Incremental document clustering for web page classification." 2000. http://library.cuhk.edu.hk/record=b5890417.

Abstract:

by Wong, Wai-Chiu.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2000.
Includes bibliographical references (leaves 89-94).
Abstracts in English and Chinese.
Abstract
Acknowledgments
Chapter 1 --- Introduction
Chapter 1.1 --- Document Clustering
Chapter 1.2 --- DC-tree
Chapter 1.3 --- Feature Extraction
Chapter 1.4 --- Outline of the Thesis
Chapter 2 --- Related Work
Chapter 2.1 --- Clustering Algorithms
Chapter 2.1.1 --- Partitional Clustering Algorithms
Chapter 2.1.2 --- Hierarchical Clustering Algorithms
Chapter 2.2 --- Document Classification by Examples
Chapter 2.2.1 --- k-NN algorithm - Expert Network (ExpNet)
Chapter 2.2.2 --- Learning Linear Text Classifier
Chapter 2.2.3 --- Generalized Instance Set (GIS) algorithm
Chapter 2.3 --- Document Clustering
Chapter 2.3.1 --- B+-tree-based Document Clustering
Chapter 2.3.2 --- Suffix Tree Clustering
Chapter 2.3.3 --- Association Rule Hypergraph Partitioning Algorithm
Chapter 2.3.4 --- Principal Component Divisive Partitioning
Chapter 2.4 --- Projections for Efficient Document Clustering
Chapter 3 --- Background
Chapter 3.1 --- Document Preprocessing
Chapter 3.1.1 --- Elimination of Stopwords
Chapter 3.1.2 --- Stemming Technique
Chapter 3.2 --- Problem Modeling
Chapter 3.2.1 --- Basic Concepts
Chapter 3.2.2 --- Vector Model
Chapter 3.3 --- Feature Selection Scheme
Chapter 3.4 --- Similarity Model
Chapter 3.5 --- Evaluation Techniques
Chapter 4 --- Feature Extraction and Weighting
Chapter 4.1 --- Statistical Analysis of the Words in the Web Domain
Chapter 4.2 --- Zipf's Law
Chapter 4.3 --- Traditional Methods
Chapter 4.4 --- The Proposed Method
Chapter 4.5 --- Experimental Results
Chapter 4.5.1 --- Synthetic Data Generation
Chapter 4.5.2 --- Real Data Source
Chapter 4.5.3 --- Coverage
Chapter 4.5.4 --- Clustering Quality
Chapter 4.5.5 --- Binary Weight vs Numerical Weight
Chapter 5 --- Web Document Clustering Using DC-tree
Chapter 5.1 --- Document Representation
Chapter 5.2 --- Document Cluster (DC)
Chapter 5.3 --- DC-tree
Chapter 5.3.1 --- Tree Definition
Chapter 5.3.2 --- Insertion
Chapter 5.3.3 --- Node Splitting
Chapter 5.3.4 --- Deletion and Node Merging
Chapter 5.4 --- The Overall Strategy
Chapter 5.4.1 --- Preprocessing
Chapter 5.4.2 --- Building DC-tree
Chapter 5.4.3 --- Identifying the Interesting Clusters
Chapter 5.5 --- Experimental Results
Chapter 5.5.1 --- Alternative Similarity Measurement: Synthetic Data
Chapter 5.5.2 --- DC-tree Characteristics: Synthetic Data
Chapter 5.5.3 --- Compare DC-tree and B+-tree: Synthetic Data
Chapter 5.5.4 --- Compare DC-tree and B+-tree: Real Data
Chapter 5.5.5 --- Varying the Number of Features: Synthetic Data
Chapter 5.5.6 --- Non-Correlated Topic Web Page Collection: Real Data
Chapter 5.5.7 --- Correlated Topic Web Page Collection: Real Data
Chapter 5.5.8 --- Incremental updates on Real Data Set
Chapter 5.5.9 --- Comparison with the other clustering algorithms
Chapter 6 --- Conclusion
Appendix
Chapter A --- Stopword List
Chapter B --- Porter's Stemming Algorithm
Chapter C --- Insertion Algorithm
Chapter D --- Node Splitting Algorithm
Chapter E --- Features Extracted in Experiment 4.53
Bibliography

17

Sood, Sadhan. "Probabilistic Simhash Matching." Thesis, 2011. http://hdl.handle.net/1969.1/ETD-TAMU-2011-08-9813.

Abstract:

Finding near-duplicate documents is an interesting problem, but the existing methods are not suitable for large-scale datasets and memory-constrained systems. In this work, we developed approaches that tackle the problem of finding near-duplicates while improving query performance and using less memory. We then evaluated our method on a dataset of 70M web documents. The results indicated that, when the dataset is in memory, our method could reduce space by a factor of 5 while improving query time by a factor of 4, with a recall of 0.95 for finding all near-duplicates. With the same recall and the same reduction in space, we could improve query time by a factor of 4.5 when finding the first near-duplicate for an in-memory dataset. When the dataset was stored on disk, we could improve performance by a factor of 7 for finding all near-duplicates and by a factor of 14 when finding the first near-duplicate.
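
As background for the title, a simhash fingerprint reduces a document to a short bit string such that similar documents tend to differ in only a few bits. The sketch below shows only the basic, unweighted fingerprint and a Hamming-distance comparison; the probabilistic matching scheme developed in the thesis is not described in the abstract and is not reproduced. The use of MD5 as the per-token hash, the 64-bit width, the threshold `k=3` and all function names are assumptions chosen for illustration.

```python
import hashlib

def simhash(tokens, bits=64):
    """Unweighted simhash: each token votes +1/-1 on every bit position."""
    counts = [0] * bits
    for tok in tokens:
        # Take the first bits/8 bytes of an MD5 digest as the token hash (assumed choice).
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[: bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:  # A bit is set when more tokens voted 1 than 0.
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def is_near_duplicate(a, b, k=3):
    """Treat two fingerprints as near-duplicates if they differ in at most k bits."""
    return hamming(a, b) <= k

doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "the quick brown fox jumped over the lazy dog".split()
print(hamming(simhash(doc1), simhash(doc2)))
```

A linear scan comparing every pair of fingerprints does not scale to tens of millions of documents, which is the query-time and memory problem the abstract addresses.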
