Dissertations on the topic "Web document clustering (WDC)"
Format your source in APA, MLA, Chicago, Harvard, and other citation styles
Consult the top 17 dissertations for your research on the topic "Web document clustering (WDC)".
Next to every source in the list of references there is an "Add to bibliography" button. Click it, and we will automatically generate a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the scholarly publication as a .pdf file and read its online abstract, if these are available in the metadata.
Browse dissertations from a variety of disciplines and organize your bibliography correctly.
Coquet, Jean. "Étude exhaustive de voies de signalisation de grande taille par clustering des trajectoires et caractérisation par analyse sémantique." Thesis, Rennes 1, 2017. http://www.theses.fr/2017REN1S073/document.
Signaling pathways describe how a cell responds to external stimuli. They are indispensable in biological processes such as differentiation, proliferation and apoptosis. Systems biology seeks to study signaling pathways exhaustively using static or dynamic models. In large models, the number of solutions explaining a biological phenomenon (for example, the reaction of a cell to a stimulus) can be very high. This thesis first proposes several strategies for grouping the solutions that describe stimulus signaling, using clustering methods and Formal Concept Analysis. It then presents the characterization of the clusters with semantic web methods. These strategies have been applied to the TGF-beta signaling network, an extracellular stimulus that plays an important role in cancer growth, and helped identify 5 large groups of trajectories characterized by different biological processes. Finally, this thesis addresses the problem of translating heterogeneous data from different databases into a unique formalism, with the goal of generalizing the previous study. It proposes a strategy for merging the signaling pathways of a database into a unique model, and then computing every signaling trajectory of the stimulus.
Roussinov, Dmitri G., and Hsinchun Chen. "Document clustering for electronic meetings: an experimental comparison of two techniques." Elsevier, 1999. http://hdl.handle.net/10150/105091.
In this article, we report our implementation and comparison of two text clustering techniques. One is based on Ward's clustering and the other on Kohonen's Self-Organizing Maps. We evaluated how closely clusters produced by a computer resemble those created by human experts. We also measured the time it takes an expert to "clean up" the automatically produced clusters. The technique based on Ward's clustering was found to be more precise. Both techniques worked equally well in detecting associations between text documents. We used text messages obtained from group brainstorming meetings.
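The Ward-based technique compared here can be sketched as a plain agglomerative procedure: repeatedly merge the two clusters whose union least increases the within-cluster sum of squares. A minimal pure-Python sketch, assuming toy 2-D document vectors and a target cluster count `k` (not the paper's actual setup):

```python
from itertools import combinations

def ward_merge_cost(ca, na, cb, nb):
    """Increase in within-cluster sum of squares if two clusters merge:
    (na*nb / (na+nb)) * ||centroid_a - centroid_b||^2."""
    sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return (na * nb) / (na + nb) * sq

def ward_cluster(vectors, k):
    """Agglomerative Ward clustering down to k clusters.
    Each cluster is a (member_indices, centroid, size) triple."""
    clusters = [([i], list(v), 1) for i, v in enumerate(vectors)]
    while len(clusters) > k:
        # pick the pair of clusters whose merge is cheapest
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: ward_merge_cost(clusters[p[0]][1], clusters[p[0]][2],
                                                 clusters[p[1]][1], clusters[p[1]][2]))
        mi, ci, ni = clusters[i]
        mj, cj, nj = clusters[j]
        merged_centroid = [(ni * a + nj * b) / (ni + nj) for a, b in zip(ci, cj)]
        clusters[i] = (mi + mj, merged_centroid, ni + nj)
        del clusters[j]  # j > i, so index i is untouched
    return [sorted(members) for members, _, _ in clusters]

# Toy term vectors forming two obvious document groups
docs = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
print(ward_cluster(docs, 2))  # groups documents {0, 1} and {2, 3}
```

This O(n^3) greedy form is only illustrative; production implementations use the Lance-Williams recurrence for efficiency.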
Kellou-Menouer, Kenza. "Découverte de schéma pour les données du Web sémantique." Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLV047/document.
An increasing number of linked data sources are published on the Web. However, their schema may be incomplete or missing. In addition, data do not necessarily follow their schema. This flexibility in describing the data eases their evolution, but makes their exploitation more complex. In our work, we have proposed an automatic and incremental approach enabling schema discovery from the implicit structure of the data. To complement the description of the types in a schema, we have also proposed an approach for finding the possible versions (patterns) of each of them. It proceeds online, without having to download or browse the source, which can be expensive or even impossible because sources may impose access limitations, either on query execution time or on the number of queries. We have also addressed the problem of annotating the types in a schema, which consists in finding a set of labels capturing their meaning. We have proposed annotation algorithms which provide meaningful labels using external knowledge bases. Our approach can be used to find meaningful type labels during schema discovery, and also to enrich the description of existing types. Finally, we have proposed an approach to evaluate the gap between a data source and its schema. To this end, we have proposed a set of quality factors and the associated metrics, as well as a schema extension allowing the heterogeneity among instances of the same type to be reflected. Both the factors and the schema extension are used to analyze and improve the conformity between a schema and the instances it describes.
Zanghi, Hugo. "Approches modèles pour la structuration du web vu comme un graphe." Thesis, Evry-Val d'Essonne, 2010. http://www.theses.fr/2010EVRY0041/document.
The statistical analysis of complex networks is a challenging task, given that appropriate statistical models and efficient computational procedures are required in order for structures to be learned. The principle of these models is to assume that the distribution of the edge values follows a parametric distribution, conditionally on a latent structure which is used to detect connectivity patterns. However, these methods suffer from relatively slow estimation procedures, since the dependencies are complex. In this thesis we adapt online estimation strategies, originally developed for the EM algorithm, to the case of graph models. In addition to the network data used in the methods mentioned above, vertex content will sometimes be available. We then propose algorithms for clustering data sets that can be modeled with a graph structure embedding vertex features. Finally, an online Web application, based on the Exalead search engine, showcases certain aspects of this thesis.
Qumsiyeh, Rani Majed. "Easy to Find: Creating Query-Based Multi-Document Summaries to Enhance Web Search." BYU ScholarsArchive, 2011. https://scholarsarchive.byu.edu/etd/2713.
Saoud, Zohra. "Approche robuste pour l’évaluation de la confiance des ressources sur le Web." Thesis, Lyon, 2016. http://www.theses.fr/2016LYSE1331/document.
This thesis in Computer Science falls within the field of trust management, and more specifically recommendation systems. These systems are usually based on users' experiences (qualitative or quantitative) when interacting with Web resources (e.g., movies, videos and Web services). Recommender systems are undermined by three types of uncertainty that arise from users' ratings and identities, which can be questioned, and from variations in Web resources' performance at run-time. We propose a robust approach for trust assessment under these uncertainties. The first type of uncertainty refers to users' ratings. It stems from the vulnerability of the system in the presence of malicious users providing false ratings. To tackle this uncertainty, we propose a fuzzy model of users' credibility. This model uses a fuzzy clustering technique to distinguish malicious users from strict users, who are usually excluded in existing approaches. The second type of uncertainty refers to users' identities. Indeed, a malicious user purposely creates virtual identities to provide false ratings. To tackle this type of attack, known as a Sybil attack, we propose a ratings filtering model based on users' credibility and the trust graph to which they belong. We propose two mechanisms, one for assigning capacities to users and one for selecting the users whose ratings will be retained when evaluating trust. The first mechanism reduces the attack capacity of Sybil users. The second chooses paths in the trust graph that include trusted users with maximum capacities. Both mechanisms use users' credibility as a heuristic. To deal with the uncertainty over the capacity of a Web resource to satisfy users' requests, we propose two approaches for Web resource trust assessment, one deterministic and one probabilistic. The first consolidates users' ratings, taking into account users' credibility values. The second relies on probability theory coupled with possible-worlds semantics. Probabilistic databases offer a better representation of the uncertainty underlying users' credibility and also permit an uncertain assessment of resource trust. Finally, we developed the system WRTrust (Web Resource Trust), implementing our trust assessment approach. We carried out several experiments to evaluate the performance and robustness of our system. The results show that trust quality is significantly improved, as is the system's robustness in the presence of false-rating attacks and Sybil attacks.
Ghenname, Mérième. "Le web social et le web sémantique pour la recommandation de ressources pédagogiques." Thesis, Saint-Etienne, 2015. http://www.theses.fr/2015STET4015/document.
This work was jointly supervised by U. Jean Monnet Saint Etienne, in the Hubert Curien Lab (Frederique Laforest, Christophe Gravier, Julien Subercaze), and U. Mohamed V Rabat, LeRMA ENSIAS (Rachida Ahjoun, Mounia Abik). Knowledge, education and learning are major concerns in today's society. Technologies for human learning aim to promote, stimulate, support and validate the learning process. Our approach explores the opportunities raised by mixing Social Web and Semantic Web technologies for e-learning. More precisely, we work on discovering learners' profiles from their activities on the social web. The Social Web can be a source of information, as it involves users in the information world and gives them the ability to participate in the construction and dissemination of knowledge. We focused our attention on tracking the different types of contributions, activities and conversations in learners' spontaneous collaborative activities on social networks. The learner profile is based not only on the knowledge extracted from his/her activities on the e-learning system, but also on his/her many activities on social networks. We propose a methodology for exploiting the hashtags contained in users' writings for the automatic generation of learners' semantic profiles. Hashtags require some processing before becoming a source of knowledge about user interests. We have defined a method to identify the semantics of hashtags and the semantic relationships between the meanings of different hashtags. Along the way, we have defined the concept of the Folksionary, a hashtag dictionary that, for each hashtag, clusters its definitions into meanings. Semantized hashtags are then used to feed the learner's profile so as to personalize recommendations of learning material. The goal is to build a semantic representation of the activities and interests of learners on social networks in order to enrich their profiles. We also present our recommendation approach, based on three types of filtering (personalized, social, and statistical interactions with the system). We focus on the personalized recommendation of pedagogical resources to the learner according to his/her expectations and profile.
Luu, Vinh Trung. "Using event sequence alignment to automatically segment web users for prediction and recommendation." Thesis, Mulhouse, 2016. http://www.theses.fr/2016MULH0098/document.
This thesis explored the application of sequence alignment in web usage mining, including user clustering and web prediction and recommendation. This topic was chosen because online business has developed rapidly and gathered a huge volume of information, while the use of sequence alignment in the field is still limited. In this context, researchers need to build models that rely on sequence alignment methods and to empirically assess their relevance to user behavioral mining. This thesis presents a novel methodological point of view in the area and shows applicable approaches in our quest to improve previous related work. Web usage behavior analysis has been central to a large number of investigations aimed at maintaining the relation between users and web services. Useful information extraction has been addressed by web content providers to understand users' needs, so that their content can be adapted accordingly. One of the promising approaches to reach this target is pattern discovery using clustering, which groups users who show similar behavioral characteristics. Our research goal is to perform user clustering, in real time, based on their session similarity.
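Session similarity via sequence alignment, as used here, is classically computed with the Needleman-Wunsch dynamic program over event sequences. A minimal sketch; the scoring parameters and the toy sessions are illustrative assumptions, not the thesis's actual configuration:

```python
def align_score(s, t, match=1, mismatch=-1, gap=-1):
    """Global alignment score (Needleman-Wunsch) between two event sequences."""
    prev = [j * gap for j in range(len(t) + 1)]  # row for the empty prefix of s
    for i in range(1, len(s) + 1):
        cur = [i * gap]
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,   # align s[i-1] with t[j-1]
                           prev[j] + gap,       # gap in t
                           cur[j - 1] + gap))   # gap in s
        prev = cur
    return prev[-1]

# Two web sessions as sequences of page identifiers
a = ["home", "search", "product", "cart"]
b = ["home", "product", "cart", "checkout"]
print(align_score(a, b))  # higher score = more similar browsing behavior
```

The pairwise scores can then feed any clustering method (e.g., the agglomerative ones above) to group users by session similarity.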
Anderson, James D. "Interactive Visualization of Search Results of Large Document Sets." Wright State University / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=wright1547048073451373.
Attiaoui, Dorra. "Belief detection and temporal analysis of experts in question answering communities : case study on stack overflow." Thesis, Rennes 1, 2017. http://www.theses.fr/2017REN1S085/document.
During the last decade, people have changed the way they seek information online. Between question answering communities, specialized websites and social networks, the Web has become one of the most widespread platforms for information exchange and retrieval. Question answering communities provide an easy and quick way to search for information on any topic: the user only has to ask a question and wait for the other members of the community to respond. Any person posting a question expects accurate and helpful answers. Within these platforms, we want to find experts, the key users who share their knowledge with the other members of the community. Expert detection in question answering communities has become important for several reasons, such as providing high-quality content and getting valuable answers. In this thesis, we propose a general measure of expertise based on the theory of belief functions. Also called the mathematical theory of evidence, it is one of the best-known approaches for reasoning under uncertainty. In order to identify experts among the other users in the community, we first focused on finding the most important features that describe every individual. Next, we developed a model founded on the theory of belief functions to estimate the general expertise of contributors. This measure allows us to classify users and detect the most knowledgeable persons. Once this metric is defined, we examine the temporal evolution of users' behavior, analyzing users' activity over several months in the community and describing how users evolve during their time on the platform. We are also interested in detecting potential experts early in their activity. The effectiveness of these approaches is evaluated on real data from Stack Overflow.
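The core operation of the theory of belief functions invoked here is combining evidence from independent sources with Dempster's rule. A minimal sketch; the two-hypothesis frame {expert, novice} and the mass values are invented for illustration, not taken from the thesis:

```python
def dempster_combine(m1, m2):
    """Dempster's rule of combination for two mass functions.
    Masses are dicts mapping frozenset hypotheses to weights summing to 1."""
    combined, conflict = {}, 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: sources are incompatible")
    # renormalize by the non-conflicting mass
    return {s: w / (1.0 - conflict) for s, w in combined.items()}

E = frozenset({"expert"})
THETA = frozenset({"expert", "novice"})       # the whole frame (ignorance)
m1 = {E: 0.6, THETA: 0.4}                     # evidence from one user feature
m2 = {E: 0.5, THETA: 0.5}                     # evidence from another feature
print(dempster_combine(m1, m2))               # belief in "expert" rises to 0.8
```

Each user feature contributes one mass function; combining them yields the overall expertise mass, whose value on the "expert" hypothesis can rank users.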
Zelený, Jan. "Segmentace webových stránek s využitím shlukovacích technik." Doctoral thesis, Vysoké učení technické v Brně. Fakulta informačních technologií, 2017. http://www.nusl.cz/ntk/nusl-412590.
Повний текст джерелаLeclerc, Tom. "Contributions for Advanced Service Discovery in Ad hoc Networks." Thesis, Nancy 1, 2011. http://www.theses.fr/2011NAN10133/document.
In the last decade, the number of wireless-capable devices increased drastically, along with their popularity. Devices also became more powerful and affordable, attracting more users to mobile networks. In this thesis we consider service discovery in Mobile Ad hoc NETworks (MANETs): collections of devices that communicate with each other spontaneously whenever they are in wireless transmission range, without any preexisting infrastructure. The main characteristics of MANETs are the high dynamics of nodes (induced by users moving around), the volatile wireless transmissions, the user behavior, and the services and their usage. This thesis proposes a complete solution for service discovery in ad hoc networks, from the underlying network up to the service discovery itself. A first contribution is the Stable Linked Structure Flooding (SLSF) protocol, which creates a stable cluster-based structure and thereby provides scalable and efficient message dissemination. The second contribution is the Stable Linked Structure Routing (SLSR) protocol, which uses the SLSF dissemination structure to enable routing. Using these protocols as a basis, we propose to improve service discovery by additionally considering context awareness and adaptation. Moreover, we also contributed to improving simulations by coupling simulators and models that, together, can model and simulate the variety and richness of ad hoc usage scenarios and their human characteristics.
Drushku, Krista. "User intent based recommendation for modern BI systems." Thesis, Tours, 2019. http://www.theses.fr/2019TOUR4001/document.
The storage of large amounts of data may lead to a series of long queries on the way to the expected solution, which complicates user interaction with Business Intelligence (BI) systems. Recommender systems appear as a natural solution to help users complete their analyses. They try to discover user behaviors from past logs and to suggest personalized actions by predicting lists of likeness scores, which may lead to redundant recommendations. Nowadays, diversity is becoming essential to improving user satisfaction, so a special interest is dedicated to complementary recommendation. We studied two concrete data exploration problems in BI and propose to discover and leverage user intents to provide two query recommenders. The first, an original reactive collaborative intent-based recommender, recommends sequences of queries for the user to pursue her analysis. The second proactively proposes a bundle of queries to complete the user's BI report, based on the user's intents.
Teboul, Bruno. "Le développement du neuromarketing aux Etats-Unis et en France. Acteurs-réseaux, traces et controverses." Thesis, Paris Sciences et Lettres (ComUE), 2016. http://www.theses.fr/2016PSLED036/document.
Our research explores the comparative development of neuromarketing in the United States and France. We start by analyzing the literature on neuromarketing. As our theoretical and methodological framework we use Actor-Network Theory (ANT), in the wake of the work of Bruno Latour and Michel Callon. We show how "human and non-human" entities ("actants") such as actor-networks, traces (publications) and controversies form the pillars of a new discipline like neuromarketing. Our hybrid qualitative-quantitative approach allows us to build an applied methodology of ANT: bibliometric analysis (Publish Or Perish), text mining, clustering, and semantic analysis of the scientific and web literature on neuromarketing. From these results, we build data visualizations and network graph maps (Gephi) that reveal the interrelations and associations between actors, traces and controversies around neuromarketing.
(14030507), Deepani B. Guruge. "Effective document clustering system for search engines." Thesis, 2008. https://figshare.com/articles/thesis/Effective_document_clustering_system_for_search_engines/21433218.
People use web search engines to fill a wide variety of navigational, informational and transactional needs. However, current major web search engines retrieve a large number of documents, of which only a small fraction are relevant to the user's query. The user then has to search manually for relevant documents by traversing a topic hierarchy into which a collection is categorised. As more information becomes available, searching for the required relevant information becomes a time-consuming task.
This research develops an effective tool, the web document clustering (WDC) system, to cluster, and then rank, the output data obtained from queries submitted to a search engine into three pre-defined fuzzy clusters: closely related, related, and not related. Documents in the closely related and related clusters are ranked based on their context.
The WDC output has been compared against document clustering results from the Google, Vivisimo and Dogpile systems, as these were considered the best at the fourth Search Engine Awards [24]. Test data came from standard document sets, such as the TREC-8 [118] data files and the Iris database [38], or from three text retrieval test tasks: "Latex", "Genetic Algorithms" and "Evolutionary Algorithms". Our proposed system produced results as good as, or better than, those obtained by these other systems. We have shown that the proposed system can effectively and efficiently locate closely related, related and not related documents among the retrieved document set for queries submitted to a search engine.
We developed a methodology to supply the user with a list of keywords, filtered from the initial search result set, to further refine the search. Again we tested our clustering results against the Google, Vivisimo and Dogpile systems. In all cases we found that our WDC performs as well as, or better than, these systems.
The contributions of this research are:
- A post-retrieval fuzzy document clustering algorithm that groups documents into closely related, related and not related clusters. This algorithm uses a modified fuzzy c-means (FCM) algorithm to cluster documents into predefined intelligent fuzzy clusters, an approach that has not been used before.
- The fuzzy WDC system satisfies the user's information need as far as possible by allowing the user to reformulate the initial query. The system prepares an initial word list by selecting a few high-frequency characteristic terms from the first twenty documents in the initial search engine output. The user is then able to use these terms to input a secondary query. The WDC system then creates a second word list, the context of the user query (COQ), from the closely related documents to provide training data to refine the search. Documents containing high-frequency words from the training list, based on a pre-defined threshold value, are then presented to the user, who refines the search by reformulating the query. In this way the context of the user query is built, enabling the user to learn from the keyword list. This approach is not available in current search engine technology.
- A number of modifications were made to the FCM algorithm to improve its performance in web document clustering. A factor swkq is introduced into the membership function as a measure of the amount of overlapping between the components of the feature vector and the cluster prototype. As the FCM algorithm is greatly affected by the values used to initialise the components of cluster prototypes, a machine learning approach using an Evolutionary Algorithm was employed to resolve the initialisation problem.
- Experimental results indicate that the WDC system outperformed Google, Dogpile and the Vivisimo search engines. The post-retrieval fuzzy web document clustering algorithm designed in this research improves the precision of web searches and it also contributes to the knowledge of document retrieval using fuzzy logic.
- A relational data model was used to automatically store data output from the search engine off-line. This takes the processing of Internet data off-line, saving resources and making better use of the local CPU.
- This algorithm uses Latent Semantic Indexing (LSI) to rank documents in the closely related and related clusters. Using LSI to rank documents is well known; however, we are the first to apply it in the context of ranking closely related documents by using the COQ to form the term × document matrix in LSI, to obtain better ranking results.
- Adjustments based on document size are proposed for dealing with problems associated with varying document size in the retrieved documents and the effect this has on cluster analysis.
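The fuzzy c-means loop that the contributions above modify alternates between weighted centroid updates and membership updates. This sketch is plain FCM only: the thesis's swkq overlap factor and evolutionary initialisation are not reproduced, and the fuzzifier m, iteration count and toy data are illustrative assumptions:

```python
import math
import random

def fuzzy_c_means(points, c, m=2.0, iters=50, seed=0):
    """Plain fuzzy c-means: returns (memberships u[i][j], centroids)."""
    rng = random.Random(seed)
    n, d = len(points), len(points[0])
    # random memberships, each row normalised to sum to 1
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-6 for _ in range(c)]
        total = sum(row)
        u.append([x / total for x in row])
    centers = [[0.0] * d for _ in range(c)]
    for _ in range(iters):
        # centroid update: weighted mean of points with weights u^m
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            tw = sum(w)
            centers[j] = [sum(w[i] * points[i][k] for i in range(n)) / tw
                          for k in range(d)]
        # membership update from relative distances to all centroids
        for i in range(n):
            dist = [max(math.dist(points[i], centers[j]), 1e-12) for j in range(c)]
            for j in range(c):
                u[i][j] = 1.0 / sum((dist[j] / dist[k]) ** (2.0 / (m - 1.0))
                                    for k in range(c))
    return u, centers

pts = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
u, centers = fuzzy_c_means(pts, 2)
```

The graded memberships u[i][j] play the role of the "closely related / related / not related" degrees; the swkq factor described above would enter this membership update.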
"Incremental document clustering for web page classification." 2000. http://library.cuhk.edu.hk/record=b5890417.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2000.
Includes bibliographical references (leaves 89-94).
Abstracts in English and Chinese.
Abstract
Acknowledgments
Chapter 1: Introduction
1.1 Document Clustering
1.2 DC-tree
1.3 Feature Extraction
1.4 Outline of the Thesis
Chapter 2: Related Work
2.1 Clustering Algorithms
2.1.1 Partitional Clustering Algorithms
2.1.2 Hierarchical Clustering Algorithms
2.2 Document Classification by Examples
2.2.1 k-NN Algorithm - Expert Network (ExpNet)
2.2.2 Learning Linear Text Classifier
2.2.3 Generalized Instance Set (GIS) Algorithm
2.3 Document Clustering
2.3.1 B+-tree-based Document Clustering
2.3.2 Suffix Tree Clustering
2.3.3 Association Rule Hypergraph Partitioning Algorithm
2.3.4 Principal Component Divisive Partitioning
2.4 Projections for Efficient Document Clustering
Chapter 3: Background
3.1 Document Preprocessing
3.1.1 Elimination of Stopwords
3.1.2 Stemming Technique
3.2 Problem Modeling
3.2.1 Basic Concepts
3.2.2 Vector Model
3.3 Feature Selection Scheme
3.4 Similarity Model
3.5 Evaluation Techniques
Chapter 4: Feature Extraction and Weighting
4.1 Statistical Analysis of the Words in the Web Domain
4.2 Zipf's Law
4.3 Traditional Methods
4.4 The Proposed Method
4.5 Experimental Results
4.5.1 Synthetic Data Generation
4.5.2 Real Data Source
4.5.3 Coverage
4.5.4 Clustering Quality
4.5.5 Binary Weight vs Numerical Weight
Chapter 5: Web Document Clustering Using DC-tree
5.1 Document Representation
5.2 Document Cluster (DC)
5.3 DC-tree
5.3.1 Tree Definition
5.3.2 Insertion
5.3.3 Node Splitting
5.3.4 Deletion and Node Merging
5.4 The Overall Strategy
5.4.1 Preprocessing
5.4.2 Building DC-tree
5.4.3 Identifying the Interesting Clusters
5.5 Experimental Results
5.5.1 Alternative Similarity Measurement: Synthetic Data
5.5.2 DC-tree Characteristics: Synthetic Data
5.5.3 Comparing DC-tree and B+-tree: Synthetic Data
5.5.4 Comparing DC-tree and B+-tree: Real Data
5.5.5 Varying the Number of Features: Synthetic Data
5.5.6 Non-Correlated Topic Web Page Collection: Real Data
5.5.7 Correlated Topic Web Page Collection: Real Data
5.5.8 Incremental Updates on Real Data Set
5.5.9 Comparison with Other Clustering Algorithms
Chapter 6: Conclusion
Appendix A: Stopword List
Appendix B: Porter's Stemming Algorithm
Appendix C: Insertion Algorithm
Appendix D: Node Splitting Algorithm
Appendix E: Features Extracted in Experiment 4.5.3
Bibliography
Sood, Sadhan. "Probabilistic Simhash Matching." Thesis, 2011. http://hdl.handle.net/1969.1/ETD-TAMU-2011-08-9813.