Dissertations / Theses on the topic 'Graphe de recommandation'
Consult the top 20 dissertations / theses for your research on the topic 'Graphe de recommandation.'
Slaimi, Fatma. "Découverte et recommandation de services Web." Thesis, Aix-Marseille, 2017. http://www.theses.fr/2017AIXM0069.
The Web has become a universal platform for content hosting and for distributed heterogeneous applications that can be accessed manually or automatically. In this context, Web services have established themselves as a key technology for deploying interactions across applications. Standard Web services technologies allow and facilitate the manual programming of these applications. To promote automatic programming based on Web services, a major problem arises: that of their discovery. Several approaches addressing this problem have been proposed in the literature. The aim of this thesis is to improve the Web services discovery process. We propose three approaches. The first is a Web services discovery approach that combines several matching techniques. The second consists of validating the services returned by an automatic discovery process using users' competencies. These approaches do not take into account the evolution of services over time or user preferences. To address these shortcomings, several approaches incorporate recommendation techniques to assist the discovery process. A large majority of these approaches are based on assessments of QoS properties. In practice, these assessments are rarely available. In other systems, trust relationships between users and services are used. These relationships are established based on evaluations of invocations of similar services. However, invoking the same service does not necessarily mean having the same preferences. Hence, in our third approach, we propose using relations of interest between users to recommend services. The approach relies on modeling the services' ecosystem with graph databases.
Nzekon, Nzeko'o Armel Jacques. "Système de recommandation avec dynamique temporelle basée sur les flots de liens." Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS454.
Recommending appropriate items to users is crucial in many e-commerce platforms that offer a large number of items. Recommender systems are a favored solution for this task. Most research in this area is based on explicit ratings that users give to items, yet most of the time ratings are not available in sufficient quantities. In these situations, it is important that recommender systems use implicit data, namely link streams that connect users to items while keeping timestamps, i.e. users' browsing, purchase and streaming histories. We exploit this type of implicit data in this thesis. One common approach consists in selecting the N most relevant items for each user, for a given N, which is called top-N recommendation. To do so, recommender systems rely on various kinds of information, such as content-based features of items, past interest of users in items, and trust between users. However, they often use only one or two such pieces of information simultaneously, which can limit their performance because a user's interest in an item can depend on more than two types of side information. To address this limitation, we make three contributions in the field of graph-based recommender systems. The first is an extension of the Session-based Temporal Graph (STG) introduced by Xiang et al., a dynamic graph that combines long-term and short-term preferences in order to better capture user preferences over time. STG ignores content-based features of items and makes no distinction between the weights of newer and older edges. The newly proposed graph, Time-weight Content-based STG, addresses these limitations by adding a new node type for content-based features of items and by penalizing older edges. The second contribution is the Link Stream Graph (LSG) for temporal recommendation. This graph is inspired by a formal representation of link streams and has the particularity of considering time in a continuous way, unlike other state-of-the-art graphs, which either ignore the temporal dimension, like the classical bipartite graph (BIP), or consider time discontinuously, like STG, where time is divided into slices. The third contribution of this thesis is GraFC2T2, a general graph-based framework for top-N recommendation. This framework integrates basic recommender graphs and enriches them with content-based features of items, the temporal dynamics of users' preferences, and trust relationships between users. Implementations of these three contributions on the CiteULike, Delicious, Last.fm, Ponpare, Epinions and Ciao datasets confirm their relevance.
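The idea of penalizing older edges in a user-item graph built from a link stream can be illustrated with a small sketch. This is not the STG, LSG or GraFC2T2 construction from the thesis: the exponential decay, the `half_life` parameter, the two-hop scoring rule and the function names are all assumptions chosen to keep the example short.

```python
from collections import defaultdict


def edge_weight(t, now, half_life):
    """Exponential time decay (an assumed penalty): a link loses half
    its weight every `half_life` time units."""
    return 0.5 ** ((now - t) / half_life)


def top_n(link_stream, user, n, now, half_life=30.0):
    """Rank unseen items for `user` on a time-weighted bipartite graph.

    `link_stream` is an iterable of (user, item, timestamp) triples.
    """
    user_items = defaultdict(dict)
    item_users = defaultdict(dict)
    for u, i, t in link_stream:
        w = edge_weight(t, now, half_life)
        # Keep only the strongest (most recent) link per user-item pair.
        if w > user_items[u].get(i, 0.0):
            user_items[u][i] = w
            item_users[i][u] = w

    seen = user_items[user]
    scores = defaultdict(float)
    # Walk user -> item -> other user -> candidate item, multiplying weights.
    for i, w_ui in seen.items():
        for v, w_vi in item_users[i].items():
            if v == user:
                continue
            for j, w_vj in user_items[v].items():
                if j not in seen:
                    scores[j] += w_ui * w_vi * w_vj
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [j for j, _ in ranked[:n]]


# Hypothetical link stream: u2 interacted recently, u3 long ago.
stream = [
    ("u1", "A", 100), ("u2", "A", 100), ("u2", "B", 100),
    ("u3", "A", 10), ("u3", "C", 10),
]
recs = top_n(stream, "u1", n=2, now=100)
```

In this toy stream, item B (reached through recent links) is ranked ahead of item C (reached through old links), which is the effect an edge-age penalty is meant to produce.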
Ruas, Olivier. "The many faces of approximation in KNN graph computation." Thesis, Rennes 1, 2018. http://www.theses.fr/2018REN1S088/document.
The huge quantity of content available in online services makes content of interest very difficult to find. The most emblematic way to help users is item recommendation. The K-Nearest-Neighbors (KNN) graph connects each user to its k most similar other users, according to a given similarity metric. The computation time of an exact KNN graph is prohibitive in online services. Existing approaches approximate the set of candidates for each user's neighborhood to decrease the computation time. In this thesis we push the notion of approximation further: we approximate the data of each user, the similarity, and the data locality. The resulting approach clearly outperforms all the others.
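To make the cost of the exact computation concrete, here is a minimal brute-force sketch of a KNN graph. It is only the baseline that the abstract calls prohibitive, not any of the thesis's approximations; the choice of Jaccard similarity over item sets and the names `jaccard` and `exact_knn_graph` are assumptions for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two item sets (assumed metric)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def exact_knn_graph(profiles, k):
    """Brute-force KNN graph: every user is compared with every other
    user, i.e. O(n^2) similarity computations for n users."""
    graph = {}
    for u, items_u in profiles.items():
        scored = [
            (jaccard(items_u, items_v), v)
            for v, items_v in profiles.items()
            if v != u
        ]
        # Highest similarity first; ties broken by user id for determinism.
        scored.sort(key=lambda sv: (-sv[0], sv[1]))
        graph[u] = [v for _, v in scored[:k]]
    return graph


# Hypothetical user profiles (sets of consumed items).
profiles = {
    "alice": {"a", "b", "c"},
    "bob": {"a", "b", "d"},
    "carol": {"c", "e"},
    "dave": {"x", "y"},
}
knn = exact_knn_graph(profiles, k=2)
```

The nested loop over all pairs of users is what makes the exact graph expensive at scale, and it is the part that approximate methods try to avoid.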
Peoples, Bruce E. "Méthodologie d'analyse du centre de gravité de normes internationales publiées : une démarche innovante de recommandation." Thesis, Paris 8, 2016. http://www.theses.fr/2016PA080023.
“Standards make a positive contribution to the world we live in. They facilitate trade, spread knowledge, disseminate innovative advances in technology, and share good management and conformity assessment practices.” There are a multitude of standards organizations and standards consortia producing market-relevant standards, specifications, and technical reports in the domain of Information Communication Technology (ICT). With ICT-related standards and specifications numbering in the thousands, it is not readily apparent to users how these standards inter-relate to form the basis of technical interoperability. There is a need to develop and document a process to identify how standards inter-relate to form a basis of interoperability in multiple contexts: at a general horizontal technology level that covers all domains, and within specific vertical technology domains and sub-domains. By analyzing which standards inter-relate through normative referencing, key standards can be identified as technical centers of gravity, allowing identification of specific standards that are required for the successful implementation of the standards that normatively reference them, and that form a basis for interoperability across horizontal and vertical technology domains. This thesis focuses on defining a methodology to analyze ICT standards in order to identify normatively referenced standards that form technical centers of gravity, using Data Mining (DM) and Social Network Analysis (SNA) graph technologies as a basis of analysis. As a proof of concept, the methodology focuses on the International Standards (IS) published by the International Organization for Standardization/International Electrotechnical Commission Joint Technical Committee 1, Sub-committee 36, Learning, Education, and Training (ISO/IEC JTC1 SC36). The process is designed to be scalable to larger document sets within ISO/IEC JTC1 covering all JTC1 Sub-Committees, and possibly other Standards Development Organizations (SDOs). Chapter 1 provides a review of the literature on previous standards analysis projects and an analysis of the components used in this thesis, such as data mining and graph theory. Chapter 2 focuses on identifying a dataset for testing the developed methodology, containing the published International Standards needed for analysis and forming specific technology domains and sub-domains. Chapter 3 describes the specific methodology developed to analyze published International Standards documents, and to create and analyze the graphs that identify technical centers of gravity. Chapter 4 presents the analysis of data that identifies technical center of gravity standards for the ICT learning, education, and training standards produced in ISO/IEC JTC1 SC36. Conclusions of the analysis are contained in Chapter 5. Recommendations for further research using the output of the developed methodology are contained in Chapter 6.
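As a rough illustration of finding centers of gravity through normative referencing, the sketch below counts how many distinct standards normatively reference each standard (the in-degree in the reference graph). The abstract does not say which graph measure the thesis uses, so in-degree is an assumption here, and the standard identifiers are hypothetical placeholders.

```python
from collections import Counter


def centers_of_gravity(normative_refs, top=3):
    """Rank standards by how many other standards normatively reference them.

    `normative_refs` maps a standard id to the ids it normatively references.
    """
    indegree = Counter()
    for source, targets in normative_refs.items():
        # Count each referencing standard once and ignore self-references.
        for target in set(targets):
            if target != source:
                indegree[target] += 1
    return sorted(indegree.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Hypothetical reference lists, not taken from any real standard.
refs = {
    "STD-A": ["STD-X", "STD-Y"],
    "STD-B": ["STD-X"],
    "STD-C": ["STD-X", "STD-Y"],
    "STD-X": [],
}
ranking = centers_of_gravity(refs, top=2)
```

Other centrality measures from social network analysis could be substituted for the in-degree count without changing the overall shape of the computation.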
Poulain, Rémy. "Analyse et modélisation de la diversité des structures relationnelles à l'aide de graphes multipartis." Electronic Thesis or Diss., Sorbonne université, 2020. http://www.theses.fr/2020SORUS453.
There is no longer any need to prove that digital technology, the Internet and the Web have led to a revolution, particularly in the way people get information. Like any revolution, it is followed by a series of issues: equal treatment of users and suppliers, ecologically sustainable consumption, freedom of expression and censorship, etc. Research needs to provide a clear vision of these stakes. Among these issues are two phenomena: the echo chamber and the filter bubble. Both are linked to the lack of diversity of the information visible on the Internet, and one may wonder about the impact of recommendation algorithms. Although this is our primary motivation, we move away from this subject to propose a general scientific framework for analyzing diversity. We find that the graph formalism is well suited to representing relational data. More precisely, we analyze relational data with entities of different natures. This is why we chose the n-partite graph formalism, which can represent a great diversity of data. Although the first data we studied are related to recommendation algorithms (music consumption or purchases of articles on a platform), we will see over the course of the manuscript how this formalism can be adapted to other types of data (politicized users on Twitter, guests of television shows, establishment of NGOs in different states, etc.). This study has several objectives: (1) mathematically define diversity indicators on n-partite graphs; (2) algorithmically define how to calculate them; (3) program these algorithms to make them a usable software object; (4) use these programs on quite varied data; (5) examine the different meanings that our indicators can have. We begin by describing the mathematical formalism necessary for our study. We then apply our mathematical object to basic examples to see all the possibilities it offers. This shows the importance of normalizing our indicators and motivates the study of random normalization. We then examine another series of examples that take our indicators further, going beyond the static, tripartite setting to graphs with more layers and with time dependence. To get a better view of what the real data contribute, we also study our indicators on completely randomly generated graphs.
Dadoun, Amine. "Semantic data driven approach for merchandizing optimization." Electronic Thesis or Diss., Sorbonne université, 2021. http://www.theses.fr/2021SORUS191.
The overall objective of this PhD is to explore and propose new approaches leveraging a large volume of heterogeneous data that needs to be integrated and semantically enriched, and recent advances in machine and deep learning techniques, in order to exploit both the increased variety of offers that an airline can make to its customers as well as the knowledge it has of its customers with the ultimate goal of optimizing conversion and purchase. The overall goal of this thesis can be broken down into three main research questions: 1) What piece of content (ancillary services, third-party content) should be recommended and personalized to each traveler? 2) When should a recommendation be made and for which communication channel to optimize conversion? 3) How do we group ancillary services and third-party content and can we learn what often goes together based on purchase logs?
Lully, Vincent. "Vers un meilleur accès aux informations pertinentes à l’aide du Web sémantique : application au domaine du e-tourisme." Thesis, Sorbonne université, 2018. http://www.theses.fr/2018SORUL196.
This thesis starts with the observation that there is an increasing infobesity on the Web. The two main types of tools designed to help us explore Web data, namely the search engine and the recommender system, have several problems: (1) in helping users express their explicit information needs, (2) in selecting relevant documents, and (3) in showcasing the selected documents. We propose several approaches using Semantic Web technologies to remedy these problems and to improve access to relevant information. In particular, we propose: (1) a semantic auto-completion approach which helps users formulate longer and richer search queries, (2) several recommendation approaches using the hierarchical and transversal links in knowledge graphs to improve the relevance of the recommendations, (3) a semantic affinity framework to integrate semantic and social data to yield qualitatively balanced recommendations in terms of relevance, diversity and novelty, (4) several recommendation explanation approaches aiming at improving relevance, intelligibility and user-friendliness, (5) two image user profiling approaches and (6) an approach which selects the best images to accompany the recommended documents in recommendation banners. We implemented and applied our approaches in the e-tourism domain. They have been evaluated quantitatively with ground-truth datasets and qualitatively through user studies.
Benchettara, Nasserine. "Prévision de nouveaux liens dans les réseaux d'interactions bipartis : Application au calcul de recommandation." Paris 13, 2011. http://scbd-sto.univ-paris13.fr/secure/edgalilee_th_2011_benchettara.pdf.
In this work, we address the problem of predicting new links in dynamic complex networks. We mainly focus on networks that have an underlying bipartite structure. We propose to apply a propositionalization approach in which each pair of nodes in the network is described by a set of topological measures. A first contribution of this thesis is to consider measures computed in the bipartite graph and also in the associated projected graphs. A supervised machine learning approach is applied. Although this approach gives some good results, it suffers from the obvious problem of class skewness. We hence focus on handling this problem. Informed sub-sampling approaches are proposed first. A semi-supervised machine learning approach is also applied. All proposed approaches are applied and evaluated on real datasets used in real applications: academic collaboration recommendation and product recommendation on an e-commerce site.
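The propositionalization step, in which a candidate user-item pair is described by topological measures from both the bipartite graph and a projected graph, can be sketched as follows. The two measures shown (preferential attachment on the bipartite graph, and the number of the user's projected-graph neighbors who already hold the item) are illustrative assumptions, not the feature set of the thesis, and `pair_features` is a hypothetical name.

```python
def pair_features(user_items, user, item):
    """Describe a candidate (user, item) link by simple topological measures.

    `user_items` maps each user to the set of items they are linked to.
    """
    items_of_user = user_items.get(user, set())
    users_of_item = {u for u, items in user_items.items() if item in items}
    # Neighbors of `user` in the projected user graph:
    # users sharing at least one item with `user`.
    projected_neighbors = {
        v
        for v, items in user_items.items()
        if v != user and items & items_of_user
    }
    return {
        # Bipartite measure: product of the two endpoint degrees.
        "preferential_attachment": len(items_of_user) * len(users_of_item),
        # Projected-graph measure: neighbors of the user already linked to the item.
        "projected_common": len(projected_neighbors & users_of_item),
    }


# Hypothetical bipartite interaction data.
graph = {
    "u1": {"a", "b"},
    "u2": {"a", "c"},
    "u3": {"b", "c"},
    "u4": {"d"},
}
features = pair_features(graph, "u1", "c")
```

Feature dictionaries of this kind, computed for many candidate pairs and labeled by whether the link later appears, are what a supervised classifier would be trained on. Since absent links vastly outnumber new ones, the resulting labels are heavily skewed, which is the class-imbalance problem the abstract mentions.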
Draidi, Fady. "Recommandation Pair-à-Pair pour Communautés en Ligne à Grande Echelle." Phd thesis, Université Montpellier II - Sciences et Techniques du Languedoc, 2012. http://tel.archives-ouvertes.fr/tel-00766963.
Moin, Afshin. "Les Techniques De Recommandation Et De Visualisation Pour Les Données A Une Grande Echelle." Phd thesis, Université Rennes 1, 2012. http://tel.archives-ouvertes.fr/tel-00724121.
Moin, Afshin. "Les techniques de recommandation et de visualisation pour les données à une grande échelle." Rennes 1, 2012. https://tel.archives-ouvertes.fr/tel-00724121.
Information technology has developed rapidly over the last decade. On the one hand, the processing and storage capacity of digital devices is constantly increasing thanks to advances in manufacturing methods. On the other hand, interaction between these powerful devices has been made possible by networking technology. A natural consequence of this progress is that the volume of data generated in different applications has grown at an unprecedented rate. We now face new challenges in efficiently processing and representing the huge mass of data at our disposal. This thesis is centered on two axes: recommending relevant content and visualizing it properly. The role of recommender systems is to help users in the decision-making process to find items with relevant content and satisfactory quality within the vast set of possibilities existing on the Web. In turn, the proper representation of processed data is central both to increasing the usefulness of the data for the end user and to designing efficient analysis tools. In this thesis, the main approaches to recommender systems and the most important techniques for visualizing data as graphs are discussed. Furthermore, it is shown how some of the same techniques applied to recommender systems can be modified to meet visualization requirements.
Ettaleb, Mohamed. "Approche de recommandation à base de fouille de données et de graphes étiquetés multi-couches : contributions à la RI sociale." Electronic Thesis or Diss., Aix-Marseille, 2020. http://www.theses.fr/2020AIXM0588.
In general, the purpose of a recommendation system is to assist users in selecting relevant elements from a wide range of elements. Given the explosion in the number of academic publications available online (books, articles, etc.), providing a personalized recommendation service is becoming a necessity. In addition, automatic book recommendation based on a query is an emerging theme with many open scientific challenges. It combines several issues related to information retrieval and data mining for assessing how appropriate it is to recommend a book. This assessment must take into account the query, but also the user profile (reading history, interests, ratings and comments associated with previous readings) and the entire collection to which the document belongs. Two main avenues are explored in this work to deal with the problem of automatic book recommendation: (1) identification of the user's intentions from a query; (2) recommendation of relevant books according to the user's needs.
Benkoussas, Chahinez. "Approches non supervisées pour la recommandation de lectures et la mise en relation automatique de contenus au sein d'une bibliothèque numérique." Thesis, Aix-Marseille, 2016. http://www.theses.fr/2016AIXM4379/document.
This thesis deals with the field of information retrieval and reading recommendation. Its objectives are: (1) the creation of new approaches to document retrieval and recommendation using techniques for combining results, aggregating social data and reformulating queries; (2) the creation of a recommendation approach using information retrieval methods and graph theory. Two collections of documents were used. The first is a collection provided by CLEF (Social Book Search, SBS), and the second comes from OpenEdition.org (Revues.org), a platform of electronic resources in the Humanities and Social Sciences. The modelling of the documents of each collection is based on two types of relations: (1) for the first collection (SBS), documents are connected by a similarity calculated by Amazon, which is based on several factors (user purchases, comments, votes, products bought together, etc.); (2) for the second collection (OpenEdition), documents are connected by citation relations extracted from bibliographical references. We show that the proposed approaches improve retrieval and recommendation performance in most cases. The manuscript is structured in two parts. The first part, "state of the art", includes a general introduction and a state of the art of information retrieval and recommender systems. The second part, "contributions", includes a chapter on the detection of book reviews in Revues.org, a chapter on the IR methods used on complex queries written in natural language, and a last chapter that presents the proposed graph-based recommendation approach.
Dos, Santos Ludovic. "Representation learning for relational data." Thesis, Paris 6, 2017. http://www.theses.fr/2017PA066480/document.
The increasing use of social and sensor networks generates a large quantity of data that can be represented as complex graphs. Many tasks, from information analysis to prediction and retrieval, can be imagined on such data, where the relations between graph nodes should be informative. In this thesis, we propose different models for three different tasks: graph node classification, relational time series forecasting, and collaborative filtering. All the proposed models use the representation learning framework in its deterministic or Gaussian variant. First, we propose two algorithms for the heterogeneous graph labeling task, one using deterministic representations and the other Gaussian representations. Contrary to other state-of-the-art models, our solution is able to learn edge weights while simultaneously learning the representations and the classifiers. Second, we propose an algorithm for relational time series forecasting where the observations are correlated not only inside each series but also across the different series. We use Gaussian representations in this contribution. This was an opportunity to see in which ways using Gaussian representations instead of deterministic ones is profitable. Finally, we apply the Gaussian representation learning approach to the collaborative filtering task. This is preliminary work to see whether the properties of Gaussian representations found on the two previous tasks also hold for the ranking task. The goal of this work is then to generalize the approach to more relational data, and not only bipartite graphs between users and items.
Lisena, Pasquale. "Knowledge-based music recommendation : models, algorithms and exploratory search." Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS614.
Representing the information about music is a complex activity that involves different sub-tasks. This thesis manuscript mostly focuses on classical music, researching how to represent and exploit its information. The main goal is the investigation of strategies of knowledge representation and discovery applied to classical music, involving subjects such as Knowledge-Base population, metadata prediction, and recommender systems. We propose a complete workflow for the management of music metadata using Semantic Web technologies. We introduce a specialised ontology and a set of controlled vocabularies for the different concepts specific to music. Then, we present an approach for converting data, in order to go beyond the librarian practice currently in use, relying on mapping rules and interlinking with controlled vocabularies. Finally, we show how these data can be exploited. In particular, we study approaches based on embeddings computed on structured metadata, titles, and symbolic music for ranking and recommending music. Several demo applications have been realised for testing the previous approaches and resources.
Dos, Santos Ludovic. "Representation learning for relational data." Electronic Thesis or Diss., Paris 6, 2017. http://www.theses.fr/2017PA066480.
Boutalbi, Rafika. "Model-based tensor (co)-clustering and applications." Electronic Thesis or Diss., Université Paris Cité, 2020. https://wo.app.u-paris.fr/cgi-bin/WebObjects/TheseWeb.woa/wa/show?t=7172&f=55867.
Clustering, which seeks to group together similar data points according to a given criterion, is an important unsupervised learning technique for dealing with large-scale data. In particular, given a data matrix where rows represent objects and columns represent features, clustering aims to partition only one dimension of the matrix at a time, by clustering either objects or features. Although successfully applied in several application domains, clustering techniques are often challenged by certain characteristics exhibited by some datasets, such as high dimensionality and sparsity. For such data, co-clustering techniques, which allow the simultaneous clustering of the rows and columns of a data matrix, have proven to be more beneficial. In particular, co-clustering techniques exploit the inherent duality between the set of objects and the set of features, which makes them more effective even if we are interested in clustering only one dimension of the data matrix. In addition, co-clustering turns out to be more efficient, since compressed matrices are used at each step of the process instead of the whole matrix, as in traditional clustering. Although co-clustering approaches have been successfully applied in a variety of applications, existing approaches are specially tailored to datasets represented by two-way tables. However, in several real-world applications, two dimensions are not sufficient to represent the dataset. For example, in the article clustering problem, several kinds of information linked to the articles can be collected, such as common words, co-authors and citations, which naturally leads to a tensorial representation. Intuitively, leveraging all this information should lead to better clustering quality. In particular, two articles that share a large set of words, authors and citations are very likely to be similar. Despite the great interest of tensor co-clustering models, research in this context is extremely limited and relies, for the most part, on tensor factorization methods. Inspired by the famous statement made by Jean Paul Benzécri, "The model must follow the data and not vice versa", we have chosen in this thesis to rely on appropriate mixture models. More explicitly, we propose several new co-clustering models that are specially tailored to tensorial representations and robust to data sparsity. Our contributions can be summarized as follows. First, we propose to extend the LBM (Latent Block Model) formalism to take tensorial structures into account. More specifically, we present Tensor LBM (TLBM), a powerful tensor co-clustering model that we successfully applied to diverse kinds of data. Moreover, we highlight that the derived algorithm, VEM-T, reveals the most meaningful co-clusters from tensor data. Second, we develop a novel Sparse TLBM that takes sparsity into account. We extend its use to the management of multiple graphs (or multi-view graphs), leading to implicit consensus clustering of multiple graphs. As a last contribution of this thesis, we propose a new co-clusterwise method which integrates co-clustering into a supervised learning framework. These contributions have been successfully evaluated on tensorial data from various fields, ranging from recommendation systems, clustering of hyperspectral images and categorization of documents to waste management optimization. They also allow us to envisage interesting and immediate future research avenues, for instance the extension of the proposed models to tri-clustering and multivariate time series.
Todeschini, Adrien. "Probabilistic and Bayesian nonparametric approaches for recommender systems and networks." Thesis, Bordeaux, 2016. http://www.theses.fr/2016BORD0237/document.
We propose two novel approaches for recommender systems and networks. In the first part, we give an overview of recommender systems and concentrate on low-rank approaches to matrix completion. Building on a probabilistic approach, we propose novel penalty functions on the singular values of the low-rank matrix. By exploiting a mixture model representation of this penalty, we show that a suitably chosen set of latent variables makes it possible to derive an expectation-maximization algorithm that obtains a maximum a posteriori estimate of the completed low-rank matrix. The resulting algorithm is an iterative soft-thresholded algorithm which iteratively adapts the shrinkage coefficients associated with the singular values. The algorithm is simple to implement and can scale to large matrices. We provide numerical comparisons between our approach and recent alternatives, showing the interest of the proposed approach for low-rank matrix completion. In the second part, we first introduce some background on Bayesian nonparametrics, and in particular on completely random measures (CRMs) and their multivariate extension, the compound CRMs. We then propose a novel statistical model for sparse networks with overlapping community structure. The model is based on representing the graph as an exchangeable point process, and naturally generalizes existing probabilistic models with overlapping block structure to the sparse regime. Our construction builds on vectors of CRMs and has interpretable parameters, each node being assigned a vector representing its level of affiliation to some latent communities. We develop methods for simulating this class of random graphs, as well as for performing posterior inference. We show that the proposed approach can recover interpretable structure from two real-world networks and can handle graphs with thousands of nodes and tens of thousands of edges.
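The shrinkage step at the heart of a soft-thresholded matrix completion iteration can be shown in isolation. The sketch below applies a per-value threshold to a list of singular values; how the thresholds are adapted at each iteration is specific to the thesis and is not reproduced here, so the threshold values and the name `soft_threshold` are assumptions.

```python
def soft_threshold(singular_values, thresholds):
    """Shrink each singular value toward zero by its own threshold,
    clipping at zero so that small values vanish (which lowers the rank)."""
    return [max(s - t, 0.0) for s, t in zip(singular_values, thresholds)]


# Hypothetical singular values with a uniform threshold of 1.0.
shrunk = soft_threshold([5.0, 2.0, 0.5], [1.0, 1.0, 1.0])
```

With a single shared threshold this corresponds to the usual nuclear-norm shrinkage; allowing a different threshold per singular value is what lets an algorithm penalize large and small singular values differently.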
Salah, Aghiles. "Von Mises-Fisher based (co-)clustering for high-dimensional sparse data : application to text and collaborative filtering data." Thesis, Sorbonne Paris Cité, 2016. http://www.theses.fr/2016USPCB093/document.
Cluster analysis, or clustering, which aims to group together similar objects, is a very powerful unsupervised learning technique. With the growing amount of available data, clustering is gaining importance in various areas of data science for tasks such as automatic summarization, dimensionality reduction, visualization, outlier detection, speeding up search engines, and the organization of huge data sets. Existing clustering approaches are, however, severely challenged by the high dimensionality and extreme sparsity of the data sets arising in some current areas of interest, such as Collaborative Filtering (CF) and text mining. Such data often consist of thousands of features, with more than 95% of the entries equal to zero. In addition to being high dimensional and sparse, the data sets encountered in these domains are also directional in nature. Several previous studies have empirically demonstrated that directional measures, which compare objects by the angle between them (the cosine similarity, for example), are substantially superior to other measures such as Euclidean distortions for clustering text documents or assessing the similarities between users/items in CF. This suggests that in such contexts only the direction of a data vector (e.g., a text document) is relevant, not its magnitude. It is worth noting that the cosine similarity is exactly the scalar product between unit-length data vectors, i.e., L2-normalized vectors. Thus, from a probabilistic perspective, using the cosine similarity is equivalent to assuming that the data are directional data distributed on the surface of a unit hypersphere.
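The identity between cosine similarity and the scalar product of unit-length vectors can be checked numerically. The sketch below uses made-up term-count vectors (`doc_a`, `doc_b`, `doc_c` are hypothetical, not data from the thesis) and also illustrates that magnitude does not affect the result.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def l2_normalize(v):
    """Scale v to unit Euclidean length."""
    return v / np.linalg.norm(v)

# Hypothetical term-count vectors for three documents.
doc_a = np.array([3.0, 0.0, 4.0, 0.0])
doc_b = np.array([6.0, 0.0, 8.0, 0.0])  # same direction as doc_a, twice the magnitude
doc_c = np.array([0.0, 5.0, 0.0, 0.0])  # shares no terms with doc_a

cos_ab = cosine_similarity(doc_a, doc_b)
dot_ab = float(l2_normalize(doc_a) @ l2_normalize(doc_b))
cos_ac = cosine_similarity(doc_a, doc_c)
```

Here `cos_ab` and `dot_ab` agree up to floating-point error, and `doc_a` and `doc_b` are maximally similar even though their lengths differ.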
Despite the substantial empirical evidence that certain high-dimensional sparse data sets, such as those encountered in the above domains, are better modeled as directional data, most existing models in text mining and CF are based on popular assumptions, such as Gaussian, multinomial, or Bernoulli distributions, that are inadequate for L2-normalized data. In this thesis, we focus on the two challenging tasks of text document clustering and item recommendation, which are still attracting a lot of attention in the domains of text mining and CF, respectively. To address the above limitations, we propose a suite of new models and algorithms that rely on the von Mises-Fisher (vMF) assumption, which arises naturally for directional data lying on a unit hypersphere.
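For reference, the standard von Mises-Fisher density on the unit hypersphere in $d$ dimensions is given below. The symbols are the usual textbook notation rather than notation taken from the abstract: $\mu$ is the unit-length mean direction, $\kappa \ge 0$ is the concentration parameter, and $I_\nu$ is the modified Bessel function of the first kind of order $\nu$.

```latex
f(x \mid \mu, \kappa) = C_d(\kappa)\, \exp\!\left(\kappa\, \mu^{\top} x\right),
\qquad
C_d(\kappa) = \frac{\kappa^{d/2 - 1}}{(2\pi)^{d/2}\, I_{d/2 - 1}(\kappa)},
\qquad \lVert x \rVert = \lVert \mu \rVert = 1 .
```

Because $\mu^{\top} x$ is the cosine similarity between $x$ and $\mu$, the log-density is, up to a constant, proportional to that cosine, which is why the vMF assumption pairs naturally with L2-normalized data.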
Salah, Aghiles. "Von Mises-Fisher based (co-)clustering for high-dimensional sparse data : application to text and collaborative filtering data." Electronic Thesis or Diss., Sorbonne Paris Cité, 2016. https://wo.app.u-paris.fr/cgi-bin/WebObjects/TheseWeb.woa/wa/show?t=1858&f=11557.