Doctoral dissertations on the topic "Big Data et algorithmes"
Create an accurate reference in APA, MLA, Chicago, Harvard, and many other styles
Consult the 50 best doctoral dissertations on the topic "Big Data et algorithmes".
An "Add to bibliography" button is available next to each work in the list. Use it and we will automatically create a bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the scholarly publication as a ".pdf" file and read its abstract online whenever these details are available in the work's metadata.
Browse doctoral dissertations from a wide variety of disciplines and compile appropriate bibliographies.
Ho, Zhen Wai Olivier. "Contributions aux algorithmes stochastiques pour le Big Data et à la théorie des valeurs extrèmes multivariés". Thesis, Bourgogne Franche-Comté, 2018. http://www.theses.fr/2018UBFCD025/document.
Full text source
This thesis is divided into two parts. The first part studies models for multivariate extremes. We give a method to construct multivariate regularly varying random vectors. The method is based on a multivariate extension of Breiman's lemma, which states that a product $RZ$ of a non-negative regularly varying random variable $R$ and a non-negative, sufficiently integrable random variable $Z$ is also regularly varying. Replacing $Z$ with a random vector $\mathbf{Z}$, we show that the product $R\mathbf{Z}$ is regularly varying and we give a characterisation of its limit measure. Then, we show that taking specific distributions for $\mathbf{Z}$, we obtain classical max-stable models. We extend our result to non-standard regular variations. Next, we show that the Pareto model associated with the Hüsler-Reiss max-stable model forms a full exponential family. We show some properties of this model and we give an algorithm for exact simulation. We study the properties of the maximum likelihood estimator. Then, we extend our model to non-standard regular variations. To finish the first part, we propose a numerical study of the Hüsler-Reiss Pareto model. In the second part, we start by giving a lower bound on the smallest singular value of a matrix perturbed by appending a column. Then, we give a greedy algorithm for feature selection and we illustrate this algorithm on a time-series dataset. Secondly, we show that an incoherent matrix satisfies a weakened version of the NSP (null space property). Thirdly, we study the problem of column selection of $X\in\mathbb{R}^{n\times p}$ given a coherence threshold $\mu$, that is, we want the largest submatrix satisfying some coherence property. We formulate the problem as a linear program with a quadratic constraint on $\{0,1\}^p$. Then, we consider a relaxation on the sphere and we bound the relaxation error. Finally, we study projected stochastic gradient descent for online PCA. We show that, in expectation, the algorithm converges to a leading eigenvector and we suggest an algorithm for step-size selection. We illustrate this algorithm with a numerical experiment.
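As an aside for readers, the last contribution mentioned above (projected stochastic gradient descent for online PCA) can be illustrated with a minimal sketch: each sample pushes the iterate along a stochastic gradient of the Rayleigh quotient and is then projected back onto the unit sphere. The function name, the fixed step size and the toy data below are our own assumptions, not the estimator or the step-size selection rule studied in the thesis.

```python
import numpy as np

def online_pca_psgd(sample_stream, dim, step_size=0.01, seed=0):
    """Projected stochastic gradient ascent for a leading eigenvector.

    Each step moves the iterate along x (x . w), a stochastic gradient of the
    Rayleigh quotient, then projects back onto the unit sphere.
    """
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)
    w /= np.linalg.norm(w)
    for x in sample_stream:
        w += step_size * x * (x @ w)   # stochastic gradient step
        w /= np.linalg.norm(w)         # projection onto the unit sphere
    return w

# Toy check: data with a dominant variance direction along the first axis.
rng = np.random.default_rng(1)
cov = np.diag([5.0, 1.0, 0.5])
stream = rng.multivariate_normal(np.zeros(3), cov, size=5000)
print(np.round(np.abs(online_pca_psgd(stream, dim=3)), 2))  # close to [1, 0, 0]
```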
Bach, Tran. "Algorithmes avancés de DCA pour certaines classes de problèmes en apprentissage automatique du Big Data". Electronic Thesis or Diss., Université de Lorraine, 2019. http://www.theses.fr/2019LORR0255.
Full text source
Big Data has gradually become essential and ubiquitous in all aspects of life. There is therefore an urgent need to develop innovative and efficient techniques to deal with the rapid growth in the volume of data. This dissertation considers the following problems in Big Data: group variable selection in multi-class logistic regression, dimensionality reduction by t-SNE (t-distributed Stochastic Neighbor Embedding), and deep clustering. We develop advanced DCAs (Difference of Convex functions Algorithms) for these problems, based on DC programming and DCA, powerful tools for non-smooth non-convex optimization problems. Firstly, we consider the problem of group variable selection in multi-class logistic regression. We tackle this problem using recently advanced DCAs: Stochastic DCA and DCA-Like. Specifically, Stochastic DCA addresses the minimization of large sums of DC functions and only requires a subset of the DC functions at each iteration. DCA-Like relaxes the convexity condition on the second DC component while guaranteeing convergence. Accelerated DCA-Like incorporates Nesterov's acceleration technique into DCA-Like to improve its performance. Numerical experiments on benchmark high-dimensional datasets show the effectiveness of the proposed algorithms in terms of running time and solution quality. The second part studies the t-SNE problem, an effective non-linear dimensionality reduction technique. Motivated by the novelty of DCA-Like and Accelerated DCA-Like, we develop two algorithms for the t-SNE problem. The superiority of the proposed algorithms over existing methods is illustrated through numerical experiments on visualization applications. Finally, the third part considers the problem of deep clustering. In the first application, we propose two algorithms based on DCA to combine t-SNE with MSSC (Minimum Sum-of-Squares Clustering), following two approaches: "tandem analysis" and joint clustering. The second application considers clustering with an auto-encoder (a well-known type of neural network). We propose an extension of a class of joint-clustering algorithms to overcome the scaling problem and apply it to a specific case of joint clustering with MSSC. Numerical experiments on several real-world datasets show the effectiveness of our methods in terms of speed and clustering quality, compared to state-of-the-art methods.
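For orientation, the generic DCA iteration behind all the variants above (linearize the second DC component, then minimize the resulting convex surrogate) can be sketched on a toy one-dimensional decomposition; the function f(x) = x² − |x| and everything else below are our own illustrative choices, not the specialized stochastic or accelerated DCAs developed in the thesis.

```python
def dca_toy(x0, iters=20):
    """Generic DCA iteration on f(x) = g(x) - h(x) with g(x) = x^2 and h(x) = |x|.

    Step 1: pick y_k in the subdifferential of h at x_k (here sign(x_k)).
    Step 2: minimize the convex surrogate g(x) - y_k * x, which for g(x) = x^2
            has the closed form x_{k+1} = y_k / 2.
    """
    x = x0
    for _ in range(iters):
        y = 0.0 if x == 0 else (1.0 if x > 0 else -1.0)  # subgradient of |x|
        x = y / 2.0                                       # argmin of x^2 - y*x
    return x

print(dca_toy(0.3))   # converges to 0.5, a local minimizer of x^2 - |x|
print(dca_toy(-2.0))  # converges to -0.5
```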
Chuchuk, Olga. "Optimisation de l'accès aux données au CERN et dans la Grille de calcul mondiale pour le LHC (WLCG)". Electronic Thesis or Diss., Université Côte d'Azur, 2024. http://www.theses.fr/2024COAZ4005.
Full text source
The Worldwide LHC Computing Grid (WLCG) offers an extensive distributed computing infrastructure dedicated to the scientific community involved with CERN's Large Hadron Collider (LHC). With storage totalling roughly an exabyte, the WLCG addresses the data processing and storage requirements of thousands of international scientists. As the High-Luminosity LHC phase approaches, the volume of data to be analysed will increase steeply, outpacing the gains expected from advances in storage technology. Therefore, new approaches to effective data access and management, such as caches, become essential. This thesis delves into a comprehensive exploration of storage access within the WLCG, aiming to enhance the aggregate science throughput while limiting the cost. Central to this research is the analysis of real file-access logs sourced from the WLCG monitoring system, highlighting genuine usage patterns. In a scientific setting, caching has profound implications. Unlike more commercial applications such as video streaming, scientific data caches deal with varying file sizes, from a mere few bytes to multiple terabytes. Moreover, the inherent logical associations between files considerably influence user access patterns. Traditional caching research has predominantly revolved around uniform file sizes and independent reference models; scientific workloads, by contrast, exhibit strong variance in file sizes, and logical interconnections between files significantly shape access patterns. My investigations show how the LHC's hierarchical data organization, particularly its compartmentalization into datasets, impacts request patterns. Recognizing the opportunity, I introduce innovative caching policies that emphasize dataset-specific knowledge and compare their effectiveness with traditional file-centric strategies. Furthermore, my findings underscore the "delayed hits" phenomenon triggered by limited connectivity between computing and storage sites, shedding light on its potential repercussions for caching efficiency. Acknowledging the long-standing challenge of predicting data popularity in the High Energy Physics (HEP) community, especially with the upcoming HL-LHC era's storage conundrums, my research integrates Machine Learning (ML) tools. Specifically, I employ the Random Forest algorithm, known for its suitability to Big Data. By harnessing ML to predict future file-reuse patterns, I present a dual-stage method to inform cache eviction policies. This strategy combines the power of predictive analytics and established cache-eviction algorithms, thereby devising a more resilient caching system for the WLCG. In conclusion, this research underscores the significance of robust storage services, suggesting a direction towards stateless caches for smaller sites to alleviate complex storage management requirements and open the path to an additional level in the storage hierarchy. Through this thesis, I aim to navigate the challenges and complexities of data storage and retrieval, crafting more efficient methods that resonate with the evolving needs of the WLCG and its global community.
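As a point of reference only, the kind of file-centric baseline that dataset-aware policies are usually compared against can be sketched as a byte-bounded LRU cache; the class, capacity and toy access trace below are our own assumptions, not the eviction policies or the ML-informed method proposed in the thesis.

```python
from collections import OrderedDict

class ByteLRUCache:
    """Least-recently-used cache bounded by total bytes rather than entry count."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()  # file_id -> size_bytes

    def access(self, file_id, size_bytes):
        """Return True on a hit; on a miss, admit the file and evict LRU files."""
        if file_id in self.entries:
            self.entries.move_to_end(file_id)  # mark as most recently used
            return True
        # Miss: evict least-recently-used files until the new one fits.
        while self.used + size_bytes > self.capacity and self.entries:
            _, evicted_size = self.entries.popitem(last=False)
            self.used -= evicted_size
        if size_bytes <= self.capacity:
            self.entries[file_id] = size_bytes
            self.used += size_bytes
        return False

cache = ByteLRUCache(capacity_bytes=10)
hits = [cache.access(f, s) for f, s in [("a", 4), ("b", 4), ("a", 4), ("c", 4), ("b", 4)]]
print(hits)  # [False, False, True, False, False] -- "b" was evicted to admit "c"
```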
Défossez, Gautier. "Le système d'information multi-sources du Registre général des cancers de Poitou-Charentes. Conception, développement et applications à l'ère des données massives en santé". Thesis, Poitiers, 2021. http://theses.univ-poitiers.fr/64594/2021-Defossez-Gautier-These.
Full text source
Population-based cancer registries (PBCRs) are internationally the tool of choice for providing a comprehensive (unbiased) picture of the weight, incidence and severity of cancer in the general population. Their work in classifying and coding diagnoses according to international rules gives the final data a specific quality and comparability in time and space, thus building a decisive knowledge base for describing the evolution of cancers and their management in an uncontrolled environment. Cancer registration is based on a thorough investigative process, whose complexity is largely related to the ability to access all the relevant data concerning the same individual and to gather them efficiently. Created in 2007, the General Cancer Registry of Poitou-Charentes (RGCPC) is a recent-generation cancer registry, started at a time conducive to rethinking how to optimize the registration process. Driven by the computerization of medical data and the increasing interoperability of information systems, the RGCPC has experimented over 10 years with a multi-source information system combining innovative methods of information processing and representation, based on the reuse of standardized data usually produced for other purposes. In a first section, this work presents the founding principles and the implementation of a system capable of gathering large amounts of highly qualified and structured data, with semantic alignment so as to lend itself to algorithmic approaches. Data are collected on a multiannual basis from 110 partners representing seven data sources (clinical, biological and medical administrative data). Two algorithms assist the cancer registrar by dematerializing the manual tasks usually carried out prior to tumor registration. A first algorithm automatically generates the tumors and their various components (publication), and a second represents the care pathway of each individual as an ordered sequence of time-stamped events that can be accessed within a secure interface (publication). Supervised machine learning techniques are tested to work around the possible lack of codification of pathology reports (publication). The second section focuses on the wide field of research and evaluation opened up by the availability of this integrated information system. Data linkage with other datasets was tested, within the framework of regulatory authorizations, to enhance the contextualization and knowledge of care pathways, and thus to support the strategic role of PBCRs in real-life evaluation of care practices and health services research (proof of concept): screening, molecular diagnosis, cancer treatment, pharmacoepidemiology (four main publications). Data from the RGCPC were linked with those from the REIN registry (chronic end-stage renal failure) as a use case for experimenting with a prototype platform dedicated to the collaborative sharing of massive health data (publication). The last section of this work proposes an open discussion on the relevance of the proposed solutions with respect to the requirements of quality, cost and transferability, and then sets out the prospects and expected benefits in the field of surveillance, evaluation and research in the era of big data.
Brahem, Mariem. "Optimisation de requêtes spatiales et serveur de données distribué - Application à la gestion de masses de données en astronomie". Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLV009/document.
Full text source
The big scientific data generated by modern observation telescopes raise recurring performance problems, in spite of the advances in distributed data management systems. The main reasons are the complexity of the systems and the difficulty of adapting access methods to the data. This thesis proposes new physical and logical optimizations of the execution plans of astronomical queries using transformation rules. These methods are integrated into ASTROIDE, a distributed system for large-scale astronomical data processing. ASTROIDE achieves scalability and efficiency by combining the benefits of distributed processing using Spark with the relevance of an astronomical query optimizer. It supports data access using the commonly used query language ADQL. It implements astronomical query algorithms (cone search, kNN search, cross-match, and kNN join) tailored to the proposed physical data organization. Indeed, ASTROIDE offers a data partitioning technique that allows efficient processing of these queries by ensuring load balancing and eliminating irrelevant partitions. This partitioning uses an indexing technique adapted to astronomical data in order to reduce query processing time.
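For readers unfamiliar with the query types listed above, a cone search simply keeps the sources whose angular separation from a target position is below a given radius. The sketch below applies the standard great-circle formula to a small in-memory catalogue; it is only a scalar illustration of the query semantics, not ASTROIDE's partition-pruning, Spark-based implementation.

```python
import numpy as np

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation (degrees) between sky positions given in degrees."""
    ra1, dec1, ra2, dec2 = map(np.radians, (ra1, dec1, ra2, dec2))
    cos_sep = (np.sin(dec1) * np.sin(dec2)
               + np.cos(dec1) * np.cos(dec2) * np.cos(ra1 - ra2))
    return np.degrees(np.arccos(np.clip(cos_sep, -1.0, 1.0)))

def cone_search(catalog_ra, catalog_dec, center_ra, center_dec, radius_deg):
    """Boolean mask of catalogue sources falling inside the cone."""
    sep = angular_separation_deg(catalog_ra, catalog_dec, center_ra, center_dec)
    return sep <= radius_deg

ra = np.array([10.0, 10.2, 50.0])
dec = np.array([-5.0, -5.1, 20.0])
print(cone_search(ra, dec, center_ra=10.1, center_dec=-5.0, radius_deg=0.5))
# [ True  True False]
```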
Jlassi, Aymen. "Optimisation de la gestion des ressources sur une plate-forme informatique du type Big Data basée sur le logiciel Hadoop". Thesis, Tours, 2017. http://www.theses.fr/2017TOUR4042.
Pełny tekst źródła"Cyres-Group" is working to improve the response time of his clusters Hadoop and optimize how the resources are exploited in its data center. That is, the goals are to finish work as soon as possible and reduce the latency of each user of the system. Firstly, we decide to work on the scheduling problem in the Hadoop system. We consider the problem as the problem of scheduling a set of jobs on a homogeneous platform. Secondly, we decide to propose tools, which are able to provide more flexibility during the resources management in the data center and ensure the integration of Hadoop in Cloud infrastructures without unacceptable loss of performance. Next, the second level focuses on the review of literature. We conclude that, existing works use simple mathematical models that do not reflect the real problem. They ignore the main characteristics of Hadoop software. Hence, we propose a new model ; we take into account the most important aspects like resources management and the relations of precedence among tasks and the data management and transfer. Thus, we model the problem. We begin with a simplistic model and we consider the minimisation of the Cmax as the objective function. We solve the model with mathematical solver CPLEX and we compute a lower bound. We propose the heuristic "LocFirst" that aims to minimize the Cmax. In the third level, we consider a more realistic modelling of the scheduling problem. We aim to minimize the weighted sum of the following objectives : the weighted flow time ( ∑ wjCj) and the makespan (Cmax). We compute a lower bound and we propose two heuristics to resolve the problem
Saffarian, Azadeh. "Algorithmes de prédiction et de recherche de multi-structures d'ARN". Phd thesis, Université des Sciences et Technologie de Lille - Lille I, 2011. http://tel.archives-ouvertes.fr/tel-00832700.
Full text source
Phan, Duy-Hung. "Algorithmes d'aggrégation pour applications Big Data". Electronic Thesis or Diss., Paris, ENST, 2016. http://www.theses.fr/2016ENST0043.
Full text source
Traditional databases face problems of scalability and efficiency when dealing with vast amounts of big data. Thus, modern data management systems that scale to thousands of nodes, like Apache Hadoop and Spark, have emerged and become the de facto platforms for processing data at massive scales. In such systems, many data processing optimizations that were well studied in the database domain have become futile because of the novel architectures and programming models. In this context, this dissertation set out to optimize one of the most predominant operations in data processing for such systems: data aggregation. Our main contributions are logical and physical optimizations for large-scale data aggregation, including several algorithms and techniques. These optimizations are so intimately related that, without one or the other, the data aggregation optimization problem would not be solved entirely. Moreover, we integrated these optimizations in our multi-query optimization engine, which is totally transparent to users. The engine and the logical and physical optimizations proposed in this dissertation form a complete package that is runnable and ready to answer data aggregation queries at massive scales. We evaluated our optimizations both theoretically and experimentally. The theoretical analyses showed that our algorithms and techniques are much more scalable and efficient than prior work. The experimental results, obtained on a real cluster with synthetic and real datasets, confirmed our analyses, showed a significant performance boost and revealed various aspects of our work. Last but not least, our work is published as open source for public use and study.
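The principle that makes aggregation scale, computing small mergeable partial aggregates per partition and combining them instead of shuffling raw data, can be sketched with an average built from (sum, count) pairs; this is a generic illustration of the idea, not the multi-query optimization engine described in the thesis.

```python
from functools import reduce

def partial_avg(partition):
    """Local aggregate for one partition: a mergeable (sum, count) pair."""
    values = list(partition)
    return (sum(values), len(values))

def merge(a, b):
    """Combine two partial aggregates; associative and commutative."""
    return (a[0] + b[0], a[1] + b[1])

partitions = [[1.0, 2.0, 3.0], [10.0], [4.0, 6.0]]   # data split across nodes
total, count = reduce(merge, map(partial_avg, partitions))
print(total / count)  # 4.333... == mean of all values, without moving raw data
```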
Malekian, Hajar. "La libre circulation et la protection des données à caractère personnel sur Internet". Thesis, Paris 2, 2017. http://www.theses.fr/2017PA020050.
Full text source
Free flow of data and personal data protection on the Internet. The protection of personal data is an autonomous fundamental right within the European Union (Article 8 of the Charter of Fundamental Rights of the European Union). Moreover, the free flow of personal data and the free movement of information society services, in particular online platforms, are essential for the development of the digital single market in the European Union. The balance between the free movement of data and personal data protection is the subject of the European legal framework. However, the main challenge remains to strike the right balance between effective personal data protection and the free flow of this data and of information society services. Striking this balance is not an easy task, especially in the age of online platforms, Big Data and processing algorithms such as Machine Learning and Deep Learning.
Kopylova, Evguenia. "Algorithmes bio-informatiques pour l'analyse de données de séquençage à haut débit". Phd thesis, Université des Sciences et Technologie de Lille - Lille I, 2013. http://tel.archives-ouvertes.fr/tel-00919185.
Full text source
Bulteau, Laurent. "Ordres et désordres dans l'algorithmique du génome". Phd thesis, Université de Nantes, 2013. http://tel.archives-ouvertes.fr/tel-00906929.
Pełny tekst źródłaDemuth, Stanislas. "Computational approach for precision medicine in multiple sclerosis". Electronic Thesis or Diss., Strasbourg, 2024. http://www.theses.fr/2024STRAJ062.
Full text source
This PhD work explored the secondary use of clinical research data in multiple sclerosis (MS) and their integration with modern information technology to support neurologists' therapeutic decisions. Tabular data from 31,786 patients with MS, drawn from 11 industrial randomized controlled trials (RCTs) and two cohorts of the French MS registry, were integrated into a homemade cloud-based precision medicine platform. The resulting clinical decision support system relied on interactive data visualization. It showed a discriminatory capacity similar to machine learning but better explainability and calibration in a held-out real-world population. Dedicated training of neurologists appeared to be required. Regulatory barriers were addressed by generating virtual patients using a privacy-by-design method; these achieved sufficient privacy and clinical utility to serve as a proxy for the reference data. These translational efforts demonstrated the clinical utility of several data engineering processes for developing a new paradigm of precision medicine in MS.
Duarte, Kevin. "Aide à la décision médicale et télémédecine dans le suivi de l’insuffisance cardiaque". Thesis, Université de Lorraine, 2018. http://www.theses.fr/2018LORR0283/document.
Full text source
This thesis is part of the "Handle your heart" project, aimed at developing a drug prescription assistance device for heart failure patients. In a first part, a study was conducted to highlight the prognostic value of an estimation of plasma volume, or of its variations, for predicting major short-term cardiovascular events. Two classification rules were used, logistic regression and linear discriminant analysis, each preceded by a stepwise variable selection. Three indices were used to measure the improvement in discrimination ability obtained by adding the biomarker of interest. In a second part, in order to identify patients at short-term risk of dying or being hospitalized for progression of heart failure, a short-term event risk score was constructed by an ensemble method combining two classification rules (logistic regression and linear discriminant analysis of mixed data), bootstrap samples, and random selection of predictors. We define an event risk measure by an odds ratio and a measure of the importance of variables and groups of variables using standardized coefficients, and we show a property of linear discriminant analysis of mixed data. This methodology for constructing a risk score can be implemented as part of online learning, using stochastic gradient algorithms to update the predictors online. We address the problem of sequential multidimensional linear regression, particularly in the case of a data stream, using a stochastic approximation process. To avoid the phenomenon of numerical explosion that can be encountered, and to reduce computing time so as to take into account as much of the arriving data as possible, we propose to use a process with online standardized data instead of raw data, and to use several observations per step, or all observations up to the current step. We define three processes and study their almost sure convergence: one with a variable step size, an averaged process with a constant step size, and a process with a constant or variable step size that uses all observations up to the current step without storing them. These processes are compared to classical processes on 11 datasets. The third process, with a constant step size, typically yields the best results.
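To make the idea of "online standardized data" concrete, here is a minimal sketch of a constant-step, averaged stochastic gradient process for streaming linear regression in which each incoming observation is standardized with running mean and variance estimates; the specific updates, step size and simulated data are our own assumptions and do not reproduce the three processes whose almost sure convergence is studied in the thesis.

```python
import numpy as np

def averaged_sgd_online_standardized(stream, dim, step=0.05):
    """Averaged constant-step SGD for linear regression on online-standardized inputs.

    Running means/variances are updated from the stream itself (Welford updates),
    each incoming x is standardized with the current estimates, and the Polyak
    average of the iterates is returned as the final estimate.
    """
    theta = np.zeros(dim + 1)          # coefficients + intercept
    theta_bar = np.zeros(dim + 1)
    mean = np.zeros(dim)
    m2 = np.ones(dim)                  # running sum of squared deviations
    for n, (x, y) in enumerate(stream, start=1):
        delta = x - mean
        mean += delta / n              # Welford update of the running mean
        m2 += delta * (x - mean)
        std = np.sqrt(m2 / n) + 1e-8
        z = np.append((x - mean) / std, 1.0)   # standardized features + intercept
        grad = (z @ theta - y) * z             # squared-loss stochastic gradient
        theta -= step * grad
        theta_bar += (theta - theta_bar) / n   # Polyak-Ruppert averaging
    return theta_bar

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(20000, 2))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=20000)
print(np.round(averaged_sgd_online_standardized(zip(X, y), dim=2), 1))
# approximately [6., -2., 10.]: coefficients on the standardized scale plus intercept
```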
Madra, Anna. "Analyse et visualisation de la géométrie des matériaux composites à partir de données d’imagerie 3D". Thesis, Compiègne, 2017. http://www.theses.fr/2017COMP2387/document.
Full text source
This thesis project, carried out between the Laboratoire Roberval at Université de Technologie Compiègne and the Center for High-Performance Composites at Ecole Polytechnique de Montréal, considered the design of a deep learning architecture with semantics for the automatic generation of models of composite material microstructure based on X-ray microtomographic imagery. The thesis consists of three major parts. Firstly, the methods of microtomographic image processing are presented, with an emphasis on phase segmentation. Then, the geometric features of phase elements are extracted and used to classify and identify new morphologies. The method is presented for composites filled with short natural fibers. The classification approach is also demonstrated for the study of defects in composites, but with spatial features added to the process. A high-level descriptor, the "defect genome", is proposed, which permits comparison of the state of defects between specimens. The second part of the thesis introduces structural segmentation on the example of a woven reinforcement in a composite. The method relies on dual kriging, calibrated by the segmentation error from learning algorithms. In the final part, a stochastic formulation of the kriging model is presented based on Gaussian Processes, and the distribution of physical properties of a composite microstructure is retrieved, ready for numerical simulation of the manufacturing process or of mechanical behavior.
Bourdy, Emilien. "algorithmes de big data adaptés aux réseaux véhiculaires pour modélisation de comportement de conducteur". Thesis, Reims, 2018. http://www.theses.fr/2018REIMS001/document.
Full text source
Big Data is gaining a lot of attention from various research communities, as massive data have become a real issue and processing such data is now possible thanks to the high computation capacity of today's equipment. Meanwhile, the era of Vehicular Ad-hoc Networks (VANETs) is beginning: connected vehicles are being manufactured and will become an important part of the vehicle market. The topology of this type of network is in constant evolution, accompanied by massive data coming from the increasing number of connected vehicles in the network. In this thesis, we handle this topic by providing, as a first contribution, a discussion of the different aspects of Big Data in VANETs; for each key step of Big Data processing, we raise the corresponding VANET issues. The second contribution is the extraction of VANET characteristics in order to collect data. To do that, we discuss how to establish test scenarios and how to emulate an environment for these tests. We first conduct an implementation in a controlled environment, before performing tests in a real environment in order to obtain real VANET data. For the third contribution, we propose an original approach to driver behavior modeling, based on an algorithm that extracts representative populations, called samples, using a local density in a neighborhood concept.
Bassino, Frédérique. "Automates, énumération et algorithmes". Habilitation à diriger des recherches, Université de Marne la Vallée, 2005. http://tel.archives-ouvertes.fr/tel-00719172.
Full text source
El Alaoui, Imane. "Transformer les big social data en prévisions - méthodes et technologies : Application à l'analyse de sentiments". Thesis, Angers, 2018. http://www.theses.fr/2018ANGE0011/document.
Full text source
Extracting public opinion by analyzing big social data has grown substantially due to its interactive, real-time nature. Indeed, our actions on social media generate digital traces that are closely related to our personal lives and can be used to accompany major events by analysing people's behavior. It is in this context that we are particularly interested in Big Data analysis methods. The volume of these daily-generated traces increases exponentially, creating massive loads of information known as big data. Such a volume of information can neither be stored nor processed using conventional tools, so new tools have emerged to help us cope with the big data challenges. The aim of the first part of this manuscript is therefore to go through the pros and cons of these tools, compare their respective performances and highlight some of their interrelated applications, such as health, marketing and politics. We also introduce the general context of big data, Hadoop and its different distributions, and provide a comprehensive overview of big data tools and their related applications. The main contribution of this PhD thesis is to propose a generic analysis approach to automatically detect trends on given topics from big social data. Indeed, given a very small set of manually annotated hashtags, the proposed approach transfers information from hashtags with known sentiment (positive or negative) to individual words. The resulting lexical resource is a large-scale polarity lexicon whose effectiveness is measured on different sentiment analysis tasks. The comparison of our method with different paradigms in the literature confirms its value for designing accurate sentiment analysis systems. Indeed, our model reaches an overall accuracy of 90.21%, significantly exceeding current models for social sentiment analysis.
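The propagation idea summarized above (scoring words by how often they co-occur with hashtags of known polarity) can be sketched with simple smoothed log-odds counts; the seed hashtags, toy tweets and scoring formula below are our own illustrative assumptions, not the lexicon-construction method that reaches the accuracy reported in the thesis.

```python
import math
import re
from collections import Counter

positive_seeds = {"#love", "#happy"}
negative_seeds = {"#fail", "#angry"}

tweets = [
    "what a great day #happy",
    "this service is brilliant #love",
    "terrible support again #fail",
    "so disappointed and tired #angry",
]

pos_counts, neg_counts = Counter(), Counter()
for tweet in tweets:
    tokens = re.findall(r"[#\w']+", tweet.lower())
    hashtags = {t for t in tokens if t.startswith("#")}
    words = [t for t in tokens if not t.startswith("#")]
    if hashtags & positive_seeds:
        pos_counts.update(words)
    if hashtags & negative_seeds:
        neg_counts.update(words)

def polarity(word, alpha=1.0):
    """Smoothed log-odds of appearing with positive vs. negative seed hashtags."""
    return math.log((pos_counts[word] + alpha) / (neg_counts[word] + alpha))

for w in ["great", "brilliant", "terrible", "disappointed"]:
    print(w, round(polarity(w), 2))
# words from positive contexts score > 0, words from negative contexts score < 0
```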
Kleisarchaki, Sofia. "Analyse des différences dans le Big Data : Exploration, Explication, Évolution". Thesis, Université Grenoble Alpes (ComUE), 2016. http://www.theses.fr/2016GREAM055/document.
Full text source
Variability in Big Data refers to data whose meaning changes continuously. For instance, data derived from social platforms and from monitoring applications exhibits great variability. This variability is essentially the result of changes in the underlying data distributions of attributes of interest, such as user opinions/ratings, computer network measurements, etc. Difference Analysis aims to study variability in Big Data. To achieve that goal, data scientists need: (a) measures to compare data in various dimensions, such as age for users or topic for network traffic, and (b) efficient algorithms to detect changes in massive data. In this thesis, we identify and study three novel analytical tasks to capture data variability: Difference Exploration, Difference Explanation and Difference Evolution. Difference Exploration is concerned with extracting the opinion of different user segments (e.g., on a movie rating website). We propose appropriate measures for comparing user opinions in the form of rating distributions, and efficient algorithms that, given an opinion of interest in the form of a rating histogram, discover agreeing and disagreeing populations. Difference Explanation tackles the question of providing a succinct explanation of differences between two datasets of interest (e.g., buying habits of two sets of customers). We propose scoring functions designed to rank explanations, and algorithms that guarantee explanation conciseness and informativeness. Finally, Difference Evolution tracks change in an input dataset over time and summarizes change at multiple time granularities. We propose a query-based approach that uses similarity measures to compare consecutive clusters over time. Our indexes and algorithms for Difference Evolution are designed to capture different data arrival rates (e.g., low, high) and different types of change (e.g., sudden, incremental). The utility and scalability of all our algorithms rely on hierarchies inherent in data (e.g., time, demographics). We run extensive experiments on real and synthetic datasets to validate the usefulness of the three analytical tasks and the scalability of our algorithms. We show that Difference Exploration guides end-users and data scientists in uncovering the opinion of different user segments in a scalable way. Difference Explanation reveals the need to parsimoniously summarize differences between two datasets and shows that parsimony can be achieved by exploiting hierarchy in data. Finally, our study on Difference Evolution provides strong evidence that a query-based approach is well suited to tracking change in datasets with varying arrival rates and at multiple time granularities. Similarly, we show that different clustering approaches can be used to capture different types of change.
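Comparing the rating distributions of two user segments, as in the Difference Exploration task, ultimately reduces to a distance between normalized histograms. Total variation distance is one simple choice and is used below purely as an illustration; it is not presented here as the comparison measure actually proposed in the thesis.

```python
import numpy as np

def total_variation(hist_a, hist_b):
    """Total variation distance between two rating histograms (0 = identical, 1 = disjoint)."""
    p = np.asarray(hist_a, dtype=float)
    q = np.asarray(hist_b, dtype=float)
    p /= p.sum()
    q /= q.sum()
    return 0.5 * np.abs(p - q).sum()

# Counts of 1-to-5-star ratings for two user segments of the same movie.
teens  = [5, 10, 20, 40, 25]
adults = [30, 25, 20, 15, 10]
print(round(total_variation(teens, adults), 3))   # 0.4 -> clearly disagreeing segments
print(round(total_variation(teens, teens), 3))    # 0.0 -> identical opinion
```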
Kacem, Fadi. "Algorithmes exacts et approchés pour des problèmes d'ordonnancement et de placement". Thesis, Evry-Val d'Essonne, 2012. http://www.theses.fr/2012EVRY0007/document.
Full text source
In this thesis, we focus on solving combinatorial optimization problems that we have chosen to study in two parts. Firstly, we study optimization problems arising from scheduling a set of tasks on computing machines, where we seek to minimize the total energy consumed by these machines while maintaining an acceptable quality of service. In a second part, we discuss two optimization problems, namely a classical scheduling problem on parallel-machine architectures with communication delays, and a problem of placing data in graphs representing peer-to-peer networks, where the goal is to minimize the total cost of data access.
Yameogo, Relwende Aristide. "Risques et perspectives du big data et de l'intelligence artificielle : approche éthique et épistémologique". Thesis, Normandie, 2020. http://www.theses.fr/2020NORMLH10.
Full text source
In the 21st century, the use of big data and AI in the field of health has gradually expanded, although it is accompanied by problems linked to the emergence of practices based on the use of digital traces. The aim of this thesis is to evaluate the use of big data and AI in medical practice, to describe the processes generated by digital tools in the field of health and to highlight the ethical problems they pose. The use of ICTs in medical practice is mainly based on the use of EHRs, prescription software and connected objects. These uses raise many problems for physicians, who are aware of the risks involved in protecting patients' health data. In this work, we implement a method for designing CDSSs, the temporal fuzzy vector space, which allows us to model a new clinical diagnostic score for pulmonary embolism. Through the "Human-trace" paradigm, our research allows us not only to measure the limitations in the use of ICT, but also to highlight the interpretative biases due to the delinking between the individual, caught in his complexity as a "Human-trace", and the data circulating about him via digital traces. While big data coupled with AI can play a major role in the implementation of CDSSs, it cannot be limited to this field. We also study how to set up big data and AI development processes that respect the deontological and medical ethics rules associated with the appropriation of ICTs by the actors of the health system.
Clément, Julien. "Algorithmes, mots et textes aléatoires". Habilitation à diriger des recherches, Université de Caen, 2011. http://tel.archives-ouvertes.fr/tel-00913127.
Full text source
Stehlé, Damien. "Réseaux Euclidiens : Algorithmes et Cryptographie". Habilitation à diriger des recherches, Ecole normale supérieure de lyon - ENS LYON, 2011. http://tel.archives-ouvertes.fr/tel-00645387.
Pełny tekst źródłaBouafia-Djalab, Soumaya. "Big Data dans les entreprises : transformations organisationnelles, modèles d'usages et modèles d'affaires". Thesis, Pau, 2019. http://www.theses.fr/2019PAUU2068.
Full text source
Big data, blockchain, connected objects, artificial intelligence, etc.: all these terms refer to new information technologies, each relayed until the arrival of the next innovation. However, they all have one thing in common: data. These technologies accelerate the mass production of various data and open access to them in real time. We are talking about 3V Big Data: Volume, Variety and Velocity. These characteristics of data attract the interest of many professional and academic actors; they also raise various questions about their appropriation and generate transformations at different levels. Our research work deals with the issue of the valuation of Big Data by companies. We have thus tried to understand how companies manage to create value from such massive data, and to identify the various business models specific to the exploitation of such data. We have also tried to specify the related organizational transformations. We provide a typology of Big Data business models, comprising 9 types grouped into 5 categories.
Primicerio, Kevin. "Comportement des traders institutionnels et microstructure des marchés : une approche big data". Thesis, Université Paris-Saclay (ComUE), 2018. http://www.theses.fr/2018SACLC036/document.
Full text source
The thesis is divided into four parts. Part I introduces and provides a technical description of the FactSet Ownership dataset, together with some preliminary statistics and a set of stylized facts emerging from the portfolio structure of large financial institutions and from the capitalization of recorded securities. Part II proposes a method to assess the statistical significance of the overlap between pairs of heterogeneously diversified portfolios. This method is then applied to public asset ownership data reported by financial institutions in order to infer statistically robust links between the portfolios of financial institutions based on similar patterns of investment. From an economic point of view, it is suspected that the overlapping holdings of financial institutions are an important channel for financial contagion, with the potential to trigger fire sales and thus severe losses at a systemic level. Part III investigates the collective behaviour of fund managers and, in particular, how the average portfolio structure of institutional investors optimally accounts for transaction costs when investment constraints are weak. The collective ability of a crowd to accurately estimate an unknown quantity is known as the Wisdom of the Crowd: in many situations, the median or average estimate of a group of unrelated individuals is surprisingly close to the true value. In Part IV, we use more than 6.7 billion trades from the Thomson-Reuters Tick History database and the ownership data from FactSet. We show how the tick-by-tick dynamics of limit order book data depends on the aggregate actions of large funds acting on a much larger time scale. In particular, we find that the well-established long memory of market order signs is markedly weaker when large investment funds trade in a markedly directional way or when their aggregate participation ratio is large. Conversely, we investigate to what extent an asset with weak memory experiences directional trading from large funds.
Marcu, Ovidiu-Cristian. "KerA : Un Système Unifié d'Ingestion et de Stockage pour le Traitement Efficace du Big Data : Un Système Unifié d'Ingestion et de Stockage pour le Traitement Efficace du Big Data". Thesis, Rennes, INSA, 2018. http://www.theses.fr/2018ISAR0028/document.
Full text source
Big Data is now the new natural resource. Current state-of-the-art Big Data analytics architectures are built on top of a three-layer stack: data streams are first acquired by the ingestion layer (e.g., Kafka) and then flow through the processing layer (e.g., Flink), which relies on the storage layer (e.g., HDFS) for storing aggregated data or for archiving streams for later processing. Unfortunately, in spite of the potential benefits brought by specialized layers (e.g., simplified implementation), moving large quantities of data through specialized layers is not efficient: instead, data should be acquired, processed and stored while minimizing the number of copies. This dissertation argues that a plausible path towards alleviating these limitations is the careful design and implementation of a unified architecture for stream ingestion and storage, which can lead to the optimization of the processing of Big Data applications. This approach minimizes data movement within the analytics architecture, finally leading to better utilized resources. We identify a set of requirements for a dedicated stream ingestion/storage engine. We explain the impact of different Big Data architectural choices on end-to-end performance. We propose a set of design principles for a scalable, unified architecture for data ingestion and storage. We implement and evaluate the KerA prototype with the goal of efficiently handling diverse access patterns: low-latency access to streams and/or high-throughput access to streams and/or objects.
Salikhov, Kamil. "Algorithmes et structures de données efficaces pour l’indexation de séquences d’ADN". Thesis, Paris Est, 2017. http://www.theses.fr/2017PESC1232/document.
Full text source
The amounts of data generated by Next Generation Sequencing technologies have increased exponentially in recent years. Storing, processing and transferring this data are becoming more and more challenging tasks, and to cope with them data scientists must develop ever more efficient approaches and techniques. In this thesis we present efficient data structures and algorithmic methods for the problems of approximate string matching, genome assembly, read compression and taxonomy-based metagenomic classification. Approximate string matching is an extensively studied problem with a countless number of published papers, both theoretical and practical. In bioinformatics, the read mapping problem can be regarded as approximate string matching. Here we study string matching strategies based on bidirectional indices. We define a framework, called search schemes, to work with search strategies of this type, then provide a probabilistic measure for the efficiency of search schemes, prove several combinatorial properties of efficient search schemes and provide experimental computations supporting the superiority of our strategies. Genome assembly is one of the basic problems of bioinformatics. Here we present the Cascading Bloom filter, a data structure that improves on the standard Bloom filter and can be applied to several problems such as genome assembly. We provide theoretical and experimental results proving properties of the Cascading Bloom filter. We also show how the Cascading Bloom filter can be used to solve another important problem, read compression. Another problem studied in this thesis is metagenomic classification. We present a BWT-based approach that improves the BWT-index for quick and memory-efficient k-mer search. We mainly focus on data structures that improve the speed and memory usage of the classical BWT-index for our application.
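For context, the standard Bloom filter that the Cascading variant builds upon can be sketched in a few lines; the hash construction, sizes and k-mer example below are our own illustrative assumptions and say nothing about how the cascading levels of the thesis are organized.

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)

    def _positions(self, item):
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

# Index a few k-mers from a read, then query membership.
bf = BloomFilter()
for kmer in ("ACGTG", "CGTGA", "GTGAC"):
    bf.add(kmer)
print("ACGTG" in bf, "TTTTT" in bf)  # True, almost certainly False
```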
Maria, Clément. "Algorithmes et structures de données en topologie algorithmique". Thesis, Nice, 2014. http://www.theses.fr/2014NICE4081/document.
Full text source
The theory of homology generalizes the notion of connectivity in graphs to higher dimensions. It defines a family of groups on a domain, described discretely by a simplicial complex, that captures the connected components, the holes, the cavities and their higher-dimensional equivalents. In practice, the generality and flexibility of homology allow the analysis of complex data, interpreted as point clouds in metric spaces. The theory of persistent homology introduces a robust notion of homology for topology inference. Its applications are various and range from the description of high-dimensional configuration spaces of complex dynamical systems to the classification of shapes under deformation and learning in medical imaging. In this thesis, we explore the algorithmic ramifications of persistent homology. We first introduce the simplex tree, an efficient data structure to construct and maintain high-dimensional simplicial complexes. We then present a fast implementation of persistent cohomology via the compressed annotation matrix data structure. We also refine the computation of persistence by describing ideas of homological torsion in this framework, and introduce the modular reconstruction method for computation. Finally, we present an algorithm to compute zigzag persistent homology, an algebraic generalization of persistence. To do so, we introduce new local transformation theorems in quiver representation theory, called diamond principles. All algorithms are implemented in the computational library Gudhi.
Woloszko, Nicolas. "Essays on Nowcasting : Machine learning, données haute-fréquence et prévision économique". Electronic Thesis or Diss., CY Cergy Paris Université, 2024. http://www.theses.fr/2024CYUN1257.
Full text source
COVID-19 abruptly accelerated the need for policy makers to have real-time data on economic activity, and gave new impetus to earlier research on the use of new methods in economics, such as machine learning and Big Data. While the big data created and held by companies contain real-time information on the economy, their processing requires a specific approach that relies on non-linear algorithms. This research introduces the OECD Weekly Tracker, a tool for monitoring economic activity that provides real-time estimates of weekly GDP for 48 countries. It relies on Google Trends data, which reflect the evolution of topics of interest to Google Search users. The main methodological innovation lies in the implementation of a non-linear panel model: a neural network jointly models the relationship between GDP and Google Trends data for 48 countries while allowing for disparities between countries in this relationship. The Weekly Tracker has been updated every week since the summer of 2020. An analysis of its historical performance shows that its forecasts were more reliable than those of the OECD's flagship publication, the Economic Outlook, during the years of the pandemic. A study of the numerous press publications citing figures from the Weekly Tracker also shows that this tool provided qualitatively accurate and relevant indications for guiding policy making. The policy relevance of the Weekly Tracker lies in both its timeliness and its high frequency. The weekly series, available since 2004, allow for retrospective policy analysis that exploits high-frequency statistical identification methods. The Tracker is used in an article published in Nature Communications, which examines the consequences of the introduction of COVID certificates in France, Italy and Germany. Among other findings, it reveals that vaccination efforts were positively correlated with economic growth and that the COVID certificates led to a €6 billion increase in GDP in France, and to increases of €1.4 billion and €2.1 billion in Germany and Italy, respectively.
Milojevic, Dragomir. "Implémentation des filtres non-linéaires de rang sur des architectures universelles et reconfigurables". Doctoral thesis, Universite Libre de Bruxelles, 2004. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/211147.
Full text source
Rank filters are considered a major bottleneck in the processing chain because of the sorting of pixels within each neighbourhood, which must be performed for every pixel of the image. Computation times increase significantly with the size of the image to be processed, the size of the neighbourhood considered, and as the rank approaches the median.
This thesis proposes two solutions for accelerating the processing time of rank filters.
The first solution exploits the different levels of parallelism of today's personal computers, notably data parallelism and inter-processor parallelism. Such an approach yields a speed-up factor of about 10 compared with a classical approach that abstracts away the hardware through high-level language compilers. While the resulting throughput of processed pixels, on the order of ten million pixels per second, allows real-time operation for video applications, little time remains for other processing in the chain.
The second proposed solution is based on the concept of reconfigurable computing and is implemented with FPGA (Field Programmable Gate Array) circuits. The described system combines bit-serial algorithms with the high density of current FPGAs. The result is a highly parallel processing system, involving hundreds of processing units per FPGA, which achieves an additional speed-up factor of about 10 over the first solution. Such a system, inserted between a digital image source and a host system, computes rank filters at a throughput on the order of a hundred million pixels per second.
Doctorate in applied sciences
info:eu-repo/semantics/nonPublished
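For readers unfamiliar with rank filters, the per-pixel neighbourhood sort identified above as the bottleneck can be written naively as follows; this scalar Python version is only meant to show where the computational cost comes from, not to reflect the SIMD or FPGA bit-serial implementations developed in the thesis.

```python
import numpy as np

def rank_filter(image, rank, radius=1):
    """Naive 2-D rank filter: sort each (2r+1)x(2r+1) neighbourhood, keep the rank-th value.

    rank = 0 gives an erosion (minimum), the middle rank gives the median filter,
    and the last rank gives a dilation (maximum). Edges are handled by replication.
    """
    padded = np.pad(image, radius, mode="edge")
    out = np.empty_like(image)
    h, w = image.shape
    for y in range(h):
        for x in range(w):
            window = padded[y:y + 2 * radius + 1, x:x + 2 * radius + 1]
            out[y, x] = np.sort(window, axis=None)[rank]   # the costly per-pixel sort
    return out

img = np.array([[10, 10, 10, 10],
                [10, 99, 10, 10],    # impulse noise
                [10, 10, 10, 10],
                [10, 10, 10, 10]])
print(rank_filter(img, rank=4))      # rank 4 of 9 is the median: the outlier disappears
```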
Zheng, Wenjing. "Apprentissage ciblé et Big Data : contribution à la réconciliation de l'estimation adaptative et de l’inférence statistique". Thesis, Sorbonne Paris Cité, 2016. http://www.theses.fr/2016USPCB044/document.
Full text source
This dissertation focuses on developing robust semiparametric methods for complex parameters that emerge at the interface of causal inference and biostatistics, with applications to epidemiological and medical research in the era of Big Data. Specifically, we address two statistical challenges that arise in bridging the disconnect between data-adaptive estimation and statistical inference. The first challenge arises in maximizing the information learned from Randomized Controlled Trials (RCTs) through the use of adaptive trial designs. We present a framework to construct and analyze group-sequential covariate-adjusted response-adaptive (CARA) RCTs that admits the use of data-adaptive approaches in constructing the randomization schemes and in estimating the conditional response model. This framework adds to the existing literature on CARA RCTs by allowing flexible options in both their design and analysis and by providing robust effect estimates even under model mis-specification. The second challenge arises from obtaining a Central Limit Theorem when data-adaptive estimation is used to estimate the nuisance parameters. We consider as target parameter of interest the marginal risk difference of the outcome under a binary treatment, and propose a Cross-Validated Targeted Minimum Loss Estimator (CV-TMLE), which augments the classical TMLE with a sample-splitting procedure. The proposed CV-TMLE inherits the double robustness and efficiency properties of the classical TMLE, and achieves asymptotic linearity under minimal conditions by avoiding the Donsker class condition.
Nesvijevskaia, Anna. "Phénomène Big Data en entreprise : processus projet, génération de valeur et Médiation Homme-Données". Thesis, Paris, CNAM, 2019. http://www.theses.fr/2019CNAM1247.
Full text source
Big Data, a sociotechnical phenomenon carrying its own myths, is reflected in companies by the implementation of their first projects, especially Data Science projects. However, these projects do not seem to generate the expected value. The action research carried out over the course of three years in the field, through an in-depth qualitative study of multiple cases, points to key factors that limit this generation of value, including overly self-contained project process models. The result is (1) an open data-project model (Brizo_DS), oriented towards usage and including knowledge capitalization, intended to reduce the uncertainties inherent in these exploratory projects and transferable to the scale of portfolio management of corporate data projects. It is complemented by (2) a tool for documenting the quality of the processed data, the Databook, and (3) a Human-Data Mediation device, which together guarantee the alignment of the actors towards an optimal result.
Kemp, Gavin. "CURARE : curating and managing big data collections on the cloud". Thesis, Lyon, 2018. http://www.theses.fr/2018LYSE1179/document.
Full text source
The emergence of new platforms for decentralized data creation, such as sensor and mobile platforms, and the increasing availability of open data on the Web are adding to the number of data sources inside organizations and bring an unprecedented volume of Big Data to be explored. The notion of data curation has emerged to refer to the maintenance of data collections and the preparation and integration of datasets, combining them to perform analytics. Curation tasks include extracting explicit and implicit metadata, and semantic metadata matching and enrichment to add quality to the data. Next-generation data management engines should promote techniques with a new philosophy to cope with the deluge of data. They should aid the user in understanding the content of data collections and provide guidance for exploring data: a scientist can explore data collections stepwise and stop when the content and quality reach a satisfactory point. Our work adopts this philosophy, and our main contribution is a data collection curation approach and exploration environment named CURARE. CURARE is a service-based system for curating and exploring Big Data. CURARE implements a data collection model that we propose, used for representing the content of collections in terms of structural and statistical metadata organised under the concept of a view. A view is a data structure that provides an aggregated perspective of the content of a data collection and its several associated releases. CURARE provides tools focused on computing and extracting views using data analytics methods, as well as functions for exploring (querying) metadata. Exploiting Big Data requires a substantial number of decisions by data analysts to determine the best way to store, share and process data collections in order to get the maximum benefit and knowledge from them. Instead of manually exploring data collections, CURARE provides tools, integrated in an environment, for assisting data analysts in determining which collections can best be used for achieving an analytics objective. We implemented CURARE and explain how to deploy it on the cloud using data science services on top of which CURARE services are plugged. We have conducted experiments to measure the cost of computing views based on datasets from Grand Lyon and Twitter, to provide insight about the interest of our data curation approach and environment.
Lechuga, lopez Olga. "Contributions a l’analyse de données multivoie : algorithmes et applications". Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLC038/document.
Full text source
In this thesis we develop a framework for the extension of commonly used linear statistical methods (Fisher Discriminant Analysis, Logistic Regression, Cox Regression and Regularized Canonical Correlation Analysis) to the multiway context. In contrast to their standard formulation, their multiway generalization relies on structural constraints imposed on the weight vectors, which integrate the original tensor structure of the data within the optimization process. This structural constraint yields a more parsimonious and interpretable model. Different strategies to deal with high dimensionality are also considered. The application of these algorithms is illustrated on two real datasets: (i) the discrimination of spectroscopy data, for which all methods were tested, and (ii) the prediction of the long-term recovery of patients after traumatic brain injury from multi-modal brain Magnetic Resonance Imaging. On both datasets our methods yield valuable results compared to the standard approach.
Hadjipavlou, Elena. "Big data, surveillance et confiance : la question de la traçabilité dans le milieu aéroportuaire". Thesis, Université Côte d'Azur (ComUE), 2016. http://www.theses.fr/2016AZUR2044/document.
Pełny tekst źródła
This research project questions, in a comprehensive and critical way, the presence of digital traces in the era of Big Data, a reflection that unfolds within the relation between surveillance and trust. In recent years, "Big Data" has been used massively and repeatedly to describe a new societal dynamic characterized by the production of massive quantities of data, and by the enormous potential benefits of using new statistical tools to analyze the data generated by connected objects and tools in more and more human activities. The airport sector is currently facing a major transformation, fueled by the explosion of data within its structures: the data generated during a passenger's journey are now extremely massive. There is no doubt that the management of these data is an important lever for safety, for the improvement of services and for the comfort of the passenger. However, the expected benefits raise a great question: where do these data go? We do not know. And as long as we do not know, how can we trust? These considerations are examined at Larnaca airport in Cyprus. The different angles of approach, as well as the diversity of the actors involved, required the creation of a multidimensional corpus resulting from a mixed methodology, in order to approach the subject comprehensively. This corpus includes interviews, questionnaires and life stories of passengers and professionals. The qualitative and quantitative analysis that followed was based on a theoretical framework elaborated beforehand, in order to cross the actors' representations of surveillance and trust and, finally, to highlight the different visions inherent to this issue.
Viennot, Laurent. "Quelques algorithmes parallèles et séquentiels de traitement des graphes et applications". Phd thesis, Université Paris-Diderot - Paris VII, 1996. http://tel.archives-ouvertes.fr/tel-00471691.
Pełny tekst źródłaPasquier, Nicolas. "Data Mining : algorithmes d'extraction et de réduction des règles d'association dans les bases de données". Phd thesis, Université Blaise Pascal - Clermont-Ferrand II, 2000. http://tel.archives-ouvertes.fr/tel-00467764.
Pełny tekst źródłaPasquier, Nicolas. "Data mining : algorithmes d'extraction et de reduction des regles d'association dans les bases de donnees". Clermont-Ferrand 2, 2000. https://tel.archives-ouvertes.fr/tel-00467764.
Pełny tekst źródła
Barredo, Escribano Maria. "La construction de l'identité sur Internet : mutations et transformations dans le web social". Thesis, Limoges, 2015. http://www.theses.fr/2015LIMO0081.
Pełny tekst źródła
On the basis of this analysis, we propose to consider digital identity as a complex construction process that may be regarded from several angles. In constant mutation, a variety of stakeholders present on the Internet play different roles in the construction of the individual's online identity. On the one hand, the emergence of the social web turns the user, in the form of a social media profile, into a multi-positional actor (sender, transmitter, receiver, etc.) and gives him or her a relational identity as well. On the other hand, the constraints imposed by the network, and issues situated at different levels of analysis, suggest reviewing the horizontal hierarchy between nodes, the web's minimal units, which are in turn embodied by the users. Could a node therefore be social? Could digital interactive communication be based on presumptions that exclude the individual? Beyond the relational identity of the social web, can we conceive of a digital identity equivalent to the real identity of an individual in any society? The conditions, premises and confluence of different digital practices are indeed the elements to be analysed in order to find suitable answers to our general problem. The criteria for considering a concept such as identity, and the preservation of the user's real identity as a citizen, are the main axes of our analysis; more precisely, an analysis focused on the current state of the contemporary Internet with regard to the individual as we conceive of him or her today.
Chihoub, Houssem-Eddine. "Managing Consistency for Big Data Applications on Clouds: Tradeoffs and Self Adaptiveness". Phd thesis, École normale supérieure de Cachan - ENS Cachan, 2013. http://tel.archives-ouvertes.fr/tel-00915091.
Pełny tekst źródła
Toss, Julio. "Algorithmes et structures de données parallèles pour applications interactives". Thesis, Université Grenoble Alpes (ComUE), 2017. http://www.theses.fr/2017GREAM056/document.
Pełny tekst źródła
The quest for performance has been a constant throughout the history of computing systems. It has now been more than a decade since the sequential processing model showed its first signs of exhaustion in sustaining performance improvements. The walls faced by sequential computation pushed a paradigm shift and established parallel processing as the standard in modern computing systems. With the widespread adoption of parallel computers, many algorithms and applications have been ported to fit these new architectures. However, in unconventional applications with interactivity and real-time requirements, achieving efficient parallelizations is still a major challenge. Real-time performance requirements show up, for instance, in user-interactive simulations, where the system must be able to react to the user's input within a computation time-step of the simulation loop. The same kind of constraint appears in streaming-data monitoring applications, for instance when an external source of data, such as traffic sensors or social media posts, provides a continuous flow of information to be consumed by an on-line analysis system: the consumer system has to keep a controlled memory budget and deliver quickly processed information about the stream. Common optimizations relying on pre-computed models or a static index of the data are not possible in these highly dynamic scenarios. The dynamic nature of the data brings up several performance issues originating from the problem decomposition for parallel processing and from the data locality maintenance needed for efficient cache utilization. In this thesis we address data-dependent problems in two different applications: one in physics-based simulation and the other in streaming data analysis. For the simulation problem, we present a parallel GPU algorithm for computing multiple shortest paths and Voronoi diagrams on a grid-like graph. For the streaming data analysis problem, we present a parallelizable data structure, based on packed memory arrays, for indexing dynamic geo-located data while keeping good memory locality.
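As a hedged, purely sequential illustration of the first contribution (not the GPU algorithm itself), the sketch below runs a multi-source breadth-first search on a grid graph: it simultaneously yields each cell's shortest-path distance to its nearest seed and the seed owning it, i.e. a discrete Voronoi diagram; the grid size and seed positions are invented for the example.

# Multi-source BFS on a grid graph: distances and discrete Voronoi regions.
from collections import deque

def grid_voronoi(width, height, seeds):
    """seeds: list of (x, y) cells; returns (distance, owner) grids."""
    dist = [[-1] * width for _ in range(height)]
    owner = [[None] * width for _ in range(height)]
    queue = deque()
    for label, (x, y) in enumerate(seeds):
        dist[y][x], owner[y][x] = 0, label
        queue.append((x, y))
    while queue:
        x, y = queue.popleft()
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height and dist[ny][nx] == -1:
                dist[ny][nx] = dist[y][x] + 1      # shortest path via BFS
                owner[ny][nx] = owner[y][x]        # propagate the seed label
                queue.append((nx, ny))
    return dist, owner

dist, owner = grid_voronoi(8, 4, seeds=[(0, 0), (7, 3)])
print(owner[0])   # first row of cells, labelled by their closest seed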
Attal, Jean-Philippe. "Nouveaux algorithmes pour la détection de communautés disjointes et chevauchantes basés sur la propagation de labels et adaptés aux grands graphes". Thesis, Cergy-Pontoise, 2017. http://www.theses.fr/2017CERG0842/document.
Pełny tekst źródła
Graphs are mathematical structures consisting of a set of nodes (objects or persons) in which some pairs are linked by edges; they can be used to model complex systems. One of the main problems in graph theory is community detection, which aims to find a partition of the nodes of a graph in order to understand its structure. For instance, by representing insurance contracts as nodes and their relationships as edges, detecting groups of highly connected nodes makes it possible to detect similar profiles and to evaluate risk profiles. Several algorithms have been proposed in response to this currently open research field. One of the fastest methods is label propagation, a local method in which each node changes its own label according to its neighbourhood. Unfortunately, this method has two major drawbacks. The first is its instability: two runs rarely give the same result. The second is bad propagation, which can lead to huge communities without meaning (the giant-community problem). The first contribution of the thesis is (i) a stabilisation method for label propagation that places artificial dams on the edges of certain networks in order to limit bad label propagation. Complex networks are also characterized by nodes that may belong to several communities; we call this a cover. For example, in protein-protein interaction networks, some proteins may have several functions, and detecting these functions according to their communities could help to cure cancers. The second contribution of this thesis is (ii) the implementation of an algorithm with functions to detect potential overlapping nodes. The size of the graphs also has to be considered, because some networks contain several million nodes and edges, like the Amazon product co-purchasing network. We therefore propose (iii) a parallel and a distributed version of community detection using core label propagation. A study and a comparative analysis of the proposed algorithms are carried out based on the quality of the resulting partitions and covers.
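For reference, here is a minimal sketch of plain label propagation, the local method the thesis starts from; the dam-based stabilisation, the overlap detection and the parallel/distributed variants described above are not reproduced, and the toy graph is invented for the example.

# Plain (unstabilised) label propagation: each node repeatedly adopts the
# label most frequent among its neighbours, until no label changes.
import random
from collections import Counter

def label_propagation(adj, max_iter=100, seed=0):
    """adj: dict node -> list of neighbours; returns node -> community label."""
    rng = random.Random(seed)
    labels = {v: v for v in adj}              # start with one label per node
    nodes = list(adj)
    for _ in range(max_iter):
        rng.shuffle(nodes)                    # asynchronous, random order
        changed = False
        for v in nodes:
            if not adj[v]:
                continue
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            candidates = [l for l, c in counts.items() if c == best]
            new = rng.choice(candidates)      # break ties at random
            if new != labels[v]:
                labels[v], changed = new, True
        if not changed:                       # labels are stable
            break
    return labels

# Two triangles joined by a single bridge edge: two communities expected.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(label_propagation(adj))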
Gayet, Amaury. "Méthode de valorisation comptable temps réel et big data : étude de cas appliquée à l'industrie papetière". Thesis, Paris 10, 2018. http://www.theses.fr/2018PA100001/document.
Pełny tekst źródła
Context: IP Leanware is a growing start-up. Created in 2008, it has quadrupled its consolidated sales in four years and established two subsidiaries (Brazil and the United States); since then, its growth has remained in double digits (2015). It optimizes the performance of industrial companies with a software product (BrainCube) that identifies overperformance conditions. The thesis, carried out under a CIFRE agreement within the R&D department led by Sylvain Rubat du Mérac, is located at the interface of management control, production management and information systems. Aim: BrainCube manages massive descriptive data about its customers' process flows. Its analysis engine identifies overperformance situations and broadcasts them in real time through tactile interfaces. BrainCube couples two flows, informational and physical; the mission is to integrate the economic variable. A literature study shows that the simultaneous real-time evaluation of physical, informational and financial flows, coupled with the continuous improvement of production processes, has not yet been realized. Result: A literature review examines the practices and methods of management control in order to propose a real-time method adapted to the specificities of BrainCube. The case study, based on engineering research, proposes a generic methodology for modeling the economic variable. Configurable generic decision models are proposed; they should facilitate the use of real-time information with high granularity. The contributions, limits and perspectives highlight the interest of this work for the company and for management science.
Ren, Xiangnan. "Traitement et raisonnement distribués des flux RDF". Thesis, Paris Est, 2018. http://www.theses.fr/2018PESC1139/document.
Pełny tekst źródła
Real-time processing of data streams emanating from sensors is becoming a common task in industrial scenarios. In an Internet of Things (IoT) context, data are emitted from heterogeneous stream sources, i.e., coming from different domains and data models, which requires that IoT applications efficiently handle data integration mechanisms. The processing of RDF data streams has hence become an important research field. This trend enables a wide range of innovative applications where the real-time and reasoning aspects are pervasive. The key implementation goal of such applications consists in efficiently handling massive incoming data streams and supporting advanced data analytics services like anomaly detection. However, a modern RSP engine has to address the volume and velocity characteristics encountered in the Big Data era. In an on-going industrial project, we found that a 24/7 available stream processing engine usually faces massive data volumes, dynamically changing data structures and varying workload characteristics, facts that impact the engine's performance and reliability. To address these issues, we propose Strider, a hybrid adaptive distributed RDF Stream Processing engine that optimizes the logical query plan according to the state of the data streams. Strider has been designed to guarantee important industrial properties such as scalability, high availability, fault tolerance, high throughput and acceptable latency; these guarantees are obtained by designing the engine's architecture with state-of-the-art Apache components such as Spark and Kafka. Moreover, an increasing number of processing jobs executed over RSP engines require reasoning mechanisms, which usually comes at the cost of finding a trade-off between data throughput, latency and the computational cost of expressive inferences. Therefore, we extend Strider to support real-time RDFS+ (i.e., RDFS + owl:sameAs) reasoning capabilities. We combine Strider with a query-rewriting approach for SPARQL that benefits from an intelligent encoding of the knowledge base. The system is evaluated along different dimensions and over multiple datasets to emphasize its performance. Finally, we take a further step towards exploratory RDF stream reasoning with a fragment of Answer Set Programming. This part of our research work is mainly motivated by the fact that more and more streaming applications require more expressive and complex reasoning tasks. The main challenge is to cope with the large-volume and high-velocity dimensions in a scalable and inference-enabled manner. Recent efforts in this area are still missing the aspect of system scalability for stream reasoning. We therefore explore the ability of modern distributed computing frameworks to process highly expressive knowledge inference queries over Big Data streams. To do so, we consider queries expressed in a positive fragment of LARS (a temporal logic framework based on Answer Set Programming) and propose solutions to process such queries, based on the two main execution models adopted by major parallel and distributed execution frameworks: Bulk Synchronous Parallel (BSP) and Record-at-A-Time (RAT). We implement our solution, named BigSR, and conduct a series of evaluations. Our experiments show that BigSR achieves high throughput, beyond a million triples per second, using a rather small cluster of machines.
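As a toy illustration of the reasoning aspect (far simpler than Strider or BigSR), the sketch below materialises rdfs:subClassOf consequences over one micro-batch of triples; the vocabulary constants and sensor triples are invented for the example.

# Fixpoint materialisation of rdf:type consequences of rdfs:subClassOf
# axioms over one window (micro-batch) of an RDF stream.
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

def rdfs_window_closure(window, ontology):
    """window: set of (s, p, o) triples; ontology: subclass axioms."""
    superclasses = {}                          # class -> direct superclasses
    for s, p, o in ontology:
        if p == SUBCLASS:
            superclasses.setdefault(s, set()).add(o)
    inferred = set(window)
    frontier = list(window)
    while frontier:                            # saturate rdf:type triples
        s, p, o = frontier.pop()
        if p == TYPE:
            for sup in superclasses.get(o, ()):
                t = (s, TYPE, sup)
                if t not in inferred:
                    inferred.add(t)
                    frontier.append(t)
    return inferred

ontology = {(":TemperatureSensor", SUBCLASS, ":Sensor"),
            (":Sensor", SUBCLASS, ":Device")}
window = {(":s42", TYPE, ":TemperatureSensor"), (":s42", ":hasValue", "21.5")}
for triple in sorted(rdfs_window_closure(window, ontology)):
    print(triple)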
Blin, Lélia. "Algorithmes auto-stabilisants pour la construction d'arbres couvrants et la gestion d'entités autonomes". Habilitation à diriger des recherches, Université Pierre et Marie Curie - Paris VI, 2011. http://tel.archives-ouvertes.fr/tel-00847179.
Pełny tekst źródła
Nardecchia, Alessandro. "Chemometric exploration in hyperspectral imaging in the framework of big data and multimodality". Electronic Thesis or Diss., Université de Lille (2022-....), 2022. https://pepite-depot.univ-lille.fr/LIBRE/EDSMRE/2022/2022ULILR021.pdf.
Pełny tekst źródła
Nowadays, it is widely known that hyperspectral imaging is a very good tool in many chemistry-related research areas: it can be exploited for the study of samples of very different natures, whatever the spectroscopic technique used. Despite the very interesting characteristics of this kind of data, various limitations are potentially faced. First of all, modern instruments can generate a huge amount of data (big datasets). Furthermore, the fusion of different spectroscopic responses acquired on the same sample (multimodality) can be applied, leading to even more data to be analyzed. This can be a problem: without the right approach, it can be complicated to obtain satisfying results, or the analysis can even lead to a biased vision of the analytical reality of the sample. Moreover, spectral artifacts can be present in a dataset, and the correction of these imperfections has to be taken into account to obtain good outcomes. Another important challenge related to hyperspectral image analysis is that the simultaneous observation of spectral and spatial information is normally almost impossible, which clearly leads to an incomplete investigation of the sample of interest. Chemometrics is a modern branch of chemistry that is a perfect match for the current limitations of hyperspectral imaging. The purpose of this PhD work is to present a series of topics in which challenges related to hyperspectral images are overcome using different chemometric facets. In particular, as will be described, problems such as the generation of large amounts of data can be addressed using algorithms based on the selection of the purest information (i.e., SIMPLISMA) or on the creation of clusters in which similar components are grouped (i.e., KM clustering). In order to correct instrumental artifacts such as saturated signals, a methodology that exploits statistical imputation is used to recreate the missing information in a very elegant way and thus obtain signals that would otherwise be irremediably lost. A significant part of this thesis is devoted to the investigation of data acquired using LIBS imaging, a spectroscopic technique that is currently attracting increasing interest in many research areas but whose full potential has not yet been exploited through chemometric approaches. This manuscript presents a general pipeline focusing on the selection of the most important information in this kind of data cube (necessary because of the huge amount of spectral data that can easily be generated), in order to overcome some limitations faced during the analysis of this instrumental response. Furthermore, the same approach is exploited for data fusion analysis involving LIBS and other spectroscopic data. Lastly, an interesting use of the wavelet transform is shown, in order not to limit the analysis to spectral data only but to extend it to spatial information, and thus obtain a more complete chemical investigation.
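Assuming that KM refers to k-means, the sketch below illustrates the clustering step in its simplest form: the hyperspectral cube is unfolded into a pixel-by-spectrum matrix and similar spectra are grouped with scikit-learn's KMeans; the synthetic cube is invented for the example and is not LIBS data.

# Unfold a (rows x cols x wavelengths) cube into pixels x bands and cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
rows, cols, bands = 20, 20, 50
cube = rng.normal(0.0, 0.05, size=(rows, cols, bands))
cube[:, :10, 10] += 1.0        # one spectral peak in the left half
cube[:, 10:, 30] += 1.0        # a different peak in the right half

pixels = cube.reshape(-1, bands)               # (rows*cols) x bands
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pixels)
segmentation = labels.reshape(rows, cols)      # back to image geometry
print(segmentation[0, :5], segmentation[0, -5:])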
Laroche, Benjamin. "Le big data à l’épreuve du règlement européen général sur la protection des données". Thesis, Toulouse 1, 2020. http://www.theses.fr/2020TOU10041.
Pełny tekst źródła
Citizens' daily use of technologies in a digital society produces data at an exponential rate. In this context, the development of massive data collection appears inevitable. These technologies involve processing personal data in order to create economic value or to optimize business and decision-making processes. The General Data Protection Regulation (EU) 2016/679 (GDPR) aims to regulate these practices while respecting the imperatives of flexibility and technological neutrality. However, big data proves to be a legal issue of unprecedented complexity, as its specific characteristics run counter to several principles of the GDPR. Widely shared, this observation has gradually imposed an implicit form of status quo that does not allow the incompatibility between the reality of big data and the legal framework provided by the GDPR to be effectively resolved. In order to solve this equation, a distributive approach based on the components of big data (its structure, its data and its algorithmic capabilities) makes it possible to study the qualification of this notion and to identify an appropriate regime. Overcoming this problem first involves updating the qualification of personal data, in order to respond to the increasing complexity of data processing carried out with advanced algorithmic capabilities. In addition, the accountability of the various actors involved, in particular through joint responsibility for processing, is associated with the notion of risk in order to bring the necessary updates to the regulation of big data. Finally, the application of a data protection impact assessment methodology tests, and then synthesizes, the indispensable strengthening of the adequacy between legal theory and the practical reality of big data.
Chennen, Kirsley. "Maladies rares et "Big Data" : solutions bioinformatiques vers une analyse guidée par les connaissances : applications aux ciliopathies". Thesis, Strasbourg, 2016. http://www.theses.fr/2016STRAJ076/document.
Pełny tekst źródła
Over the last decade, biomedical research and medical practice have been revolutionized by the post-genomic era and the emergence of Big Data in biology. The field of rare diseases is characterized by scarcity, from the patients themselves to the domain knowledge. Nevertheless, rare diseases are of real interest, as the fundamental knowledge accumulated and the therapeutic solutions developed can also benefit common underlying disorders. This thesis focuses on the development of new bioinformatics solutions, integrating Big Data and Big Data-associated approaches, to improve the study of rare diseases. In particular, my work resulted in (i) the creation of PubAthena, a tool for recommending relevant literature updates, and (ii) the development of a tool for the analysis of exome datasets, VarScrut, which combines multi-level knowledge to improve the resolution rate.
Mondal, Kartick Chandra. "Algorithmes pour la fouille de données et la bio-informatique". Thesis, Nice, 2013. http://www.theses.fr/2013NICE4049.
Pełny tekst źródła
Knowledge pattern extraction is one of the major topics in the data mining and background knowledge integration domains. Among data mining techniques, association rule mining and bi-clustering are two major complementary tasks for these topics. These tasks have gained much importance in many domains in recent years; however, no approach had been proposed to perform them in a single process. This poses the problem of the resources required (memory, execution time and data accesses) to perform independent extractions, and of the unification of the different results. We propose an original approach for extracting different categories of knowledge patterns while using minimum resources. This approach is based on the frequent closed patterns theoretical framework and uses a novel suffix-tree based data structure to extract conceptual minimal representations of association rules, bi-clusters and classification rules. These patterns extend the classical frameworks of association rules, classification rules and bi-clusters, since the data objects supporting each pattern and the hierarchical relationships between patterns are also extracted. This approach was applied to the analysis of HIV-1 and human protein-protein interaction data. Analyzing such inter-species protein interactions is a recent major challenge in computational biology. Databases integrating heterogeneous interaction information and biological background knowledge on proteins have been constructed. Experimental results show that the proposed approach can efficiently process these databases and that the extracted conceptual patterns can help the understanding and analysis of the nature of relationships between interacting proteins.
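As a hedged illustration of the frequent closed patterns underlying the approach (the thesis relies on a suffix-tree based structure, not reproduced here), the brute-force sketch below enumerates itemsets and keeps those with no proper superset of equal support; the tiny transaction set is invented for the example.

# Naive enumeration of frequent closed itemsets: an itemset is closed when
# no proper superset has exactly the same support.
from itertools import combinations

def frequent_closed_itemsets(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    support = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            s = sum(1 for t in transactions if set(itemset) <= t)
            if s >= min_support:
                support[frozenset(itemset)] = s
    closed = {}
    for itemset, s in support.items():
        if not any(itemset < other and s == s2 for other, s2 in support.items()):
            closed[itemset] = s
    return closed

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
for itemset, s in frequent_closed_itemsets(transactions, min_support=2).items():
    print(sorted(itemset), s)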
Jain, Sheenam. "Big data management using artificial intelligence in the apparel supply chain : opportunities and challenges". Thesis, Lille 1, 2020. http://www.theses.fr/2020LIL1I051.
Pełny tekst źródła
Over the past decade, the apparel industry has seen several applications of big data and artificial intelligence (AI) to various business problems. With the increase in competition and in customer demands for the personalization of products and services, which can enhance brand experience and satisfaction, supply-chain managers in apparel firms are constantly looking for ways to improve their business strategies so as to bring speed and cost efficiency to their organizations. The big data management solutions presented in this thesis highlight opportunities for apparel firms to look into their supply chains, identify big data resources that may be valuable, rare and inimitable, and use them to create data-driven strategies and establish the dynamic capabilities needed to sustain their businesses in an uncertain environment. With the help of these data-driven strategies, apparel firms can produce garments smartly, providing customers with products that more closely meet their needs and thereby driving sustainable consumption and production practices. In this context, this thesis investigates whether apparel firms can improve their business operations by employing big data and AI and, in doing so, seeks big data management opportunities enabled by AI solutions. Firstly, the thesis identifies and classifies AI techniques that can be used at various stages of the supply chain to improve existing business operations. Secondly, it uses product-related data to create a classification model and design rules that open up opportunities for personalized recommendations or customization, enabling better shopping experiences for customers. Thirdly, drawing on evidence from the industry and the existing literature, it makes suggestions that may guide managers in developing data-driven strategies for improving customer satisfaction through personalized services. Finally, it shows the effectiveness of data-driven analytical solutions in sustaining competitive advantage through the data and knowledge already present within the apparel supply chain. More importantly, this thesis also contributes to the field by identifying specific opportunities for big data management using AI solutions, which can be a starting point for further research in the field of technology and management.