Theses on the topic "Big data – Computer analysis"
Haddad, Raja. « Apprentissage supervisé de données symboliques et l'adaptation aux données massives et distribuées ». Thesis, Paris Sciences et Lettres (ComUE), 2016. http://www.theses.fr/2016PSLED028/document.
This thesis proposes new supervised methods for Symbolic Data Analysis (SDA) and extends this domain to big data. We start by creating a supervised method called HistSyr that automatically converts continuous variables into the most discriminant histograms for classes of individuals. We also propose a new symbolic decision tree method that we call SyrTree. SyrTree accepts many types of input and target variables and can use all symbolic variables describing the target to construct the decision tree. Finally, we extend HistSyr to big data by creating a distributed method called CloudHistSyr. Using the Map/Reduce framework, CloudHistSyr creates the most discriminant histograms for data too big for HistSyr. We tested CloudHistSyr on Amazon Web Services. We show the efficiency of our method on simulated data and on actual car traffic data in Nantes. We conclude on the overall utility of CloudHistSyr which, through its results, allows the study of massive data using existing symbolic analysis methods.
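The abstract does not detail CloudHistSyr's construction, but the Map/Reduce pattern it relies on can be sketched minimally: each mapper computes per-partition histogram counts and a reducer merges them. The bin edges and data below are invented for illustration; this is not the thesis's actual algorithm.

```python
from bisect import bisect_right
from collections import Counter

def map_partition(values, bin_edges):
    """Map step: histogram counts of one data partition.
    Bin i covers [bin_edges[i], bin_edges[i+1]); out-of-range values are dropped."""
    counts = Counter()
    for v in values:
        i = bisect_right(bin_edges, v) - 1
        if 0 <= i < len(bin_edges) - 1:
            counts[i] += 1
    return counts

def reduce_counts(partial_counts):
    """Reduce step: merge per-partition counts into the global histogram."""
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

edges = [0.0, 10.0, 20.0, 30.0]
partitions = [[1.0, 12.0, 25.0], [5.0, 15.0], [29.0]]
histogram = reduce_counts(map_partition(p, edges) for p in partitions)
# each of the three bins ends up with a count of 2
```

Because histogram merging is associative and commutative, the reduce step can itself be distributed, which is what makes the scheme scale to data too big for a single machine.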
Adjout, Rehab Moufida. « Big Data : le nouvel enjeu de l'apprentissage à partir des données massives ». Thesis, Sorbonne Paris Cité, 2016. http://www.theses.fr/2016USPCD052.
In recent years we have witnessed tremendous growth in the volume of data generated, partly due to the continuous development of information technologies. Managing these amounts of data requires fundamental changes in the architecture of data management systems in order to adapt to large and complex data. Single machines do not have the capacity to process such massive data, which motivates the need for scalable solutions. This thesis focuses on building scalable data management systems for treating large amounts of data. Our objective is to study the scalability of supervised machine learning methods in large-scale scenarios. In fact, in most existing algorithms and data structures, there is a trade-off between efficiency, complexity, and scalability. To address these issues, we explore recent techniques for distributed learning in order to overcome the limitations of current learning algorithms. Our contribution consists of two new machine learning approaches for large-scale data. The first contribution tackles the scalability of multiple linear regression in distributed environments; it permits learning quickly from massive volumes of existing data, using parallel computing and a divide-and-conquer approach to provide the same coefficients as the classic approach. The second contribution introduces a new scalable approach for ensembles of models which allows both learning and pruning to be deployed in a distributed environment. Both approaches have been evaluated on a variety of regression datasets ranging from a few thousand to several million examples. The experimental results show that the proposed approaches are competitive in terms of predictive performance while significantly reducing training and prediction time.
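The claim that a divide-and-conquer scheme yields exactly the same coefficients as the classic approach can be illustrated with simple linear regression: each worker computes sufficient statistics on its chunk, and summing them reconstructs the exact normal equations. This is a minimal one-feature sketch, not the thesis's actual multiple-regression algorithm; the function names and data are invented.

```python
def chunk_stats(xs, ys):
    """Per-worker pass: sufficient statistics of y = a*x + b on one data chunk."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return n, sx, sy, sxx, sxy

def merge_and_solve(stats):
    """Driver: add the chunk statistics and solve the 2x2 normal equations exactly."""
    n = sx = sy = sxx = sxy = 0.0
    for cn, csx, csy, csxx, csxy in stats:
        n += cn; sx += csx; sy += csy; sxx += csxx; sxy += csxy
    det = n * sxx - sx * sx
    a = (n * sxy - sx * sy) / det    # slope
    b = (sxx * sy - sx * sxy) / det  # intercept
    return a, b

# two "workers", data drawn exactly from y = 2x + 1
chunks = [([0, 1, 2], [1, 3, 5]), ([3, 4], [7, 9])]
a, b = merge_and_solve(chunk_stats(x, y) for x, y in chunks)
# a == 2.0, b == 1.0 — identical to fitting all the data at once
```

Because the sufficient statistics are simple sums, the result is independent of how the data is partitioned, which is why the distributed fit is exact rather than approximate.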
Ledieu, Thibault. « Analyse et visualisation de trajectoires de soins par l’exploitation de données massives hospitalières pour la pharmacovigilance ». Thesis, Rennes 1, 2018. http://www.theses.fr/2018REN1B032/document.
The massification of health data is an opportunity to answer questions about vigilance and quality of care. In this thesis, we present approaches to exploit the diversity and volume of intra-hospital data for pharmacovigilance and for monitoring the proper use of drugs. This approach is based on the modelling of intra-hospital care trajectories adapted to the specific needs of pharmacovigilance. Using data from a hospital warehouse, it is necessary to characterize events of interest and identify a link between the administration of these health products and the occurrence of adverse reactions, or to look for cases of drug misuse. The hypothesis put forward in this thesis is that an interactive visual approach is suitable for the exploitation of these heterogeneous, multi-domain biomedical data in the field of pharmacovigilance. We developed two prototypes allowing the visualization and analysis of care trajectories. The first prototype is a tool for visualizing the patient file in the form of a timeline. The second is a tool for visualizing and searching a cohort of event sequences. The latter tool is based on the implementation of sequence analysis algorithms (Smith-Waterman, Apriori, GSP) to search for similarities or patterns of recurring events. These human-machine interfaces have been the subject of usability studies on use cases from actual practice, which have proven their potential for routine use.
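The abstract names Smith-Waterman for similarity search between event sequences. A minimal local-alignment score over event codes can be sketched as follows; the match/mismatch/gap weights and the two "care trajectories" are hypothetical, not taken from the thesis.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score between two event sequences."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            # local alignment: scores are floored at 0 so the best
            # matching subsequence is found anywhere in the sequences
            h[i][j] = max(0, h[i - 1][j - 1] + s, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best

# two hypothetical care trajectories as sequences of event codes
t1 = ["admit", "drugA", "drugB", "adverse", "discharge"]
t2 = ["admit", "drugA", "adverse", "discharge"]
score = smith_waterman(t1, t2)
# four matching events (score 8) minus one gap penalty for "drugB" → 7
```

Local alignment is a natural fit here because two patients' records rarely match end to end; only a shared sub-trajectory (e.g. drug administration followed by an adverse event) needs to align.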
El, Ouazzani Saïd. « Analyse des politiques publiques en matière d’adoption du cloud computing et du big data : une approche comparative des modèles français et marocain ». Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLE009/document.
Our research analyzes public policy on the adoption of cloud computing and big data by the French and Moroccan states, with a comparative approach between the two models. We cover these main areas: the impact of digital technology on the organization of states and governments; digital public policy in both France and Morocco; concepts related to data protection and data privacy; the limits between security, in particular homeland security, and civil liberties; the future and governance of the Internet; and a use case on how the cloud could change the daily work of a public administration. Our research aims to analyze how the public sector could be impacted by the current digital (r)evolution and how states could be changed by the emergence of a new model in the digital area called the Cyber-State. This term is a new concept and a new representation of the state in cyberspace. We analyze the digital transformation by looking at how public authorities treat new economic, security, and social issues and challenges, with cloud computing and big data as the key elements of the digital transformation. We also examine how the two states, France and Morocco, face new security challenges and how they fight terrorism, in particular in cyberspace. We study the recent adoption of new laws and legislation that aim to regulate digital activities, analyze the limits between security risks and civil liberties in the context of terrorist attacks, and analyze the concepts related to data privacy and data protection. Finally, we focus on the future of the Internet, the impacts on its current architecture, and the challenges of keeping it free and available as it is today.
Belghache, Elhadi. « AMAS4BigData : analyse dynamique de grandes masses de données par systèmes multi-agents adaptatifs ». Thesis, Toulouse 3, 2019. http://www.theses.fr/2019TOU30149.
Understanding data is the main purpose of data science, and how to achieve it is one of its central challenges, especially when dealing with big data. The big data era has brought new data processing and data management challenges. Existing state-of-the-art analytics tools now come close to handling these ongoing challenges and provide satisfactory results at reasonable cost. But the speed at which new data is generated, and the need to manage changes in data both in content and structure, lead to new rising challenges. This is especially true in the context of complex systems with strong dynamics, as in, for instance, large-scale ambient systems. One existing technology that has been shown to be particularly relevant for modeling, simulating and solving problems in complex systems is Multi-Agent Systems. The AMAS (Adaptive Multi-Agent Systems) theory proposes to solve complex problems, for which there is no known algorithmic solution, by self-organization. The cooperative behavior of the agents enables the system to self-adapt to a dynamic environment so as to maintain itself in an adequate functional state. In this thesis, we apply this theory to big data analytics. In order to find meaning and relevant information drowned in the data flood, while overcoming big data challenges, a novel analytic tool is needed, one able to continuously find relations between data, evaluate them, and detect their changes and evolution over time. The aim of this thesis is to present the AMAS4BigData analytics framework, based on Adaptive Multi-Agent Systems technology, which uses a new data similarity metric, the Dynamics Correlation, for dynamic data relation discovery and dynamic display. This framework is currently being applied in the neOCampus operation, the ambient campus of the University Toulouse III - Paul Sabatier.
Cantu, Alma. « Proposition de modes de visualisation et d'interaction innovants pour les grandes masses de données et/ou les données structurées complexes en prenant en compte les limitations perceptives des utilisateurs ». Thesis, Ecole nationale supérieure Mines-Télécom Atlantique Bretagne Pays de la Loire, 2018. http://www.theses.fr/2018IMTA0068/document.
As a result of improvements in data capture and storage, recent years have seen the amount of data to be processed increase dramatically. Many studies, ranging from automatic processing to information visualization, have been performed, but some areas are still too specific to benefit from them. This is the case of ELectromagnetic INTelligence (ELINT). This domain not only deals with a huge amount of data but also has to handle complex data and usage, as well as populations of users with less and less experience. In this thesis, we focus on the use of existing and new visualization technologies to propose solutions to combined issues such as huge amounts of data and complex data. We begin by presenting an analysis of the ELINT field, which makes it possible to extract the issues that it must face. We then focus on visual solutions handling combinations of such issues, but existing work does not directly contain such solutions. Therefore, we describe visual issues and propose a characterization of them. This characterization allows us to describe existing representations and to build a recommendation tool based on how existing work solves the issues. Finally, we focus on identifying new metaphors to complete existing work and propose an immersive representation to solve the issues of ELINT. These contributions make it possible to analyze and use existing work and to deepen the use of immersive representations for information visualization.
Soler, Maxime. « Réduction et comparaison de structures d'intérêt dans des jeux de données massifs par analyse topologique ». Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS364.
In this thesis, we propose different methods, based on topological data analysis, to address modern problems concerning the increasing difficulty of analyzing scientific data. In the case of scalar data defined on geometrical domains, extracting meaningful knowledge from static data, then time-varying data, then ensembles of time-varying data proves increasingly challenging. Our approaches for the reduction and analysis of such data are based on the idea of defining structures of interest in scalar fields as topological features. In a first effort to address data volume growth, we propose a new lossy compression scheme which offers strong topological guarantees, allowing topological features to be preserved through compression. The approach is shown to yield high compression factors in practice. Extensions are proposed to offer additional control over the geometrical error. We then target time-varying data by designing a new method for tracking topological features over time, based on topological metrics. We extend the metrics in order to overcome robustness and performance limitations, and propose a new efficient way to compute them, gaining orders-of-magnitude speedups over state-of-the-art approaches. Finally, we apply and adapt our methods to ensemble data related to reservoir simulation, for modeling viscous fingering in porous media. We show how to capture viscous fingers with topological features, adapt topological metrics to capture discrepancies between simulation runs and a ground truth, evaluate the proposed metrics with feedback from experts, and implement an in-situ ranking framework for rating the fidelity of simulation runs.
Liu, Rutian. « Semantic services for assisting users to augment data in the context of analytic data sources ». Electronic Thesis or Diss., Sorbonne université, 2020. http://www.theses.fr/2020SORUS208.
The production of analytic datasets is a significant big data trend and has gone well beyond the scope of traditional IT-governed dataset development. Analytic datasets are now created by data scientists and data analysts using big data frameworks and agile data preparation tools. However, it remains difficult for a data analyst to start from a dataset at hand and customize it with additional attributes coming from other existing datasets. This thesis presents a new solution for business users and data scientists who want to augment the schema of analytic datasets with attributes coming from other semantically related datasets. We introduce attribute graphs as a novel, concise and natural way to represent literal functional dependencies over hierarchical dimension level types, in order to infer unique dimension and fact table identifiers. We give formal definitions for schema augmentation, schema complement, and merge query in the context of analytic tables. We then introduce several reduction operations to enforce schema complements when schema augmentation yields a row multiplication in the augmented dataset. We define formal quality criteria and algorithms to control the correctness, non-ambiguity, and completeness of generated schema augmentations and schema complements. We describe the implementation of our solution as a REST service within the SAP HANA platform and provide a detailed description of our algorithms. We evaluate the performance of our algorithms for computing unique identifiers in dimension and fact tables, and analyze the effectiveness of our REST service using two application scenarios.
Baudin, Alexis. « Cliques statiques et temporelles : algorithmes d'énumération et de détection de communautés ». Electronic Thesis or Diss., Sorbonne université, 2023. http://www.theses.fr/2023SORUS609.
Graphs are mathematical objects used to model interactions or connections between entities of various types. A graph can represent, for example, a social network that connects users to each other, a transport network like the metro, where stations are connected to each other, or a brain with the billions of interacting neurons it contains. In recent years, the dynamic nature of these structures has been highlighted, as well as the importance of taking into account the temporal evolution of these networks to understand their functioning. While many concepts and algorithms have been developed on graphs to describe static network structures, much remains to be done to formalize and develop relevant algorithms to describe the dynamics of real networks. This thesis aims to better understand how massive graphs are structured in the real world, and to develop tools to extend our understanding to structures that evolve over time. It has been shown that these graphs have particular properties, which distinguish them from theoretical or randomly drawn graphs. Exploiting these properties then enables the design of algorithms that solve certain difficult problems much more quickly on these instances than in the general case. My PhD thesis focuses on cliques, which are groups of elements that are all connected to each other. We study the enumeration of cliques in static and temporal graphs and the community detection they enable. The communities of a graph are sets of vertices such that, within a community, the vertices interact strongly with each other and little with the rest of the graph. Their study helps in understanding the structural and functional properties of networks. We evaluate our algorithms on massive real-world graphs, opening up new perspectives for understanding interactions within these networks. We first work on graphs, without taking into account the temporal component of interactions. We begin by using the clique percolation method for community detection, highlighting its memory limitations, which prevent it from being applied to graphs that are too massive. By introducing an approximate problem-solving algorithm, we overcome this limitation. Next, we improve the enumeration of maximal cliques in the case of bipartite graphs. These correspond to interactions between groups of vertices of different types, e.g. links between people and viewed content, participation in events, etc. We then consider interactions that take place over time, using the link stream formalism. We seek to extend the algorithms presented in the first part, to exploit their advantages in the study of temporal interactions. We provide a new algorithm for enumerating maximal cliques in link streams, which is much more efficient than the state of the art on massive datasets. Finally, we focus on communities in link streams by clique percolation, developing an extension of the method used on graphs. The results show a significant improvement over the state of the art, and we analyze the communities obtained to provide relevant information on the organization of temporal interactions in link streams. My PhD work has provided new insights into the study of massive real-world networks, showing the importance of exploring the potential of graphs in a real-world context, and could contribute to the emergence of innovative solutions for the complex challenges of our modern society.
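The textbook baseline for maximal clique enumeration on static graphs is the Bron-Kerbosch algorithm; the thesis's algorithms (bipartite and link-stream variants) are more elaborate, but the classic recursion gives the flavor. The small test graph below is invented.

```python
def bron_kerbosch(r, p, x, adj, out):
    """Enumerate maximal cliques.
    r = vertices of the current clique, p = candidates that extend it,
    x = vertices already explored (prevents reporting non-maximal cliques)."""
    if not p and not x:
        out.append(sorted(r))  # r cannot be extended: it is a maximal clique
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p.remove(v)
        x.add(v)

# test graph: a triangle {0, 1, 2} plus a pendant edge 2-3
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cliques = []
bron_kerbosch(set(), set(adj), set(), adj, cliques)
# the maximal cliques are [0, 1, 2] and [2, 3]
```

The exponential worst case of this recursion is exactly why the structural properties of real-world graphs mentioned in the abstract matter: on sparse, clustered real networks the candidate sets `p` shrink quickly and enumeration stays tractable.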
Larroche, Corentin. « Network-wide intrusion detection through statistical analysis of event logs : an interaction-centric approach ». Electronic Thesis or Diss., Institut polytechnique de Paris, 2021. http://www.theses.fr/2021IPPAT041.
Event logs are structured records of all kinds of activities taking place in a computer network. In particular, malicious actions taken by intruders are likely to leave a trace in the logs, making this data source useful for security monitoring and intrusion detection. However, the considerable volume of real-world event logs makes them difficult to analyze. This limitation has motivated a fair amount of research on malicious behavior detection through statistical methods. This thesis addresses some of the challenges that currently hinder the use of this approach in realistic settings. First of all, building an abstract representation of the data is nontrivial: event logs are complex and multi-faceted, making it difficult to capture all the relevant information they contain in a simple mathematical object. We take an interaction-centric approach to event log representation, motivated by the intuition that malicious events can often be seen as unexpected interactions between entities (users, hosts, etc.). While this representation preserves critical information, it also makes statistical modelling difficult. We thus build an ad hoc model and design a suitable inference procedure, using elements of latent space modelling, Bayesian filtering and multi-task learning. Another key challenge in event log analysis is that benign events account for the vast majority of the data, including many unusual albeit legitimate events. Detecting individually anomalous events is thus not enough, and we also deal with spotting clusters of potentially malicious events. To that end, we leverage the concept of an event graph and recast event-wise anomaly scores as a noisy graph-structured signal. This allows us to use graph signal processing tools to improve the anomaly scores provided by statistical models. Finally, we propose scalable methods for anomalous cluster detection in node-valued signals defined over large graphs.
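One elementary graph-signal-processing step of the kind alluded to above is neighborhood averaging (a single low-pass filtering pass): scores of connected suspicious events reinforce each other, while an isolated spike is damped. This is a simplistic stand-in for the thesis's actual tools; the graph, scores, and mixing weight `alpha` are invented.

```python
def smooth_scores(scores, edges, alpha=0.5):
    """One low-pass filtering step on a graph signal: each node keeps
    alpha of its own anomaly score and averages in its neighbors' scores."""
    neighbors = {v: [] for v in scores}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    out = {}
    for v, s in scores.items():
        if neighbors[v]:
            avg = sum(scores[u] for u in neighbors[v]) / len(neighbors[v])
        else:
            avg = s  # isolated node: nothing to average with
        out[v] = alpha * s + (1 - alpha) * avg
    return out

# a, b, c form a linked cluster of suspicious events; d is a lone spike
scores = {"a": 0.9, "b": 0.8, "c": 0.9, "d": 0.9, "e": 0.1}
edges = [("a", "b"), ("b", "c"), ("d", "e")]
smoothed = smooth_scores(scores, edges)
# smoothed["b"] = 0.85 (high-score neighbors); smoothed["d"] = 0.5 (damped)
```

This illustrates why the graph view helps: a cluster of moderately anomalous, connected events ends up ranked above a single outlier, matching the intuition that intrusions leave correlated traces.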
Scholler, Rémy. « Analyse de données de signalisation mobile pour l’étude de la mobilité respectueuse de la vie privée : Application au secteur du transport routier de marchandises ». Electronic Thesis or Diss., Bourgogne Franche-Comté, 2024. http://www.theses.fr/2024UBFCD001.
Mobile network operators hold a significant data source derived from the communications of all connected objects (not just smartphones) with the network. These signaling data are a massive source of location data and are regularly used for mobility analysis. However, potential uses face two major challenges: their low spatiotemporal precision and their highly sensitive nature concerning privacy. In the first phase, the thesis work enhances the understanding of the mobility state (stationary or in motion), speed, and direction of movement of connected objects, and of the route they take on a transportation infrastructure (e.g., road or rail). In the second phase, we demonstrate how to ensure the confidentiality of continuously produced mobility statistics. The use of signaling data, whether related to users or to various connected objects, is legally regulated. For the study of mobility, operators tend to publish anonymized statistics (aggregated data). Specifically, the aim is to calculate complex and anonymized mobility statistics "on the fly" using differential privacy methods and probabilistic data structures (such as Bloom filters). Finally, in the third phase, we illustrate the potential of signaling data and of the approaches proposed in this manuscript for quasi-real-time calculation of anonymous statistics on road freight transport. This is, however, just one example of what could apply to other analyses of population behaviors and activities with significant public and economic policy implications.
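The simplest differential privacy building block for the aggregated statistics mentioned above is the Laplace mechanism: a count query has sensitivity 1, so adding Laplace(1/ε) noise makes its release ε-differentially private. A minimal sketch (the ε value and count are illustrative, and the thesis combines this with probabilistic structures such as Bloom filters, not shown here):

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise by inverse transform sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(true_count, epsilon, rng):
    """epsilon-differentially-private release of a count:
    a count has sensitivity 1, so Laplace(1/epsilon) noise suffices."""
    return true_count + laplace_noise(1.0 / epsilon, rng)

# each single release is noisy, but the noise is zero-mean, so the
# average over many releases stays close to the true count
rng = random.Random(42)
releases = [private_count(100, 1.0, rng) for _ in range(20000)]
mean_release = sum(releases) / len(releases)
```

The scale 1/ε makes the privacy/utility trade-off explicit: a smaller ε (stronger privacy) injects wider noise into every published mobility statistic.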
Aussel, Nicolas. « Real-time anomaly detection with in-flight data : streaming anomaly detection with heterogeneous communicating agents ». Electronic Thesis or Diss., Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLL007.
With the rise in the number of sensors and actuators in an aircraft, and the development of reliable data links from the aircraft to the ground, it becomes possible to improve aircraft security and maintainability by applying real-time analysis techniques. However, given the limited availability of on-board computing and the high cost of the data links, current architectural solutions cannot fully leverage all the available resources, limiting their accuracy. Our goal is to provide a distributed algorithm for failure prediction that could be executed both on board the aircraft and on a ground station, and that would produce on-board failure predictions in near real time under a communication budget. In this approach, the ground station holds fast computation resources and historical data, while the aircraft holds limited computational resources and the current flight's data. In this thesis, we study the specificities of aeronautical data and the methods that already exist to produce failure predictions from them, and we propose a solution to the stated problem. Our contribution is detailed in three main parts. First, we study the problem of rare event prediction created by the high reliability of aeronautical systems. Many learning methods for classifiers rely on balanced datasets. Several approaches exist to correct dataset imbalance, and we study their efficiency on extremely imbalanced datasets. Second, we study the problem of log parsing, as many aeronautical systems do not produce easy-to-classify labels or numerical values, but log messages in full text. We study existing methods, based on a statistical approach and on deep learning, to convert full-text log messages into a form usable as input by learning algorithms for classifiers. We then propose our own method, based on natural language processing, and show how it outperforms the other approaches on a public benchmark. Last, we offer a solution to the stated problem by proposing a new distributed learning algorithm that relies on two existing learning paradigms, Active Learning and Federated Learning. We detail our algorithm and its implementation, and provide a comparison of its performance with existing methods.
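Among the imbalance corrections the first part refers to, the simplest baseline is random minority oversampling: duplicate rare-class examples until the classes are balanced. The thesis evaluates several such approaches; this sketch shows only the baseline, with invented "flight" samples.

```python
import random

def oversample_minority(samples, labels, rng):
    """Duplicate minority-class examples at random until classes are balanced."""
    by_label = {}
    for s, l in zip(samples, labels):
        by_label.setdefault(l, []).append(s)
    target = max(len(group) for group in by_label.values())
    out_s, out_l = [], []
    for l, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for s in group + extra:
            out_s.append(s)
            out_l.append(l)
    return out_s, out_l

# 6 normal flights vs 2 failures: an extreme imbalance in miniature
X = ["f1", "f2", "f3", "f4", "f5", "f6", "a1", "a2"]
y = [0, 0, 0, 0, 0, 0, 1, 1]
Xb, yb = oversample_minority(X, y, random.Random(0))
# yb now contains six 0s and six 1s
```

On extremely imbalanced data, plain duplication risks overfitting the few failure cases, which is one reason the thesis compares this baseline against more elaborate corrections.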
Ben, Abdallah Emna. « Étude de la dynamique des réseaux biologiques : apprentissage des modèles, intégration des données temporelles et analyse formelle des propriétés dynamiques ». Thesis, Ecole centrale de Nantes, 2017. http://www.theses.fr/2017ECDN0041.
Over the last few decades, the emergence of a wide range of new technologies has produced a massive amount of biological data (genomics, proteomics...). A very large amount of time series data is thus now produced every day. The newly produced data can give us new insights into the behavior of biological systems, and leads to considerable developments in the field of bioinformatics. This justifies the motivation to develop efficient methods for learning Biological Regulatory Networks (BRNs) modeling a biological system from its time series data. Then, in order to understand the nature of system functions, we study, in this thesis, the dynamics of these BRN models. Indeed, we focus on developing original and scalable logical methods (implemented in Answer Set Programming) for deciphering the emerging complexity of the dynamics of biological systems. The main contributions of this thesis are the following. (i) Refining the dynamics of BRNs, modeled with the Automata Network (AN) formalism, by integrating a temporal parameter (delay) in the local transitions of the automata; we call the extended formalism a Timed Automata Network (T-AN). This integration allows the parametrization of the transitions between the local states of each automaton as well as between the global states of the network. (ii) Learning BRNs modeling biological systems from their time series data. (iii) Model checking of discrete dynamical properties of BRNs (modeled with AN and T-AN) by formal dynamical analysis: attractor identification (minimal trap domains from which the network cannot escape) and reachability verification of an objective from a global initial state of the network.
Hannou, Fatma-Zohra. « A Pattern Model and Algebra for Representing and Querying Relative Information Completeness ». Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS110.
Information incompleteness is a major data quality issue, amplified by the increasing amount of data collected from unreliable sources. Assessing the completeness of data is crucial for determining the quality of the data and the validity of query answers. In this work, we tackle the issue of extracting and reasoning about complete and missing information in the relative information completeness setting, where the completeness of a dataset is assessed with respect to a complete reference dataset. We advance the field with two contributions: a pattern model for providing minimal covers summarizing the extent of complete and missing data partitions, and a pattern algebra for deriving minimal pattern covers for query answers to analyze their validity. The completeness pattern framework presents an intriguing opportunity for many applications, particularly those aiming at improving the quality of tasks impacted by missing data. Data imputation is a well-known technique for repairing missing data values but can incur a prohibitive cost when applied to large datasets. Query-driven imputation offers a better alternative, as it allows imputation to be limited to the data relevant to queries. We adopt a rule-based query rewriting technique for imputing the answers of aggregation queries that are missing or suffer from incorrectness due to data incompleteness, and present a novel query rewriting mechanism guided by the completeness pattern model and algebra. We also investigate the generalization of our pattern model to summarizing arbitrary data fragments. Summaries can be queried to analyze and compare data fragments in a synthetic and flexible way.
Debaere, Steven. « Proactive inferior member participation management in innovation communities ». Thesis, Lille, 2018. http://www.theses.fr/2018LIL1A012.
Nowadays, companies increasingly recognize the benefits of innovation communities (ICs) for injecting external consumer knowledge into innovation processes. Despite the advantages of ICs, guaranteeing their viability poses two important challenges. First, ICs are big data environments that can quickly overwhelm community managers as members communicate through posts, thereby creating substantial (volume), rapidly expanding (velocity), and unstructured data that might encompass combinations of linguistic, video, image, and audio cues (variety). Second, most online communities fail to generate successful outcomes, as they are often unable to derive value from individual IC members owing to members' inferior participation. This doctoral dissertation leverages customer relationship management strategies to tackle these challenges and adds value by introducing a proactive inferior-member-participation management framework that lets community managers proactively reduce inferior member participation while effectively dealing with the data-rich IC environment. It proves that inferior member participation can be identified proactively by analyzing community actors' writing style. It shows that dependencies between members' participation behaviour can be exploited to improve prediction performance. Using a field experiment, it demonstrates that a proactive targeted email campaign can effectively reduce inferior member participation.
Chen, Longbiao. « Big data-driven optimization in transportation and communication networks ». Electronic Thesis or Diss., Sorbonne université, 2018. https://accesdistant.sorbonne-universite.fr/login?url=https://theses-intra.sorbonne-universite.fr/2018SORUS393.pdf.
Texte intégralThe evolution of metropolitan structures and the development of urban systems have created various kinds of urban networks, among which two types of networks are of great importance for our daily life, the transportation networks corresponding to human mobility in the physical space, and the communication networks supporting human interactions in the digital space. The rapid expansion in the scope and scale of these two networks raises a series of fundamental research questions on how to optimize these networks for their users. Some of the major objectives include demand responsiveness, anomaly awareness, cost effectiveness, energy efficiency, and service quality. Despite the distinct design intentions and implementation technologies, both the transportation and communication networks share common fundamental structures, and exhibit similar spatio-temporal dynamics. Correspondingly, there exists an array of key challenges that are common in the optimization in both networks, including network profiling, mobility prediction, traffic clustering, and resource allocation. To achieve the optimization objectives and address the research challenges, various analytical models, optimization algorithms, and simulation systems have been proposed and extensively studied across multiple disciplines. Generally, these simulation-based models are not evaluated in real-world networks, which may lead to sub-optimal results in deployment. With the emergence of ubiquitous sensing, communication and computing diagrams, a massive number of urban network data can be collected. Recent advances in big data analytics techniques have provided researchers great potentials to understand these data. Motivated by this trend, we aim to explore a new big data-driven network optimization paradigm, in which we address the above-mentioned research challenges by applying state-of-the-art data analytics methods to achieve network optimization goals. 
Following this research direction, in this dissertation we propose two data-driven algorithms, for network traffic clustering and user mobility prediction, and apply them to real-world optimization tasks in the transportation and communication networks. First, by analyzing large-scale traffic datasets from both networks, we propose a graph-based traffic clustering algorithm to better understand traffic similarities and variations across different areas and time periods. On this basis, we apply the traffic clustering algorithm to the following two network optimization applications. 1. Dynamic traffic clustering for demand-responsive bikeshare networks. In this application, we dynamically cluster bike stations with similar usage patterns to obtain stable and predictable cluster-wise bike traffic demands, so as to foresee over-demand stations in the network and enable demand-responsive bike scheduling. Evaluation results using real-world data from New York City and Washington, D.C. show that our framework accurately foresees over-demand clusters (e.g. with 0.882 precision and 0.938 recall in NYC), and significantly outperforms baseline methods. 2. Complementary traffic clustering for cost-effective C-RAN. In this application, we cluster RRHs with complementary traffic patterns (e.g., an RRH in a residential area and an RRH in a business district) to reuse the total capacity of the BBUs, so as to reduce the overall deployment cost. We evaluate our framework with real-world network data collected from the city of Milan, Italy and the province of Trentino, Italy. Results show that our method effectively reduces the overall deployment cost to 48.4% and 51.7% of the traditional RAN architecture in the two datasets, respectively, and consistently outperforms baseline methods. Second, by analyzing large-scale user mobility datasets from both networks, we propose [...]
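The cluster-wise demand idea can be illustrated with a deliberately simplified sketch. The thesis's algorithm is graph-based; this greedy grouping of stations by correlation of their (hypothetical) hourly demand profiles is only an illustration of the general technique:

```python
# Illustrative sketch: group stations whose demand profiles are strongly
# correlated, in the spirit of (but much simpler than) graph-based clustering.

def pearson(x, y):
    """Pearson correlation between two equal-length demand profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def cluster_stations(profiles, threshold=0.9):
    """Greedily assign each station to the first cluster whose seed profile
    correlates above `threshold`; otherwise start a new cluster."""
    clusters = []  # list of (seed_profile, [station_ids])
    for sid, prof in profiles.items():
        for seed, members in clusters:
            if pearson(seed, prof) >= threshold:
                members.append(sid)
                break
        else:
            clusters.append((prof, [sid]))
    return [members for _, members in clusters]
```

Stations with proportional demand curves end up in the same cluster, whose aggregate demand is more stable than any single station's.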
Aussel, Nicolas. « Real-time anomaly detection with in-flight data : streaming anomaly detection with heterogeneous communicating agents ». Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLL007/document.
Texte intégralWith the rise of the number of sensors and actuators in an aircraft and the development of reliable data links from the aircraft to the ground, it becomes possible to improve aircraft safety and maintainability by applying real-time analysis techniques. However, given the limited availability of on-board computing and the high cost of the data links, current architectural solutions cannot fully leverage all the available resources, limiting their accuracy. Our goal is to provide a distributed algorithm for failure prediction that could be executed both on board the aircraft and on a ground station, and that would produce on-board failure predictions in near real-time under a communication budget. In this approach, the ground station would hold fast computation resources and historical data, while the aircraft would hold limited computational resources and the current flight's data. In this thesis, we study the specificities of aeronautical data and the methods that already exist to produce failure predictions from them, and propose a solution to the stated problem. Our contribution is detailed in three main parts. First, we study the problem of rare event prediction created by the high reliability of aeronautical systems. Many learning methods for classifiers rely on balanced datasets. Several approaches exist to correct a dataset imbalance, and we study their efficiency on extremely imbalanced datasets. Second, we study the problem of log parsing, as many aeronautical systems do not produce easy-to-classify labels or numerical values but full-text log messages. We study existing methods, based on statistical approaches and on Deep Learning, to convert full-text log messages into a form usable as input by learning algorithms for classifiers.
We then propose our own method based on Natural Language Processing and show how it outperforms the other approaches on a public benchmark. Last, we offer a solution to the stated problem by proposing a new distributed learning algorithm that relies on two existing learning paradigms: Active Learning and Federated Learning. We detail our algorithm and its implementation, and provide a comparison of its performance with existing methods
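One of the standard imbalance corrections alluded to above is random oversampling of the minority class. The sketch below is a generic baseline, not the thesis's method:

```python
import random

def oversample_minority(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes have
    as many samples as the largest class (a common baseline for extreme
    class imbalance; illustrative, not the thesis's approach)."""
    rng = random.Random(seed)
    by_label = {}
    for s, l in zip(samples, labels):
        by_label.setdefault(l, []).append(s)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for s in group + extra:
            out_samples.append(s)
            out_labels.append(label)
    return out_samples, out_labels
```

On extremely imbalanced data this merely replays the few rare events many times, which is exactly why the thesis evaluates how such corrections degrade in that regime.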
Caigny, Arno de. « Innovation in customer scoring for the financial services industry ». Thesis, Lille, 2019. http://www.theses.fr/2019LIL1A011.
Texte intégralThis dissertation improves customer scoring. Customer scoring is important for companies in their decision-making processes because it helps to solve key managerial issues, such as deciding which customers to target for a marketing campaign or assessing which customers are likely to leave the company. The research in this dissertation makes several contributions in three areas of the customer scoring literature. First, new sources of data are used to score customers. Second, the methodology for going from data to decisions is improved. Third, customer life event prediction is proposed as a new application of customer scoring
Audebert, Nicolas. « Classification de données massives de télédétection ». Thesis, Lorient, 2018. http://www.theses.fr/2018LORIS502/document.
Texte intégralThanks to high-resolution imaging systems and the multiplication of data sources, earth observation (EO) with satellite or aerial images has entered the age of big data. This allows the development of new applications (EO data mining, large-scale land-use classification, etc.) and the use of tools from information retrieval, statistical learning and computer vision that were not possible before due to the lack of data. This project is about designing an efficient classification scheme that can benefit from very high resolution and large datasets (if possible labelled) for creating thematic maps. Targeted applications include urban land use, geology and vegetation for industrial purposes. The PhD thesis objective is to develop new statistical tools for the classification of aerial and satellite images. Beyond state-of-the-art approaches that combine a local spatial characterization of the image content and supervised learning, machine learning approaches that benefit from large labeled training datasets, such as Deep Neural Networks, will be particularly investigated. The main issues are (a) structured prediction (how to incorporate knowledge about the underlying spatial and contextual structure), (b) data fusion from various sensors (how to merge heterogeneous data such as SAR, hyperspectral and Lidar into the learning process?), (c) physical plausibility of the analysis (how to include prior physical knowledge in the classifier?) and (d) scalability (how to make the proposed solutions tractable in the presence of Big Remote Sensing Data?)
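For issue (b), one naive baseline, not the approach investigated in the thesis, is late fusion: each sensor produces its own class prediction and the final label is a majority vote:

```python
from collections import Counter

def late_fusion(predictions):
    """Late fusion of per-sensor class predictions by majority vote
    (a naive baseline for heterogeneous-sensor fusion; illustrative)."""
    return Counter(predictions).most_common(1)[0][0]
```

More powerful fusion schemes learn the combination jointly instead of voting after the fact.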
Marty, Philippe. « Etalonnages de l'instrument EPIC du satellite XMM-Newton : observations d'amas de galaxies en rayons-X ». Paris 11, 2003. https://tel.archives-ouvertes.fr/tel-00141571.
Texte intégralThe XMM-Newton satellite is one of the four cornerstones on which the European Space Agency based its sky exploration program; it is aimed at opening the X-ray window further and mapping the high-energy populations of the Galaxy as well as of the deep Universe. In the first part, I review the current main topics of high-energy astrophysics, such as observations of clusters of galaxies, and summarize the needs that brought forward such an X-ray space observatory. A description of the XMM-Newton X-ray telescopes is then presented in the second part, as detailed as needed for what follows. I explain in the third part how the ground calibration campaigns of the EPIC instruments were conducted at the synchrotron test facility in Orsay, and the analysis of the resulting data. In the fourth part, ground calibration results are compared to some in-flight calibration datasets, and methods for analysing data from observations of extended sources (such as clusters of galaxies) are extensively described. Finally, my conclusions regarding future extended-source observations with XMM-Newton and relevant data analysis strategies are drawn in the light of general perspectives in X-ray instrumentation as well as the development of Virtual Observatories
Dia, Amadou Fall. « Filtrage sémantique et gestion distribuée de flux de données massives ». Electronic Thesis or Diss., Sorbonne université, 2018. http://www.theses.fr/2018SORUS495.
Texte intégralOur daily use of the Internet and related technologies generates, at rapid and variable speeds, large volumes of heterogeneous data issued from sensor networks, search engine logs, multimedia content sites, weather forecasting, geolocation, Internet of Things (IoT) applications, etc. Processing such data in conventional databases (Relational Database Management Systems) may be very expensive in terms of time and memory resources. To respond effectively to the needs of rapid decision-making, these streams require real-time processing. Data Stream Management Systems (DSMSs) evaluate queries on the recent data of a stream within structures called windows. The input data come in different formats, such as CSV, XML, RSS, or JSON. This heterogeneity barrier stems from the nature of data streams and must be resolved. To this end, several research groups have drawn on the advantages of semantic web technologies (RDF and SPARQL) and proposed RDF stream processing systems called RSPs. However, large volumes of RDF data, high input rates, concurrent queries, the combination of RDF streams with large volumes of stored RDF data, and expensive processing drastically reduce the performance of these systems. A new approach is required to considerably reduce the processing load of RDF data streams. In this thesis, we propose several complementary solutions to reduce the processing load in a centralized environment. An on-the-fly sampling approach for RDF graph streams is proposed to reduce data and processing load while preserving semantic links. This approach is deepened by adopting a graph-oriented summarization approach to extract the most relevant information from RDF graphs, using centrality measures from Social Network Analysis. We also adopt a compressed format for RDF data and propose an approach for querying compressed RDF data without a decompression phase.
To ensure parallel and distributed data stream management, this work also proposes two solutions for reducing the processing load in a distributed environment: an engine and parallel processing approaches for distributed RDF graph streams. Finally, an optimized processing approach for combining static and dynamic data is also integrated into a new distributed RDF graph stream management system
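The on-the-fly sampling idea can be illustrated, in a much simpler form than the semantic-link-preserving approach above, by classic reservoir sampling over a stream of triples:

```python
import random

def reservoir_sample(triple_stream, k, seed=0):
    """Keep a uniform random sample of k triples from an unbounded stream
    (textbook reservoir sampling; unlike the thesis's graph-aware sampling,
    this sketch makes no attempt to preserve semantic links)."""
    rng = random.Random(seed)
    reservoir = []
    for i, triple in enumerate(triple_stream):
        if i < k:
            reservoir.append(triple)
        else:
            j = rng.randint(0, i)  # each triple kept with probability k/(i+1)
            if j < k:
                reservoir[j] = triple
    return reservoir
```

The memory footprint stays at k triples no matter how long the stream runs, which is the point of sampling before any expensive SPARQL evaluation.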
Rebecq, Antoine. « Méthodes de sondage pour les données massives ». Thesis, Paris 10, 2019. http://www.theses.fr/2019PA100014/document.
Texte intégralThis thesis presents three parts tied to survey sampling theory. In the first part, we present two original results that led to practical applications in surveys conducted at Insee (the French official statistics institute). The first chapter deals with allocations in stratified sampling. We present a theorem that proves the existence of an optimal compromise between the dispersion of the sampling weights and the allocation yielding optimal precision for a specific variable of interest. Survey data are commonly used to compute estimates for variables that were not included in the survey design; expected precision is then poor, but a low dispersion of the weights limits the risk of very high variance for one or several estimates. The second chapter deals with reweighting factors in calibration estimates. We study an algorithm that computes the minimal bounds for which the calibration estimators exist, and propose an efficient resolution method. We also study the statistical properties of estimates using these minimal bounds. The second part studies asymptotic properties of sampling estimates. Obtaining asymptotic guarantees is often hard in practice. We present an original method that establishes weak convergence of the Horvitz-Thompson empirical process, indexed by a class of functions, for many sampling algorithms used in practice. In the third and last part, we focus on sampling methods for populations that can be described as networks. These have many applications when graphs are so big that storing them and running algorithms on them is very costly. Two applications are presented, one using Twitter data, and the other using simulated data to establish guidelines for designing efficient sampling designs for graphs
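Two textbook building blocks behind these chapters, Neyman's optimal allocation and the Horvitz-Thompson estimator, can be sketched as follows (illustrative classical formulas, not the thesis's compromise allocation):

```python
def neyman_allocation(stratum_sizes, stratum_sds, n):
    """Neyman allocation: the sample size n_h of stratum h is proportional
    to N_h * S_h, where N_h is the stratum size and S_h its standard
    deviation for the variable of interest."""
    weights = [N * S for N, S in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [round(n * w / total) for w in weights]

def horvitz_thompson(values, inclusion_probs):
    """Horvitz-Thompson estimator of a population total: each sampled value
    is weighted by the inverse of its inclusion probability."""
    return sum(y / p for y, p in zip(values, inclusion_probs))
```

Neyman allocation is optimal for one target variable; the dispersion of the resulting weights is exactly what the thesis's compromise trades off against.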
Pageau, Jasmine. « Choix occupationnels et espérance de vie : une analyse par l'approche des données massives ». Master's thesis, Université Laval, 2019. http://hdl.handle.net/20.500.11794/33867.
Texte intégralIn this thesis, we assess the impact of occupational choice on life expectancy using machine learning techniques. We use Conditional Inference Trees (CTree) to obtain Kaplan-Meier survival curves that enable us to predict mortality rates with respect to influential sociodemographic features. Using the Québec and Ontario data from the 1991 census merged with the Canadian Mortality Database from 1991 to 2006, we observe a correlation between occupational choice and life expectancy for particular groups. As expected, we find that the primary predictor of life expectancy is the person's sex. Education and Canadian-born status are the most influential variables for men and women of both provinces, respectively
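The Kaplan-Meier curves mentioned above can be computed with a short product-limit sketch (a generic estimator on toy data; the thesis builds such curves inside CTree leaves):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimator from (time, event) pairs, where
    event=1 is a death and event=0 a censored observation. Returns the
    survival probability S(t) after each death time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    s = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = removed = 0
        while i < len(data) and data[i][0] == t:  # all subjects leaving at t
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            s *= 1 - deaths / n_at_risk
            curve.append((t, s))
        n_at_risk -= removed
    return curve
```

Censored subjects reduce the risk set without dropping the curve, which is what makes the estimator usable on linked census-mortality data where most people are still alive at the end of follow-up.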
El, Malki Mohammed. « Modélisation NoSQL des entrepôts de données multidimensionnelles massives ». Thesis, Toulouse 2, 2016. http://www.theses.fr/2016TOU20139/document.
Texte intégralDecision support systems play a major role in companies and large organizations, enabling analyses dedicated to decision making. With the advent of big data, the volume of analyzed data reaches critical sizes, challenging conventional approaches to data warehousing, for which current solutions are mainly based on R-OLAP databases. With the emergence of major Web platforms such as Google, Facebook, Twitter or Amazon, many solutions to process big data have been developed, referred to as "Not Only SQL". These new approaches are an interesting attempt to build multidimensional data warehouses capable of handling large volumes of data. Questioning the R-OLAP approach requires revisiting the principles of modeling multidimensional data warehouses. In this manuscript, we propose implementation processes for multidimensional data warehouses with NoSQL models. We define four processes for each of two models: a column-oriented NoSQL model and a document-oriented model. Each of these processes favors a specific treatment. Moreover, the NoSQL context adds complexity to the computation of effective pre-aggregates that are typically set up within the R-OLAP context (the lattice). We have extended our implementation processes to take into account the construction of the lattice in both of the retained models. As it is difficult to choose a single NoSQL implementation that supports all the applicable treatments effectively, we propose two translation processes. The first concerns intra-model processes, i.e., conversion rules from one implementation to another within the same NoSQL logical model, while the second defines the transformation rules from an implementation of one logical model to an implementation of another logical model
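As a toy illustration of a document-oriented mapping, a fact and its dimensions can be embedded in a single document. The field names below are hypothetical, not taken from the thesis's models:

```python
import json

# Hypothetical document-oriented mapping of one warehouse fact: the fact's
# measures and its dimension attributes embedded in a single document
# (one of several possible mappings; all names are illustrative).
sale = {
    "measures": {"quantity": 3, "amount": 42.0},
    "dimensions": {
        "date": {"day": 15, "month": 3, "year": 2016},
        "product": {"name": "widget", "category": "tools"},
        "store": {"city": "Toulouse", "country": "France"},
    },
}

doc = json.dumps(sale)  # the serialized form a document store would hold
```

Embedding trades storage redundancy for query locality; a column-oriented mapping would instead flatten the same attributes into column families.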
Collet, Julien. « Exploration of parallel graph-processing algorithms on distributed architectures ». Thesis, Compiègne, 2017. http://www.theses.fr/2017COMP2391/document.
Texte intégralWith the advent of ever-increasing graph datasets in a large number of domains, parallel graph-processing applications deployed on distributed architectures are more and more needed to cope with the growing demand for memory and compute resources. Though large-scale distributed architectures are available, notably in the High-Performance Computing (HPC) domain, the programming and deployment complexity of such graph-processing algorithms, whose parallelization and complexity are highly data-dependent, hampers usability. Moreover, the difficulty of evaluating the performance behavior of these applications complicates the assessment of the relevance of the architecture used. With this in mind, this thesis deals with the exploration of graph-processing algorithms on distributed architectures, notably using GraphLab, a state-of-the-art graph-processing framework. Two use-cases are considered. For each, a parallel implementation is proposed and deployed on several distributed architectures of varying scales. This study highlights operating ranges, which can be leveraged to select a relevant operating point with respect to the datasets processed and the cluster nodes used. A further study enables a performance comparison of commodity cluster architectures and higher-end compute servers using the two use-cases previously developed. It highlights the particular relevance, in this applicative context, of clustered commodity workstations, which are considerably cheaper and simpler with respect to node architecture than higher-end systems. This thesis then explores how performance studies help in cluster design for graph-processing. In particular, studying the throughput performance of a graph-processing system gives fruitful insights for further node architecture improvements.
Moreover, this work shows that a more in-depth performance analysis can lead to guidelines for appropriately sizing a cluster for a given workload, paving the way toward resource allocation for graph-processing. Finally, hardware improvements for next generations of graph-processing servers are proposed and evaluated. A flash-based victim-swap mechanism is proposed to mitigate unwanted overload operations. Then, the relevance of ARM-based microservers for graph-processing is investigated with a port of GraphLab to an NVIDIA TX2-based architecture
Bouhamoum, Redouane. « Découverte automatique de schéma pour les données irrégulières et massives ». Electronic Thesis or Diss., université Paris-Saclay, 2021. http://www.theses.fr/2021UPASG081.
Texte intégralThe web of data is a huge global data space, relying on semantic web technologies, where a high number of sources are published and interlinked. This data space provides an unprecedented amount of knowledge available for novel applications, but the meaningful usage of its sources is often difficult due to the lack of schemas describing the content of these data sources. Several automatic schema discovery approaches have been proposed; while they provide good quality schemas, their use for massive data sources is a challenge as they rely on costly algorithms. In our work, we are interested in both the scalability and the incrementality of schema discovery approaches for RDF data sources where the schema is incomplete or missing. Furthermore, we extend schema discovery to take into account not only the explicit information provided by a data source, but also the implicit information that can be inferred. Our first contribution is a scalable schema discovery approach which extracts the classes describing the content of a massive RDF data source. We propose to extract a condensed representation of the source, which is used as input to the schema discovery process in order to improve its performance. This representation is a set of patterns, each one representing a combination of properties describing some entities in the dataset. We also propose a scalable schema discovery approach relying on a distributed clustering algorithm that forms groups of structurally similar entities representing the classes of the schema. Our second contribution aims at keeping the generated schema consistent with the data source it describes, as the latter may evolve over time.
We propose an incremental schema discovery approach that modifies the set of extracted classes by propagating the changes occurring at the source, in order to keep the schema consistent with its evolution. Finally, the goal of our third contribution is to extend schema discovery to consider the whole semantics expressed by a data source, represented not only by the explicitly declared triples, but also by those that can be inferred through reasoning. We propose an extension allowing all the properties of an entity, represented by either explicit or implicit triples, to be taken into account during schema discovery, which improves the quality of the generated schema
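The condensed pattern representation described above can be illustrated by a minimal sketch that maps each entity to its property set and counts the distinct combinations (illustrative only, far simpler than the thesis's distributed construction):

```python
def extract_patterns(triples):
    """From (subject, property, object) triples, build the property set of
    each subject, then count each distinct property combination (pattern).
    Clustering can then run on the patterns instead of the raw entities."""
    props_by_subject = {}
    for s, p, _ in triples:
        props_by_subject.setdefault(s, set()).add(p)
    patterns = {}
    for props in props_by_subject.values():
        key = frozenset(props)
        patterns[key] = patterns.get(key, 0) + 1
    return patterns
```

Because many entities share the same property combination, the pattern set is typically orders of magnitude smaller than the entity set, which is what makes the subsequent clustering tractable.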
Alshaer, Mohammad. « An Efficient Framework for Processing and Analyzing Unstructured Text to Discover Delivery Delay and Optimization of Route Planning in Realtime ». Thesis, Lyon, 2019. http://www.theses.fr/2019LYSE1105/document.
Texte intégralThe Internet of Things (IoT) is leading to a paradigm shift within the logistics industry. The advent of IoT has been changing the logistics service management ecosystem. Logistics service providers today use sensor technologies such as GPS or telemetry to collect data in real-time while the delivery is in progress. The real-time collection of data enables service providers to track and manage their shipment process efficiently. The key advantage of real-time data collection is that it enables logistics service providers to act proactively to prevent outcomes such as delivery delays caused by unexpected or unknown events. Furthermore, providers today tend to use data stemming from external sources such as Twitter, Facebook, and Waze, because these sources provide critical information about events such as traffic, accidents, and natural disasters. Data from such external sources enrich the dataset and add value to the analysis. Besides, collecting them in real-time provides an opportunity to use the data for on-the-fly analysis and prevent unexpected outcomes (e.g., delivery delays) at run-time. However, the data are collected raw and need to be processed for effective analysis. Collecting and processing data in real-time is an enormous challenge, mainly because data stem from heterogeneous sources at very high speed. The high speed and variety of data create challenges in performing complex processing operations such as cleansing, filtering, handling incorrect data, etc. The variety of data (structured, semi-structured, and unstructured) raises challenges in processing data both in batch style and in real-time, as different types of data may require different processing techniques. A technical framework that enables the processing of such heterogeneous data is heavily challenging and not currently available.
In addition, performing data processing operations in real-time is heavily challenging; efficient techniques are required to carry out the operations on high-speed data, which cannot be done using conventional logistics information systems. Therefore, in order to exploit Big Data in logistics service processes, an efficient solution for collecting and processing data in both real-time and batch style is critically important. In this thesis, we developed and experimented with two data processing solutions: SANA and IBRIDIA. SANA is built on a Multinomial Naïve Bayes classifier, whereas IBRIDIA relies on Johnson's hierarchical clustering (HCL) algorithm, a hybrid technology that enables data collection and processing in batch style and in real-time. SANA is a service-based solution which deals with unstructured data. It serves as a multi-purpose system to extract relevant events, including the context of the event (such as place, location, time, etc.). In addition, it can be used to perform text analysis over the targeted events. IBRIDIA was designed to process unknown data stemming from external sources and cluster them on-the-fly in order to gain knowledge and understanding of the data, which assists in extracting events that may lead to delivery delays. According to our experiments, both of these approaches show a unique ability to process logistics data. However, SANA is found more promising since the underlying technology (the Naïve Bayes classifier) outperformed IBRIDIA from a performance measurement perspective. SANA was designed to generate a knowledge graph from the events collected immediately in real-time, without any need to wait, thus reaching maximum benefit from these events. IBRIDIA, on the other hand, has an important influence within the logistics domain for identifying the most influential category of events affecting the delivery.
Unfortunately, with IBRIDIA, we must wait for a minimum number of events to arrive, and there is always a cold start. Since we are interested in re-optimizing the route on the fly, we adopted SANA as our data processing framework
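A Multinomial Naïve Bayes text classifier of the kind SANA builds on can be sketched in a few lines. This is a generic illustration with made-up messages and labels, not the SANA implementation:

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Minimal Multinomial Naive Bayes over whitespace-tokenized messages,
    with Laplace smoothing (illustrative sketch of the classifier family)."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # per-class token counts
        self.class_counts = Counter(labels)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            for tok in doc.split():
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, doc):
        best, best_lp = None, float("-inf")
        n_total = sum(self.class_counts.values())
        for label, n_docs in self.class_counts.items():
            lp = math.log(n_docs / n_total)  # class prior
            total = sum(self.word_counts[label].values())
            for tok in doc.split():
                # Laplace-smoothed token likelihood
                lp += math.log((self.word_counts[label][tok] + 1)
                               / (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = label, lp
        return best
```

Trained on a handful of hypothetical messages, the classifier separates traffic events from weather events by token likelihoods alone.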
Baron, Benjamin. « Transport intermodal de données massives pour le délestage des réseaux d'infrastructure ». Electronic Thesis or Diss., Paris 6, 2016. http://www.theses.fr/2016PA066454.
Texte intégralIn this thesis, we exploit the daily mobility of vehicles to create an alternative transmission medium. Our objective is to draw on the many vehicular trips taken by cars or public transports to overcome the limitations of conventional data networks such as the Internet. In the first part, we take advantage of the bandwidth resulting from the mobility of vehicles equipped with storage capabilities to offload large amounts of delay-tolerant traffic from the Internet. Data is transloaded to data storage devices we refer to as offloading spots, located where vehicles stop often and long enough to transfer large amounts of data. Those devices act as data relays, i.e., they store data until it is loaded on and carried by a vehicle to the next offloading spot, where it can be dropped off for later pick-up and delivery by another vehicle. We further extend the concept of offloading spots in two directions in the context of vehicular cloud services. In the first extension, we exploit the storage capabilities of the offloading spots to design a cloud-like storage and sharing system for vehicle passengers. In the second extension, we dematerialize the offloading spots into pre-defined areas with high densities of vehicles that meet long enough to transfer large amounts of data. The performance evaluation of the various works conducted in this thesis shows that the everyday mobility of the entities surrounding us enables innovative services with limited reliance on conventional data networks
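The bandwidth argument can be made concrete with back-of-envelope arithmetic (the numbers below are hypothetical, not from the thesis): a vehicle ferrying a storage device achieves an effective throughput of storage size divided by trip time:

```python
def ferry_throughput_gbps(storage_tb, trip_hours):
    """Effective throughput of physically carrying `storage_tb` terabytes
    over a trip of `trip_hours` hours, in gigabits per second
    (illustrative arithmetic; 1 TB = 8e12 bits)."""
    bits = storage_tb * 8e12
    return bits / (trip_hours * 3600) / 1e9
```

For example, a 1 TB disk carried over a one-hour trip sustains roughly 2.2 Gbit/s, competitive with a dedicated network link for delay-tolerant bulk transfers.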
Legrand, Nicolas. « Numerical and modeling methods for multi-level large eddy simulations of turbulent flows in complex geometries ». Thesis, Normandie, 2017. http://www.theses.fr/2017NORMIR16/document.
Texte intégralLarge-Eddy Simulation (LES) has become a major tool for the analysis of highly turbulent flows in complex geometries. However, due to the steady increase in computational resources, the amount of data generated by well-resolved numerical simulations is such that it has become very challenging to manage it with traditional data processing tools. In Computational Fluid Dynamics (CFD), this emerging problem leads to the same "Big Data" challenges as in the computer science field. Some techniques have already been developed, such as data partitioning and ordering or parallel processing, but they remain insufficient for modern numerical simulations. Hence, the objective of this work is to propose new processing formalisms to circumvent the data volume issue for the 2020 exascale computing objectives. To this aim, a massively parallel co-processing method, suited for complex geometries, was developed in order to extract large-scale features in turbulent flows. The principle of the method is to introduce a series of coarser nested grids to reduce the amount of data while keeping the large scales of interest. Data is transferred from one grid level to another using high-order filters and accurate interpolation techniques. This method made it possible to apply modal decomposition techniques to a billion-cell LES of a 3D turbulent turbine blade, thus demonstrating its effectiveness. The capability of performing calculations on several embedded grid levels was then used to devise multi-resolution LES (MR-LES). The aim of the method is to evaluate the modeling and numerical errors during an LES by conducting the same simulation on two different mesh resolutions simultaneously. This error estimation is highly valuable as it allows optimal grids to be generated through the construction of an objective grid quality measure. MR-LES intends to limit the computational cost of the simulation while minimizing the sub-grid scale modeling errors.
This novel framework was applied successfully to the simulation of a turbulent flow around a 3D cylinder
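The nested-grid idea can be illustrated with the crudest possible level transfer, a 2x2 box-filter coarsening. The thesis uses high-order filters and interpolation; this is only a sketch of the principle:

```python
def coarsen(field):
    """One level of grid coarsening: average each 2x2 block of a 2D field
    (a simple box filter standing in for high-order filters; assumes even
    dimensions). Each level divides the data volume by four while keeping
    the large scales."""
    rows, cols = len(field), len(field[0])
    return [[(field[i][j] + field[i][j + 1]
              + field[i + 1][j] + field[i + 1][j + 1]) / 4.0
             for j in range(0, cols, 2)]
            for i in range(0, rows, 2)]
```

Applying the function repeatedly yields the series of nested coarser grids on which modal decompositions become affordable.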
Laur, Pierre Alain. « Données semi structurées : Découverte, maintenance et analyse de tendances ». Montpellier 2, 2004. http://www.theses.fr/2004MON20053.
Texte intégral
Madera, Cedrine. « L’évolution des systèmes et architectures d’information sous l’influence des données massives : les lacs de données ». Thesis, Montpellier, 2018. http://www.theses.fr/2018MONTS071/document.
Texte intégralData is at the heart of the digital transformation. The consequence is an acceleration of the evolution of the information system, which must adapt. The big data phenomenon plays the role of catalyst in this evolution. Under its influence a new component of the information system appears: the data lake. Far from replacing the decision support systems that make up the information system, data lakes complete the information system architecture. First, we focus on the factors that influence the evolution of information systems, such as new software and middleware and new infrastructure technologies, but also the usage of the decision support system itself. Under the influence of big data, we study the impact this entails, especially with the appearance of new technologies such as Apache Hadoop, as well as the current limits of the decision support system. The limits encountered by the current decision support system force a change to the information system, which must adapt, and this gives birth to a new component: the data lake. Second, we study this new component in detail, formalize our definition, and give our point of view on its positioning in the information system as well as with regard to the decision support system. In addition, we highlight a factor influencing the architecture of data lakes: data gravity, drawing an analogy with the law of gravity and focusing on the factors that may influence the data-processing relationship. We highlight, through a use case, that taking data gravity into account can influence the design of a data lake. We complete this work by adapting the software product line approach to bootstrap a method for formalizing and modeling data lakes.
This method allows us:
- to establish a minimum list of components to be put in place to operate a data lake without transforming it into a data swamp,
- to evaluate the maturity of an existing data lake,
- to quickly diagnose the missing components of an existing data lake that has become a data swamp,
- to conceptualize the creation of data lakes by being "software agnostic"
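The gravity analogy can be written down in its simplest Newtonian form. This is an illustrative formulation in the spirit of McCrory's data gravity, not the thesis's exact model:

```python
def data_gravity(data_mass, app_mass, distance):
    """Newtonian analogy: the attraction between a data set and an
    application grows with their 'masses' and falls with the square of the
    network 'distance' between them (illustrative units and formula)."""
    return data_mass * app_mass / distance ** 2
```

The practical reading is that large data sets pull processing toward themselves: halving the distance (latency, transfer cost) quadruples the attraction, which argues for co-locating compute with the heaviest data in a lake design.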
Fraisse, Bernard. « Automatisation, traitement du signal et recueil de données en diffraction x et analyse thermique : Exploitation, analyse et représentation des données ». Montpellier 2, 1995. http://www.theses.fr/1995MON20152.
Texte intégral
David, Claire. « Analyse de XML avec données non-bornées ». Paris 7, 2009. http://www.theses.fr/2009PA077107.
Texte intégral
The motivation of this work is the specification and static analysis of schemas for XML documents, paying special attention to data values. We consider words and trees whose positions are labeled both by a letter from a finite alphabet and by a data value from an infinite domain. Our goal is to find formalisms which offer good trade-offs between expressiveness, decidability and complexity (for the satisfiability problem). We first study an extension of first-order logic with a binary predicate representing data equality, and obtain some interesting results for the two-variable fragment. This approach is elegant but the complexity results are not encouraging. We then propose another formalism based on data patterns, which can be desired, forbidden, or any boolean combination thereof. We draw the decidability frontier precisely for various fragments of this model. The complexity results we obtain, while still high, seem more amenable. In terms of expressiveness these two approaches are orthogonal: the two-variable fragment of the extension of FO can express unary keys and unary foreign keys, while boolean combinations of data patterns can express arbitrary keys but cannot express foreign keys.
Abdali, Abdelkebir. « Systèmes experts et analyse de données industrielles ». Lyon, INSA, 1992. http://www.theses.fr/1992ISAL0032.
Texte intégral
To analyze industrial process behavior, many kinds of information are needed. As they are mostly numerical, statistical and data analysis methods are well suited to this activity. Their results must be interpreted together with other knowledge about the analyzed process. Our work falls within the framework of applying Artificial Intelligence techniques to Statistics. Its aim is to study the feasibility and development of statistical expert systems for industrial processes. The prototype ALADIN is a knowledge-based system designed to be an intelligent assistant that helps a non-specialist user analyze data collected on industrial processes. Written in Turbo-Prolog, it is coupled with the statistical package MODULAD. The architecture of this system is flexible, combining knowledge about plants in general, the studied process, and statistical methods. Its validation was performed on continuous manufacturing processes (cement and cast iron processes). At present, it is limited to Principal Component Analysis problems.
Rabah, Mazouzi. « Approches collaboratives pour la classification des données complexes ». Electronic Thesis or Diss., Paris 8, 2016. http://www.theses.fr/2016PA080079.
Texte intégral
This thesis focuses on collaborative classification in the context of complex data, in particular Big Data, and uses several computational paradigms to propose new approaches based on HPC technologies. In this context, we aim at offering massive classifiers, in the sense that the number of elementary classifiers that make up the multiple classifier system can be very high. In this case, conventional methods of interaction between classifiers are no longer valid and we had to propose new forms of interaction, in which it is not required to take all classifiers' predictions into account to build an overall prediction. Accordingly, we faced two problems. The first is the ability of our approaches to scale up. The second is the diversity that must be created and maintained within the system to ensure its performance. We therefore studied the distribution of classifiers in a cloud-computing environment; such a multiple classifier system can be massive and its properties are those of a complex system. In terms of data diversity, we proposed a training data enrichment approach that generates synthetic data from analytical models describing part of the studied phenomenon, so that the mixture of data reinforces the learning of the classifiers. The experiments conducted have shown great potential for substantially improving classification results.
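The idea that an overall prediction need not poll every member of a massive ensemble can be illustrated with a toy sketch: majority voting over a random subset of elementary classifiers. This is a generic illustration, not the thesis's actual collaborative protocol; the threshold classifiers and the subset size are invented for the example.

```python
import random
from collections import Counter

def make_threshold_classifier(threshold):
    """A trivial 'elementary classifier': predict 1 if x exceeds its threshold."""
    return lambda x: 1 if x > threshold else 0

def ensemble_predict(classifiers, x, sample_size, rng):
    """Aggregate by majority vote over a random subset of classifiers,
    rather than requiring the predictions of every member."""
    subset = rng.sample(classifiers, sample_size)
    votes = Counter(clf(x) for clf in subset)
    return votes.most_common(1)[0][0]

rng = random.Random(42)
# A "massive" ensemble of slightly diverse classifiers (thresholds near 0.5).
classifiers = [make_threshold_classifier(0.5 + rng.uniform(-0.1, 0.1))
               for _ in range(1000)]
print(ensemble_predict(classifiers, 0.9, sample_size=31, rng=rng))  # all thresholds below 0.9 -> 1
print(ensemble_predict(classifiers, 0.1, sample_size=31, rng=rng))  # all thresholds above 0.1 -> 0
```

Sampling only 31 of 1000 members keeps the interaction cost independent of the ensemble size; maintaining diversity among the members is what makes such subsampling safe.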
Cayot, Robert-Olivier. « Récupération automatique d'erreurs syntaxiques en analyse discriminante rétrograde ». Nice, 2001. http://www.theses.fr/2001NICE5690.
Texte intégral
Sibony, Eric. « Analyse multirésolution de données de classements ». Electronic Thesis or Diss., Paris, ENST, 2016. http://www.theses.fr/2016ENST0036.
Texte intégral
This thesis introduces a multiresolution analysis framework for ranking data. Initiated in the 18th century in the context of elections, the analysis of ranking data has attracted major interest in many fields of the scientific literature: psychometry, statistics, economics, operations research, machine learning and computational social choice, among others. It has been further revitalized by modern applications such as recommender systems, where the goal is to infer users' preferences in order to make them the best personalized suggestions. In these settings, users express their preferences only on small and varying subsets of a large catalog of items. The analysis of such incomplete rankings, however, poses both a great statistical and a great computational challenge, leading industrial actors to use methods that exploit only a fraction of the available information. This thesis introduces a new representation for the data which, by construction, overcomes these two challenges. Though it relies on results from combinatorics and algebraic topology, it shares several analogies with multiresolution analysis, offering a natural and efficient framework for the analysis of incomplete rankings. As it does not involve any assumption on the data, it already leads to better-performing estimators in small-scale settings and can be combined with many regularization procedures for large-scale settings. For all these reasons, we believe that this multiresolution representation paves the way for a wide range of future developments and applications.
Ghesmoune, Mohammed. « Apprentissage non supervisé de flux de données massives : application aux Big Data d'assurance ». Thesis, Sorbonne Paris Cité, 2016. http://www.theses.fr/2016USPCD061/document.
Texte intégral
The research outlined in this thesis concerns the development of approaches based on growing neural gas (GNG) for the clustering of data streams. We propose three algorithmic extensions of the GNG approach: sequential, distributed and parallel, and hierarchical; as well as a model for scalability using MapReduce and its application to learning clusters from real insurance Big Data in the form of a data stream. We first propose the G-Stream method. G-Stream, a "sequential" clustering method, is a one-pass data stream clustering algorithm that discovers clusters of arbitrary shapes without any assumption on the number of clusters. G-Stream uses an exponential fading function to reduce the impact of old data whose relevance diminishes over time. The links between the nodes are also weighted. A reservoir is used to temporarily hold distant observations in order to reduce the movements of the nodes nearest to the observations. The batchStream algorithm is a micro-batch method for clustering data streams which defines a new cost function taking into account that subsets of observations arrive in discrete batches. The minimization of this function, which leads to a topological clustering, is carried out using dynamic clusters in two steps: an assignment step which assigns each observation to a cluster, followed by an optimization step which computes the prototype of each node. A scalable model using MapReduce is then proposed. It consists of decomposing the data stream clustering problem into the elementary functions Map and Reduce. The observations received in each sub-dataset (within a time interval) are processed through deterministic parallel operations (Map and Reduce) to produce the intermediate states or the final clusters. The batchStream algorithm is validated on the insurance Big Data. A predictive and analysis system is proposed by combining the clustering results of batchStream with decision trees.
The architecture and these different modules form the computational core of our Big Data project, called Square Predict. Our third extension, GH-Stream, uses a hierarchical and topological structure for both visualization and clustering tasks.
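The exponential fading idea, down-weighting old observations so that their relevance diminishes over time, can be sketched independently of G-Stream itself. The decay form `2**(-lam * age)` is a common choice in stream clustering; the decay rate and the node bookkeeping below are illustrative assumptions, not the thesis's parameters.

```python
def fading_weight(age, lam=0.01):
    """Exponential fading: an observation seen `age` time steps ago
    contributes 2**(-lam * age) to its node's weight."""
    return 2.0 ** (-lam * age)

def faded_node_weight(observation_times, now, lam=0.01):
    """Total weight of a cluster node: sum of the faded contributions
    of the observations assigned to it."""
    return sum(fading_weight(now - t, lam) for t in observation_times)

# A node that received observations at t = 0, 50 and 100, evaluated at t = 100:
w = faded_node_weight([0, 50, 100], now=100, lam=0.01)
print(round(w, 3))  # 2.207 -- the recent observation contributes the most
```

Nodes whose faded weight drops below a threshold can be pruned, which is how fading lets a one-pass algorithm forget outdated structure without storing the full stream.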
Bodin, Bruno. « Analyse d'Applications Flot de Données pour la Compilation Multiprocesseur ». Phd thesis, Université Pierre et Marie Curie - Paris VI, 2013. http://tel.archives-ouvertes.fr/tel-00922578.
Texte intégral
Lefebvre, Sylvain. « Services de répartition de charge pour le Cloud : application au traitement de données multimédia ». Phd thesis, Conservatoire national des arts et metiers - CNAM, 2013. http://tel.archives-ouvertes.fr/tel-01062823.
Texte intégral
Fize, Jacques. « Mise en correspondance de données textuelles hétérogènes fondée sur la dimension spatiale ». Thesis, Montpellier, 2019. http://www.theses.fr/2019MONTS099.
Texte intégral
With the rise of Big Data, the processing of the Volume, Velocity (growth and evolution) and Variety of data concentrates the efforts of communities to exploit these new resources, which have become so important that they are considered the new "black gold". In recent years, volume and velocity have become well-controlled aspects of the data, unlike variety, which remains a major challenge. This thesis presents two contributions in the field of heterogeneous data matching, with a focus on the spatial dimension. The first contribution is based on a two-step process for matching heterogeneous textual data: georepresentation and geomatching. In the first phase, we propose to represent the spatial dimension of each document in a corpus through a dedicated structure, the Spatial Textual Representation (STR). This graph representation is composed of the spatial entities identified in the document, as well as the spatial relationships they maintain. To identify the spatial entities of a document and their spatial relationships, we propose a dedicated resource called Geodict. The second phase, geomatching, computes the similarity between the generated representations (STRs). Based on the nature of the STR structure (i.e. a graph), different graph-matching algorithms were studied. To assess the relevance of a match, we propose a set of six criteria based on a definition of the spatial similarity between two documents. The second contribution is based on the thematic dimension of textual data and its participation in the spatial matching process. We propose to identify the themes that appear in the same contextual window as certain spatial entities, with the objective of inducing some of the implicit spatial similarities between the documents. To do this, we propose to extend the structure of the STR using two concepts: the thematic entity and the thematic relationship.
A thematic entity represents a concept specific to a particular field (agronomic, medical) and is represented according to the different spellings present in a terminology resource, in this case a vocabulary. A thematic relationship links a spatial entity to a thematic entity if they appear in the same window. The selected vocabularies and the new form of the STR integrating the thematic dimension are evaluated in terms of their coverage on the studied corpora, as well as their contribution to the heterogeneous textual matching process on the spatial dimension.
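An STR can be pictured as a labeled graph of spatial entities and the relations between them. As a minimal sketch of geomatching, the similarity below is a plain Jaccard overlap of node and edge sets; the thesis studies richer graph-matching criteria, and the place names and relation labels here are invented for illustration.

```python
def str_similarity(str_a, str_b):
    """Jaccard similarity over the combined nodes and edges of two
    Spatial Textual Representations, each given as (nodes, edges)."""
    nodes_a, edges_a = str_a
    nodes_b, edges_b = str_b
    items_a = set(nodes_a) | set(edges_a)
    items_b = set(nodes_b) | set(edges_b)
    if not items_a and not items_b:
        return 1.0  # two empty representations are trivially identical
    return len(items_a & items_b) / len(items_a | items_b)

# Two documents mentioning overlapping spatial entities:
doc1 = ({"Montpellier", "Hérault"}, {("Montpellier", "in", "Hérault")})
doc2 = ({"Montpellier", "Nîmes"}, {("Montpellier", "near", "Nîmes")})
print(str_similarity(doc1, doc2))  # 1 shared item out of 5 -> 0.2
```

Extending the STR with thematic entities and relationships amounts to adding more node and edge types to the same graphs, so the same matching machinery applies.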
Lefebvre, Sylvain. « Services de répartition de charge pour le Cloud : application au traitement de données multimédia ». Electronic Thesis or Diss., Paris, CNAM, 2013. http://www.theses.fr/2013CNAM0910.
Texte intégral
The research work carried out in this thesis consists in the development of new load balancing algorithms aimed at big data computing. The first algorithm, called "WACA" (Workload And Cache Aware Algorithm), enhances response times by locating data efficiently through content summaries. The second algorithm, called "CAWA" (Cost AWare Algorithm), takes advantage of the cost information available on Cloud Computing platforms by studying the workload history. Evaluating these algorithms required the development of a cloud infrastructure simulator named Simizer, to enable testing of these policies prior to their deployment. Deployment can be done transparently thanks to the Cloudizer web service distribution and monitoring system, also developed during this thesis. These works are part of the Multimedia for Machine to Machine (MCUBE) project, in which the Cloudizer framework is deployed.
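The workload- and cache-aware principle behind an algorithm like WACA — prefer a server whose content summary reports that it already holds the requested data, otherwise fall back to the least-loaded server — can be sketched with plain sets standing in for the summaries. This is an assumed simplification: the thesis uses compact content summaries, and the routing rule below is illustrative rather than the actual WACA policy.

```python
def route(request_key, servers):
    """servers: list of dicts with a 'cache' set (content summary) and a
    'load' counter. Prefer a cache hit; otherwise pick the least-loaded
    server. Returns the index of the chosen server."""
    holders = [i for i, s in enumerate(servers) if request_key in s["cache"]]
    if holders:
        chosen = min(holders, key=lambda i: servers[i]["load"])
    else:
        chosen = min(range(len(servers)), key=lambda i: servers[i]["load"])
    servers[chosen]["load"] += 1
    servers[chosen]["cache"].add(request_key)  # the served item is now cached
    return chosen

servers = [{"cache": {"a"}, "load": 3},
           {"cache": {"b"}, "load": 0},
           {"cache": set(), "load": 1}]
print(route("a", servers))  # cache hit on server 0 despite its higher load -> 0
print(route("c", servers))  # miss: goes to the least-loaded server -> 1
```

Keeping a whole key set per server does not scale, which is why real systems summarize content with compact probabilistic structures and accept occasional false positives.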
Ben, Hedia Belgacem. « Analyse temporelle des systèmes d'acquisition de données : une approche à base d'automates temporisés communicants et d'observateurs ». Lyon, INSA, 2008. http://theses.insa-lyon.fr/publication/2008ISAL0111/these.pdf.
Texte intégral
Bisgambiglia, Paul-Antoine. « Traitement numérique et informatique de la modélisation spectrale ». Corte, 1989. http://www.theses.fr/1989CORT3002.
Texte intégral
Auber, David. « Outils de visualisation de larges structures de données ». Bordeaux 1, 2002. http://www.theses.fr/2002BOR12607.
Texte intégral
Aubert, Pierre. « Calcul haute performance pour la détection de rayon Gamma ». Thesis, Université Paris-Saclay (ComUE), 2018. http://www.theses.fr/2018SACLV058/document.
Texte intégral
The new generation of research experiments will add a huge data surge to the continuously increasing data production of current experiments. This increasing data rate causes upheavals at many levels, such as data storage, analysis, diffusion and conservation. The CTA project will become the foremost ground-based gamma-ray astronomy observatory from 2021. It will generate hundreds of petabytes of data by 2030, which will have to be stored, compressed and analyzed each year. This work addresses the problems of data analysis optimization using high performance computing techniques: an efficient data format generator, very low-level programming to optimize the CPU pipeline, and the vectorization of existing algorithms. It also introduces a fast compression algorithm for integers and finally presents a new analysis algorithm based on efficient picture comparison.
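The abstract does not detail the thesis's integer compression algorithm; a generic sketch of the family such schemes belong to is delta encoding followed by variable-byte (varint) coding, shown below under that assumption. It exploits the fact that sorted sequences such as timestamps have small gaps.

```python
def varint_encode(values):
    """Delta-encode a sorted sequence of non-negative ints, then pack each
    delta into 7-bit groups with a continuation bit (variable-byte coding)."""
    out = bytearray()
    prev = 0
    for v in values:
        delta = v - prev
        prev = v
        while True:
            byte = delta & 0x7F
            delta >>= 7
            if delta:
                out.append(byte | 0x80)  # continuation bit: more groups follow
            else:
                out.append(byte)
                break
    return bytes(out)

def varint_decode(data):
    """Inverse transform: unpack the varints, then undo the delta encoding."""
    values, acc, shift, prev = [], 0, 0, 0
    for byte in data:
        acc |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += acc
            values.append(prev)
            acc, shift = 0, 0
    return values

timestamps = [1000000, 1000007, 1000012, 1000300]
packed = varint_encode(timestamps)
assert varint_decode(packed) == timestamps
print(len(packed))  # 7 bytes instead of 4 * 8 with fixed 64-bit ints
```

Because each delta occupies only as many 7-bit groups as it needs, the small gaps between consecutive values compress to one byte each, which is also why such codecs vectorize well.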
Marcel, Patrick. « Manipulations de données multidimensionnelles et langages de règles ». Lyon, INSA, 1998. http://www.theses.fr/1998ISAL0093.
Texte intégral
This work is a contribution to the study of manipulations in data warehouses. In the first part, we present a state of the art of multidimensional data manipulation languages in systems dedicated to On-Line Analytical Processing (OLAP systems), and point out interesting combinations that have not been studied. These conclusions are used in the second part to propose a simple rule-based language for specifying typical treatments arising in OLAP systems. In the third part, we illustrate the use of the language to describe OLAP treatments in spreadsheets and to semi-automatically generate spreadsheet programs.
Schaefer, Xavier. « Bases de données orientées objet, contraintes d'intégrité et analyse statique ». Paris 1, 1997. http://www.theses.fr/1997PA010098.
Texte intégral