Dissertations on the topic "ANALYZE BIG DATA"

Author: Grafiati

Published: 11 September 2023

See the top 50 dissertations and theses on the topic "ANALYZE BIG DATA".

1

SHARMA, DIVYA. "APPLICATION OF ML TO MAKE SENCE OF BIOLOGICAL BIG DATA IN DRUG DISCOVERY PROCESS". Thesis, DELHI TECHNOLOGICAL UNIVERSITY, 2021. http://dspace.dtu.ac.in:8080/jspui/handle/repository/18378.

Abstract:

Scientists have worked for years to assemble and accumulate data from biological sources in order to answer many fundamental questions. Because a tremendous amount of data has already been collected and continues to grow at an exponential rate, it is no longer feasible for a human being alone to handle or analyze it. Most data collection and maintenance is now done in digital form, so organizations need better data management and analysis to convert this vast data resource into insights that serve their objectives. The continuous explosion of information from both biomedical and healthcare sources calls for urgent solutions. Healthcare data needs to be closely combined with biomedical research data to make it more effective in providing personalized medicine and better treatment procedures. Big data analytics would therefore help integrate large data sets for proper management, decision-making, and cost-effectiveness in any medical or healthcare organization. The scope of the thesis is to highlight the need for big data analytics in healthcare and to explain the data processing pipeline and the machine learning methods used to analyze big data.

2

Uřídil, Martin. "Big data - použití v bankovní sféře". Master's thesis, Vysoká škola ekonomická v Praze, 2012. http://www.nusl.cz/ntk/nusl-149908.

Abstract:

The volume of global data is growing, offering new possibilities to market participants who know how to take advantage of it. Data, information, and knowledge are a new, highly regarded commodity, especially in the banking industry. Traditional data analytics is intended for processing data with known structure and meaning. But how can we extract knowledge from data with no such structure? The thesis focuses on Big Data analytics and its use in the banking and financial industry. Its main goals are to define specific applications in this area and to describe the benefits for international and Czech banking institutions. The thesis is divided into four parts. The first part defines the Big Data trend; the second specifies activities and tools in banking. The third part applies Big Data analytics to those activities and shows its possible benefits. The last part focuses on the particularities of Czech banking and describes the current state of Big Data in Czech banks. The thesis gives a comprehensive description of the possibilities of using Big Data analytics. I see my personal contribution in the detailed characterization of its application to real banking activities.

3

Flike, Felix, and Markus Gervard. "BIG DATA-ANALYS INOM FOTBOLLSORGANISATIONER En studie om big data-analys och värdeskapande". Thesis, Malmö universitet, Fakulteten för teknik och samhälle (TS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20117.

Abstract:

Big data is a relatively new term, but the phenomenon has existed for a long time. It can be described by five Vs: volume, veracity, variety, velocity, and value. Big data analysis has proven valuable to organizations for decision-making, generating measurable economic benefits, and improving operations. In the sports industry, it came into serious use in the early 2000s at the baseball organization Oakland Athletics, which began recruiting players based on their statistics rather than on scouts' assessments of their ability, with great success. More organizations followed suit, and it was not long before big data analysis was used in all major sports to gain an edge over competitors. In the Swedish context, the use of these tools is still relatively new, and many organizations may have moved too fast in implementing them. This study aims to examine, through a case analysis, how football organizations work with big data analysis in relation to their players. The results show that both organizations create value from their investments, which helps them reach their strategic goals, though they do so in different ways. Which way is most effective in terms of value creation cannot be determined from this study.

4

Šoltýs, Matej. "Big Data v technológiách IBM". Master's thesis, Vysoká škola ekonomická v Praze, 2014. http://www.nusl.cz/ntk/nusl-193914.

Abstract:

This diploma thesis presents Big Data technologies and their possible use cases and applications. The theoretical part first defines the term Big Data and then focuses on Big Data technology, particularly the Hadoop framework. It describes the principles of Hadoop, such as distributed storage and data processing, and its individual components. It also presents the largest vendors of Big Data technologies. The theoretical part closes with possible use cases of Big Data technologies and some case studies. The practical part describes the implementation of a demo example of Big Data technologies and is divided into two chapters. The first chapter deals with the conceptual design of the demo example, the products used, and the architecture of the solution. The second chapter describes the implementation of the demo example, from preparation of the demo environment to creation of the applications. The goals of this thesis are to describe and characterize Big Data, to present the largest vendors and their Big Data products, to describe possible use cases of Big Data technologies and, above all, to implement a demo example in Big Data tools from IBM.

5

Victoria, Åkestrand, and Wisen My. "Big Data-analyser och beslutsfattande i svenska myndigheter". Thesis, Högskolan i Halmstad, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-34752.

Abstract:

There is a great deal of data to collect about people, and the amount that can be collected is growing. More and more organizations are adopting big data, and Swedish government agencies are among them. Analyzing big data can produce better decision support, but there are problems in how collected data should be analyzed and used in the decision process. The study's results indicate that Swedish government agencies cannot use existing decision models for decisions based on a big data analysis. The results also indicate that Swedish agencies do not follow set steps in the decision process; rather, it is mostly a matter of identifying the content of the big data analysis in order to make a decision. Because the decision is grounded in what the big data analysis points to, the surrounding activities, such as data collection, data quality assurance, data analysis, and data visualization, become increasingly essential.

6

Kleisarchaki, Sofia. "Analyse des différences dans le Big Data : Exploration, Explication, Évolution". Thesis, Université Grenoble Alpes (ComUE), 2016. http://www.theses.fr/2016GREAM055/document.

Abstract:

Variability in Big Data refers to data whose meaning changes continuously. For instance, data derived from social platforms and from monitoring applications exhibits great variability. This variability is essentially the result of changes in the underlying data distributions of attributes of interest, such as user opinions/ratings, computer network measurements, etc. Difference Analysis aims to study variability in Big Data. To achieve that goal, data scientists need: (a) measures to compare data in various dimensions such as age for users or topic for network traffic, and (b) efficient algorithms to detect changes in massive data. In this thesis, we identify and study three novel analytical tasks to capture data variability: Difference Exploration, Difference Explanation and Difference Evolution.

Difference Exploration is concerned with extracting the opinion of different user segments (e.g., on a movie rating website). We propose appropriate measures for comparing user opinions in the form of rating distributions, and efficient algorithms that, given an opinion of interest in the form of a rating histogram, discover agreeing and disagreeing populations. Difference Explanation tackles the question of providing a succinct explanation of differences between two datasets of interest (e.g., buying habits of two sets of customers). We propose scoring functions designed to rank explanations, and algorithms that guarantee explanation conciseness and informativeness. Finally, Difference Evolution tracks change in an input dataset over time and summarizes change at multiple time granularities. We propose a query-based approach that uses similarity measures to compare consecutive clusters over time. Our indexes and algorithms for Difference Evolution are designed to capture different data arrival rates (e.g., low, high) and different types of change (e.g., sudden, incremental). The utility and scalability of all our algorithms rely on hierarchies inherent in data (e.g., time, demographic).

We run extensive experiments on real and synthetic datasets to validate the usefulness of the three analytical tasks and the scalability of our algorithms. We show that Difference Exploration guides end-users and data scientists in uncovering the opinion of different user segments in a scalable way. Difference Explanation reveals the need to parsimoniously summarize differences between two datasets and shows that parsimony can be achieved by exploiting hierarchy in data. Finally, our study on Difference Evolution provides strong evidence that a query-based approach is well-suited to tracking change in datasets with varying arrival rates and at multiple time granularities. Similarly, we show that different clustering approaches can be used to capture different types of change.
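
To make the Difference Exploration task in this abstract more concrete, here is a minimal sketch of comparing rating histograms and splitting user segments into agreeing and disagreeing groups. The abstract does not name its comparison measure, so total variation distance is an assumption chosen purely for illustration; the function names, segment names, and the `max_distance` cutoff are likewise hypothetical and not taken from the thesis.

```python
def normalize(hist):
    """Turn a histogram of rating counts into a probability distribution."""
    total = sum(hist)
    return [count / total for count in hist]


def total_variation(p, q):
    """Total variation distance between two distributions (0 = identical, 1 = disjoint)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def split_segments(target_hist, segments, max_distance):
    """Split segments by whether their rating distribution is close to the target.

    `segments` maps a (hypothetical) segment name to its rating histogram.
    `max_distance` is an illustrative cutoff, not a value from the thesis.
    """
    target = normalize(target_hist)
    agree, disagree = [], []
    for name, hist in segments.items():
        distance = total_variation(target, normalize(hist))
        (agree if distance <= max_distance else disagree).append(name)
    return agree, disagree


# Histograms count ratings of 1 to 5 stars.
opinion = [0, 0, 0, 5, 5]
segments = {"teens": [0, 0, 0, 4, 6], "seniors": [6, 4, 0, 0, 0]}
agree, disagree = split_segments(opinion, segments, max_distance=0.2)
```

This brute-force scan touches every segment; the thesis abstract states that its algorithms exploit hierarchies in the data to scale, which this sketch does not attempt.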

7

Nováková, Martina. "Analýza Big Data v oblasti zdravotnictví". Master's thesis, Vysoká škola ekonomická v Praze, 2014. http://www.nusl.cz/ntk/nusl-201737.

Abstract:

This thesis deals with the analysis of Big Data in healthcare. The aim is to define the term Big Data and to acquaint the reader with data growth in the world and in the health sector. Another objective is to explain the concept of a data expert and to define the members of a data expert team. The following chapters define the phases of Big Data analysis according to the methodology of the company EMC2 and describe basic technologies for analysing Big Data. I consider the part that defines the tasks in which Big Data technologies are already used in healthcare to be beneficial and interesting. In the practical part, I perform a Big Data analysis task focusing on meteorotropic diseases, using real medical and meteorological data. The reader is acquainted not only with one of the recommended methods of analysis and the statistical models used, but also with terms from the fields of biometeorology and healthcare. An integral part of the analysis is information about its limitations, consultation on the results, and the conclusions of experts in meteorology and healthcare.

8

El, alaoui Imane. "Transformer les big social data en prévisions - méthodes et technologies : Application à l'analyse de sentiments". Thesis, Angers, 2018. http://www.theses.fr/2018ANGE0011/document.

Abstract:

Extracting public opinion by analyzing Big Social data has grown substantially because of the interactive, real-time nature of such data. Our actions on social media generate digital traces that are closely related to our personal lives and can be used to follow major events by analysing people's behavior. It is in this context that we are particularly interested in Big Data analysis methods. The volume of these daily-generated traces increases exponentially, creating massive loads of information known as big data. Such a volume of information can be neither stored nor handled with conventional tools, so new tools have emerged to help us cope with the challenges of big data. The first part of this manuscript therefore goes through the pros and cons of these tools, compares their respective performance, and highlights some of their interrelated applications, such as health, marketing and politics. We also introduce the general context of big data, Hadoop and its different distributions, and provide a comprehensive overview of big data tools and their related applications.

The main contribution of this PhD thesis is a generic analysis approach to automatically detect trends on given topics from big social data. Given a very small set of manually annotated hashtags, the proposed approach transfers the known sentiment of the hashtags (positive or negative) to individual words. The resulting lexical resource is a large-scale polarity lexicon whose effectiveness is measured on different sentiment analysis tasks. The comparison of our method with different paradigms in the literature confirms its benefit for designing accurate sentiment analysis systems. Our model reaches an overall accuracy of 90.21%, significantly exceeding current reference models for social sentiment analysis.
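
The idea of transferring sentiment from a few annotated hashtags to individual words can be sketched as follows. This is a simplified illustration, not the thesis's actual procedure: it assumes whitespace tokenization, assumes hashtag polarities of +1.0 and -1.0, and scores each word by the average polarity of the labelled posts it appears in. All names and sample posts are hypothetical.

```python
from collections import defaultdict


def build_lexicon(posts, hashtag_polarity):
    """Score each word by the mean polarity of the labelled posts containing it.

    `hashtag_polarity` maps a manually annotated hashtag to +1.0 or -1.0.
    Posts without any annotated hashtag are skipped.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for text in posts:
        tokens = text.lower().split()
        labels = [hashtag_polarity[t] for t in tokens if t in hashtag_polarity]
        if not labels:
            continue
        post_score = sum(labels) / len(labels)
        for token in tokens:
            if token.startswith("#"):
                continue  # hashtags are the labels, not lexicon entries
            totals[token] += post_score
            counts[token] += 1
    return {word: totals[word] / counts[word] for word in totals}


seeds = {"#happy": 1.0, "#sad": -1.0}
posts = ["great day #happy", "awful day #sad", "no label here"]
lexicon = build_lexicon(posts, seeds)
```

In this toy input, a word seen equally often with both polarities (here "day") ends up neutral, while words seen only with one polarity inherit it.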

9

Pragarauskaitė, Julija. "Frequent pattern analysis for decision making in big data". Doctoral thesis, Lithuanian Academic Libraries Network (LABT), 2013. http://vddb.laba.lt/obj/LT-eLABa-0001:E.02~2013~D_20130701_092451-80961.

Abstract:

Huge amounts of digital information are stored in the world today, and the amount is increasing by quintillions of bytes every day. Approximate data mining algorithms are very important for dealing efficiently with such amounts of data because of the computation speed required by various real-world applications, whereas exact data mining methods tend to be slow and are best employed where precise results are of the highest importance. This thesis focuses on several data mining tasks related to the analysis of big data: frequent pattern mining and visual representation. For mining frequent patterns in big data, three novel approximate methods are proposed and evaluated on real and artificial databases:
• Random Sampling Method (RSM) creates a random sample from the original database and makes assumptions about the frequent and rare sequences based on the analysis results of the random sample. A significant benefit is a theoretical estimate of the classification errors made by this method, obtained using standard statistical methods.
• Multiple Re-sampling Method (MRM) is an improved version of RSM with a re-sampling strategy that decreases the probability of incorrectly classifying sequences as frequent or rare.
• Markov Property Based Method (MPBM) relies upon the Markov property. MPBM requires reading the original database several times (the number is equal to the order of the Markov process) and then calculates the empirical frequencies using the Markov property.
For visual representation...
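
As a rough illustration of the random-sampling idea described in this abstract, the sketch below estimates how often a pattern occurs from a random sample and classifies it as frequent or rare. It is not the thesis's RSM: the records are assumed to be strings, a pattern "occurs" if it is a substring, and the error estimate is an ordinary 95% normal-approximation interval picked as one example of a standard statistical method. All names and thresholds are hypothetical.

```python
import math
import random


def estimate_frequency(database, pattern, sample_size, seed=0):
    """Estimate the share of records containing `pattern` from a random sample.

    Returns the sample proportion and the half-width of a 95%
    normal-approximation confidence interval.
    """
    rng = random.Random(seed)
    sample = rng.sample(database, sample_size)  # sampling without replacement
    hits = sum(1 for record in sample if pattern in record)
    p_hat = hits / sample_size
    half_width = 1.96 * math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat, half_width


def classify(p_hat, half_width, threshold):
    """Label a pattern only when the whole interval lies on one side of the threshold."""
    if p_hat - half_width > threshold:
        return "frequent"
    if p_hat + half_width < threshold:
        return "rare"
    return "undecided"


database = ["abab"] * 500
p_hat, half_width = estimate_frequency(database, "ab", sample_size=100)
label = classify(p_hat, half_width, threshold=0.5)
```

An "undecided" result is where a re-sampling strategy such as the one the abstract attributes to MRM could help, by drawing further samples to narrow the interval.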

10

Landelius, Cecilia. "Data governance in big data : How to improve data quality in a decentralized organization". Thesis, KTH, Industriell ekonomi och organisation (Inst.), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-301258.

Abstract:

The use of the internet has increased the amount of data available and gathered. Companies are investing in big data analytics to gain insights from this data. However, the value of the analysis, and of the decisions based on it, depends on the quality of the underlying data. For this reason, data quality has become a prevalent issue for organizations. Additionally, failures in data quality management are often due to organizational aspects. Given the growing popularity of decentralized organizational structures, there is a need to understand how a decentralized organization can improve data quality. This thesis conducts a qualitative single case study of an organization in the logistics industry that is currently shifting towards becoming data driven and is struggling to maintain data quality. The purpose of the thesis is to answer the questions:
• RQ1: What is data quality in the context of logistics data?
• RQ2: What are the obstacles for improving data quality in a decentralized organization?
• RQ3: How can these obstacles be overcome?
Several data quality dimensions were identified and categorized as critical issues, issues and non-issues. From the gathered data, the dimensions completeness, accuracy and consistency were found to be critical issues of data quality. The three most prevalent obstacles to improving data quality were data ownership, data standardization and understanding the importance of data quality. To overcome these obstacles, the most important measures are creating data ownership structures, implementing data quality practices and changing the mindset of the employees to a data-driven mindset. The generalizability of a single case study is low. However, there are insights and trends which can be derived from the results of this thesis and used for further studies and by companies undergoing similar transformations.

11

Grüning, Björn [Verfasser], and Stefan [Akademischer Betreuer] Günther. "Integrierte bioinformatische Methoden zur reproduzierbaren und transparenten Hochdurchsatz-Analyse von Life Science Big Data". Freiburg : Universität, 2015. http://d-nb.info/1122593996/34.

12

Åhlander, Niclas, and Saed Aldaamsah. "Inhämtning & analys av Big Data med fokus på sociala medier". Thesis, Högskolan i Halmstad, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-29978.

Abstract:

In a world that increasingly uses social media, information about users is created and made visible that previously was not easy to analyze in large quantities. This work presents the process of creating an automated way of collecting specific data from social media. The collected data is then analyzed with carefully designed algorithms, and finally the usefulness of the process as a whole is demonstrated. Data collection from social media was automated using a number of combined methods. The collected data could then be analyzed using specific algorithms presented in this work. Together, the methods revealed certain patterns in the data, which exposed many different types of information about the individuals selected for the analysis.

13

Rivetti, di Val Cervo Nicolo. "Efficient Stream Analysis and its Application to Big Data Processing". Thesis, Nantes, 2016. http://www.theses.fr/2016NANT4046/document.

Abstract:

Stream analysis is now used in many contexts where the amount of data and/or the rate at which it is generated rules out other approaches (e.g., batch processing). The data streaming model provides randomized and/or approximate solutions for computing specific functions over (distributed) streams of data items in worst-case scenarios, while striving for low resource usage. In particular, we look into two classical and related data streaming problems: frequency estimation and (distributed) heavy hitters. A less common field of application is stream processing, which is in some sense complementary and more practical, providing efficient and highly scalable frameworks to perform soft real-time generic computation on streams, relying on cloud computing. This duality allows us to apply data streaming solutions to optimize stream processing systems. In this thesis, we provide a novel algorithm to track heavy hitters in distributed streams and two extensions of a well-known algorithm to estimate the frequencies of data items. We also tackle two related problems and their solutions: providing an even partitioning of the item universe based on item weights, and estimating the values carried by the items of the stream. We then apply these results to both network monitoring and stream processing. In particular, we leverage these solutions to perform load shedding and to load balance parallelized operators in stream processing systems.
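
For readers unfamiliar with the heavy hitters problem mentioned in this abstract, the sketch below shows the classical Misra-Gries summary for a single, non-distributed stream. It is offered only as background: the abstract does not say which algorithms the thesis builds on, and this is not the thesis's novel distributed algorithm. The function name and sample stream are illustrative.

```python
def misra_gries(stream, k):
    """Misra-Gries summary using at most k - 1 counters.

    For a stream of n items, every item occurring more than n / k times
    is guaranteed to remain in the summary, and each stored count
    underestimates the true count by at most n / k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a"] * 6 + ["b", "c", "d"]
summary = misra_gries(stream, k=2)
```

The summary uses memory proportional to k rather than to the number of distinct items, which is the kind of small-resource trade-off the data streaming model aims for. A second pass over the stream, where available, can confirm the exact counts of the surviving candidates.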

14

Chen, Longbiao. "Big data-driven optimization in transportation and communication networks". Electronic Thesis or Diss., Sorbonne université, 2018. https://accesdistant.sorbonne-universite.fr/login?url=https://theses-intra.sorbonne-universite.fr/2018SORUS393.pdf.

Abstract:

The evolution of metropolitan structures and the development of urban systems have created various kinds of urban networks. Two of them are of great importance for our daily life: transportation networks, which correspond to human mobility in physical space, and communication networks, which support human interactions in digital space. The rapid expansion in the scope and scale of these two networks raises a series of fundamental research questions on how to optimize them for their users. Major objectives include demand responsiveness, anomaly awareness, cost effectiveness, energy efficiency, and service quality. Despite their distinct design intentions and implementation technologies, transportation and communication networks share common fundamental structures and exhibit similar spatio-temporal dynamics. Correspondingly, a set of key challenges is common to the optimization of both networks, including network profiling, mobility prediction, traffic clustering, and resource allocation. To achieve the optimization objectives and address the research challenges, various analytical models, optimization algorithms, and simulation systems have been proposed and extensively studied across multiple disciplines. These models are generally validated by simulation rather than evaluated in real-world networks, which may lead to sub-optimal results in deployment. With the emergence of ubiquitous sensing, communication and computing paradigms, a massive volume of urban network data can be collected. Recent advances in big data analytics have given researchers great potential to understand these data. Motivated by this trend, we aim to explore a new big data-driven network optimization paradigm, in which we address the above-mentioned research challenges by applying state-of-the-art data analytics methods to achieve network optimization goals.
Following this research direction, in this dissertation we propose two data-driven algorithms for network traffic clustering and user mobility prediction, and apply these algorithms to real-world optimization tasks in transportation and communication networks. First, by analyzing large-scale traffic datasets from both networks, we propose a graph-based traffic clustering algorithm to better understand traffic similarities and variations across different areas and times. On this basis, we apply the traffic clustering algorithm to the following two network optimization applications. 1. Dynamic traffic clustering for demand-responsive bikeshare networks. In this application, we dynamically cluster bike stations with similar usage patterns to obtain stable and predictable cluster-wise bike traffic demands, so as to foresee over-demand stations in the network and enable demand-responsive bike scheduling. Evaluation results using real-world data from New York City and Washington, D.C. show that our framework accurately foresees over-demand clusters (e.g., with 0.882 precision and 0.938 recall in NYC) and significantly outperforms other baseline methods. 2. Complementary traffic clustering for cost-effective C-RAN. In this application, we cluster remote radio heads (RRHs) with complementary traffic patterns (e.g., an RRH in a residential area and an RRH in a business district) to reuse the total capacity of the baseband units (BBUs), so as to reduce the overall deployment cost. We evaluate our framework with real-world network data collected from the city of Milan, Italy and the province of Trentino, Italy. Results show that our method effectively reduces the overall deployment cost to 48.4% and 51.7% of that of the traditional RAN architecture in the two datasets, respectively, and consistently outperforms other baseline methods. Second, by analyzing large-scale user mobility datasets from both networks, we propose [...]

15

Tian, Yongchao. "Accéler la préparation des données pour l'analyse du big data". Thesis, Paris, ENST, 2017. http://www.theses.fr/2017ENST0017/document.

Abstract:

We are living in a big data world, where data is being generated in high volume, at high velocity and in high variety. Big data brings enormous value and benefits, so that data analytics has become a critically important driver of business success across all sectors. However, if the data is not analyzed fast enough, the benefits of big data will be limited or even lost. Despite the existence of many modern large-scale data analysis systems, data preparation, which is the most time-consuming process in data analytics, has not yet received sufficient attention. In this thesis, we study the problem of how to accelerate data preparation for big data analytics. In particular, we focus on two major data preparation steps: data loading and data cleaning. As the first contribution of this thesis, we design DiNoDB, a SQL-on-Hadoop system which achieves interactive-speed query execution without requiring data loading. Modern applications involve heavy batch processing jobs over large volumes of data and at the same time require efficient ad-hoc interactive analytics on temporary data generated in batch processing jobs. Existing solutions largely ignore the synergy between these two aspects and require the entire temporary dataset to be loaded before interactive queries can be served. In contrast, DiNoDB avoids the expensive data loading and transformation phase. The key innovation of DiNoDB is to piggyback the creation of metadata on the batch processing phase; DiNoDB then exploits this metadata to expedite interactive queries. The second contribution is a distributed stream data cleaning system called Bleach. Existing scalable data cleaning approaches rely on batch processing to improve data quality, which is very time-consuming by nature. We target stream data cleaning, in which data is cleaned incrementally in real time. Bleach is the first qualitative stream data cleaning system, and it achieves both real-time violation detection and data repair on a dirty data stream.
It relies on efficient, compact and distributed data structures to maintain the state necessary to clean data, and also supports rule dynamics. We demonstrate that the two resulting systems, DiNoDB and Bleach, both achieve excellent performance compared to state-of-the-art approaches in our experimental evaluations, and can help data scientists significantly reduce the time they spend on data preparation.

16

Rodriguez, Pellière Lineth Arelys. "A qualitative analysis to investigate the enablers of big data analytics that impacts sustainable supply chain". Thesis, Ecole centrale de Nantes, 2019. http://www.theses.fr/2019ECDN0019/document.

Abstract:

Scholars and practitioners have already shown that Big Data and Predictive Analytics, known in the literature as BDPA, can play a pivotal role in transforming and improving the functions of sustainable supply chain analytics (SSCA). However, there is limited knowledge about how BDPA can best be leveraged to grow social, environmental and financial performance simultaneously. Moreover, the literature on SSCA suggests that companies still struggle to implement SSCA practices. Researchers agree that there is still a need to understand the techniques, tools, and enablers of basic SSCA concepts for its adoption; this is even more important when integrating BDPA as a strategic asset across business activities. Hence, this study investigates what the enablers of SSCA are, and which BDPA tools and techniques enable the triple bottom line (3BL) of sustainability performance (environmental, social and financial) through SCA. The thesis adopted moderate constructionism, since it seeks to understand how the enablers of big data affect sustainable supply chain analytics applications and performance. The thesis also adopted a questionnaire and a case study as its research strategy, in order to capture the different perceptions of people and companies regarding the application of big data to sustainable supply chain analytics. The thesis provided better insight into the factors that can affect the adoption of big data in sustainable supply chain analytics. This research identified the factors, based on their variable loadings, that affect the adoption of BDPA for SSCA; the tools and techniques that enable decision making through SSCA; and the coefficient of each factor for facilitating or delaying sustainability adoption, which had not been investigated before. The findings suggest that the tools companies currently use cannot analyze large amounts of data on their own; companies need more appropriate tools for data analysis.

17

Pšurný, Michal. "Big data analýzy a statistické zpracování metadat v archivu obrazové zdravotnické dokumentace". Master's thesis, Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií, 2017. http://www.nusl.cz/ntk/nusl-316821.

Abstract:

This diploma thesis describes big data issues in healthcare, focusing on picture archiving and communication systems (PACS). The DICOM format stores images together with a header that can contain other valuable information. The thesis maps data from 1215 studies.

18

Botes, André Romeo. "An artefact to analyse unstructured document data stores / by André Romeo Botes". Thesis, North-West University, 2014. http://hdl.handle.net/10394/10608.

Abstract:

Structured data stores have been the dominating technologies for the past few decades. Although dominating, structured data stores lack the functionality to handle the ‘Big Data’ phenomenon. A new technology has recently emerged which stores unstructured data and can handle the ‘Big Data’ phenomenon. This study describes the development of an artefact to aid in the analysis of NoSQL document data stores in terms of relational database model constructs. Design science research (DSR) is the methodology implemented in the study and it is used to assist in the understanding, design and development of the problem, artefact and solution. This study explores the existing literature on DSR, in addition to structured and unstructured data stores. The literature review formulates the descriptive and prescriptive knowledge used in the development of the artefact. The artefact is developed using a series of six activities derived from two DSR approaches. The problem domain is derived from the existing literature and a real application environment (RAE). The reviewed literature provided a general problem statement. A representative from NFM (the RAE) is interviewed for a situation analysis providing a specific problem statement. An objective is formulated for the development of the artefact and suggestions are made to address the problem domain, assisting the artefact’s objective. The artefact is designed and developed using the descriptive knowledge of structured and unstructured data stores, combined with prescriptive knowledge of algorithms, pseudo code, continuous design and object-oriented design. The artefact evolves through multiple design cycles into a final product that analyses document data stores in terms of relational database model constructs. The artefact is evaluated for acceptability and utility. This provides credibility and rigour to the research in the DSR paradigm. 
Acceptability is demonstrated through simulation and the utility is evaluated using a real application environment (RAE). A representative from NFM is interviewed for the evaluation of the artefact. Finally, the study is communicated by describing its findings, summarising the artefact and looking into future possibilities for research and application.
MSc (Computer Science), North-West University, Vaal Triangle Campus, 2014

19

Chennen, Kirsley. "Maladies rares et "Big Data" : solutions bioinformatiques vers une analyse guidée par les connaissances : applications aux ciliopathies". Thesis, Strasbourg, 2016. http://www.theses.fr/2016STRAJ076/document.

Abstract:

Over the last decade, biomedical research and medical practice have been revolutionized by the post-genomic era and the emergence of Big Data in biology. Rare diseases, however, are a special case characterized by scarcity, from the number of patients to the available domain knowledge. Nevertheless, rare diseases are of real interest, because the fundamental knowledge accumulated by using them as study models, and the therapeutic solutions that follow from it, can also benefit more common diseases. This thesis focuses on the development of new bioinformatics solutions that integrate Big Data and knowledge-guided approaches to improve the study of rare diseases. In particular, my work resulted in (i) the creation of PubAthena, a literature screening tool for recommending relevant new publications, and (ii) the development of VarScrut, a tool for the analysis of exome datasets that combines multi-level knowledge to improve the resolution rate.

20

Sinkala, Musalula. "Leveraging big data resources and data integration in biology: applying computational systems analyses and machine learning to gain insights into the biology of cancers". Doctoral thesis, Faculty of Health Sciences, 2020. http://hdl.handle.net/11427/32983.

Abstract:

Recently, many "molecular profiling" projects have yielded vast amounts of genetic, epigenetic, transcription, protein expression, metabolic and drug response data for cancerous tumours, healthy tissues, and cell lines. We aim to facilitate a multi-scale understanding of these high-dimensional biological data and the complexity of the relationships between the different data types taken from human tumours. Further, we intend to identify molecular disease subtypes of various cancers, uncover the subtype-specific drug targets and identify sets of therapeutic molecules that could potentially be used to inhibit these targets. We collected data from over 20 publicly available resources. We then leverage integrative computational systems analyses, network analyses and machine learning, to gain insights into the pathophysiology of pancreatic cancer and 32 other human cancer types. Here, we uncover aberrations in multiple cell signalling and metabolic pathways that implicate regulatory kinases and the Warburg effect as the likely drivers of the distinct molecular signatures of three established pancreatic cancer subtypes. Then, we apply an integrative clustering method to four different types of molecular data to reveal that pancreatic tumours can be segregated into two distinct subtypes. We define sets of proteins, mRNAs, miRNAs and DNA methylation patterns that could serve as biomarkers to accurately differentiate between the two pancreatic cancer subtypes. Then we confirm the biological relevance of the identified biomarkers by showing that these can be used together with pattern-recognition algorithms to infer the drug sensitivity of pancreatic cancer cell lines accurately. Further, we evaluate the alterations of metabolic pathway genes across 32 human cancers. We find that while alterations of metabolic genes are pervasive across all human cancers, the extent of these gene alterations varies between them. 
Based on these gene alterations, we define two distinct cancer supertypes that tend to be associated with different clinical outcomes and show that these supertypes are likely to respond differently to anticancer drugs. Overall, we show that the time has already arrived where we can leverage available data resources to potentially elicit more precise and personalised cancer therapies that would yield better clinical outcomes at a much lower cost than is currently being achieved.

21

Adjout, Rehab Moufida. "Big Data : le nouvel enjeu de l'apprentissage à partir des données massives". Thesis, Sorbonne Paris Cité, 2016. http://www.theses.fr/2016USPCD052.

Abstract:

In recent years we have witnessed tremendous growth in the volume of data generated, partly due to the continuous development of information technologies. The capacities for producing, storing and processing data have crossed such a threshold that a new term has emerged: Big Data. Managing these amounts of data requires fundamental changes in the architecture of data management systems in order to adapt to large and complex data. Classical learning tools are poorly suited to this change in volume, both in computational complexity and in processing time; processing is most often centralized and sequential, which makes learning methods dependent on the capacity of the machine used. Single machines do not have the capacity required to process such massive data, which motivates the need for scalable solutions.
This thesis focuses on building scalable data management systems for treating large amounts of data, and in particular on the problems that supervised learning encounters on large volumes of data. Our objective is to study the scalability of supervised machine learning methods in large-scale scenarios. In most existing algorithms and data structures there is a trade-off between efficiency, complexity and scalability. To address these issues, we explore recent techniques for distributed learning, distributing both processing and data to increase the capacity of classical methods without harming their accuracy.
Our contribution consists of two new machine learning approaches for large-scale data, in the area of supervised predictive learning with Multiple Linear Regression and ensemble methods such as Bagging. The first contribution, named MLR-MR, tackles the scalability of Multiple Linear Regression by distributing the processing over a cluster of machines. It learns quickly from massive volumes of existing data using parallel computing and a divide-and-conquer approach, without changing the underlying computation (QR factorization), so that it yields the same coefficients as the classic approach. The second contribution, called "Bagging MR_PR_D" (Bagging based Map Reduce with Distributed PRuning), introduces a scalable approach to Bagging that allows both learning and pruning of the models to be deployed in a distributed environment.
Both approaches have been evaluated on a variety of regression datasets ranging from some thousands to several millions of examples. The experimental results show that the proposed approaches are competitive in terms of predictive performance while significantly reducing training and prediction time.
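The divide-and-conquer idea behind distributed Multiple Linear Regression can be sketched as follows. This is only an illustration under stated assumptions: the abstract describes a QR-factorization-based method, whereas the sketch below swaps in the simpler normal equations, where each partition contributes partial sums of X^T X and X^T y that are added together before solving. All function names are hypothetical, and the partitions are plain Python lists standing in for data held on different machines.

```python
from functools import reduce


def partial_sums(rows):
    """Map step: compute (X^T X, X^T y) for one partition of (features, target) rows."""
    p = len(rows[0][0]) + 1  # +1 for the intercept column
    xtx = [[0.0] * p for _ in range(p)]
    xty = [0.0] * p
    for features, target in rows:
        x = [1.0, *features]
        for i in range(p):
            xty[i] += x[i] * target
            for j in range(p):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def combine(a, b):
    """Reduce step: element-wise sum of two partial results."""
    xtx = [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [u + v for u, v in zip(a[1], b[1])]
    return xtx, xty


def solve(matrix, vector):
    """Solve matrix * beta = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [v] for row, v in zip(matrix, vector)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (aug[i][n] - tail) / aug[i][i]
    return beta


def fit(partitions):
    """Fit intercept and coefficients from any number of data partitions."""
    xtx, xty = reduce(combine, map(partial_sums, partitions))
    return solve(xtx, xty)


# Usage: the same data split into two partitions (targets follow 1 + 2*x1 + 3*x2).
data = [((0.0, 0.0), 1.0), ((1.0, 0.0), 3.0), ((0.0, 1.0), 4.0),
        ((1.0, 1.0), 6.0), ((2.0, 1.0), 8.0), ((1.0, 3.0), 12.0)]
print(fit([data[:3], data[3:]]))
```

Because the partial sums are additive, the coefficients do not depend on how the rows are split, which mirrors the abstract's claim of obtaining the same coefficients as the classic approach. The normal equations are less numerically stable than QR factorization, so this sketch should not be read as a substitute for the method the thesis describes.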

22

Lindh, Felicia, and Anna Södersten. "Användning av Big Data-analys vid revision : En jämförelse mellan revisionsbyråers framställning och revisionsteamens användning". Thesis, Luleå tekniska universitet, Institutionen för ekonomi, teknik, konst och samhälle, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:ltu:diva-85115.

23

Bycroft, Clare. "Genomic data analyses for population history and population health". Thesis, University of Oxford, 2017. https://ora.ox.ac.uk/objects/uuid:c8a76d94-ded6-4a16-b5af-09bbad6292a2.

Abstract:

Many of the patterns of genetic variation we observe today have arisen via the complex dynamics of interactions and isolation of historic human populations. In this thesis, we focus on two important features of the genetics of populations that can be used to learn about human history: population structure and admixture. The Iberian peninsula has a complex demographic history, as well as rich linguistic and cultural diversity. However, previous studies using small genomic regions (such as Y-chromosome and mtDNA) as well as genome-wide data have so far detected limited genetic structure in Iberia. Larger datasets and powerful new statistical methods that exploit information in the correlation structure of nearby genetic markers have made it possible to detect and characterise genetic differentiation at fine geographic scales. We performed the largest and most comprehensive study of Spanish population structure to date by analysing genotyping array data for ~1,400 Spanish individuals genotyped at ~700,000 polymorphic loci. We show that at broad scales, the major axis of genetic differentiation in Spain runs from west to east, while there is remarkable genetic similarity in the north-south direction. Our analysis also reveals striking patterns of geographically-localised and subtle population structure within Spain at scales down to tens of kilometres. We developed and applied new approaches to show how this structure has arisen from a complex and regionally-varying mix of genetic isolation and recent gene-flow within and from outside of Iberia. To further explore the genetic impact of historical migrations and invasions of Iberia, we assembled a data set of 2,920 individuals (~300,000 markers) from Iberia and the surrounding regions of north Africa, Europe, and sub-Saharan Africa. 
Our admixture analysis implies that north African-like DNA in Iberia was mainly introduced in the earlier half (860 - 1120 CE) of the period of Muslim rule in Iberia, and we estimate that the closest modern-day equivalents to the initial migrants are located in Western Sahara. We also find that north African-like DNA in Iberia shows striking regional variation, with near-zero contributions in the Basque regions, low amounts (~3%) in the north east of Iberia, and as much as ~11% in Galicia and Portugal. The UK Biobank project is a large prospective cohort study of ~500,000 individuals from across the United Kingdom, aged 40-69 at recruitment. A rich variety of phenotypic and health-related information is available on each participant, making the resource unprecedented in its size and scope. Understanding the role that genetics plays in phenotypic variation, and its potential interactions with other factors, provides a critical route to a better understanding of human biology and population health. As such, a key component of the UK Biobank resource has been the collection of genome-wide genetic data (~805,000 markers) on every participant using purpose-designed genotyping arrays. These data are the focus of the second part of this thesis. In particular, we designed and implemented a quality control (QC) pipeline to support current and future use of this multi-purpose resource. Genotype data on this scale offer novel opportunities for assessing quality issues, although the wide range of ancestral backgrounds in the cohort also creates particular challenges. We also conducted a set of analyses that reveal properties of the genetic data, including population structure and familial relatedness, that can be important for downstream analyses. 
We find that cryptic relatedness is common among UK Biobank participants (~30% have at least one first cousin relative or closer), and a full range of human population structure is present in this cohort: from world-wide ancestral diversity to subtle population structure at sub-national geographic scales. Finally, we performed a genome-wide association scan on a well-studied and highly polygenic phenotype: standing height. This provided a further test of the effectiveness of our QC, as well as highlighting the potential of the resource to uncover novel regions of association.

24

Al-Odat, Zeyad Abdel-Hameed. "Analyses, Mitigation and Applications of Secure Hash Algorithms". Diss., North Dakota State University, 2020. https://hdl.handle.net/10365/32058.

Abstract:

Cryptographic hash functions are among the most widely used cryptographic primitives; their purpose is to ensure the integrity of a system or its data. Hash functions are also used in conjunction with digital signatures to provide authentication and non-repudiation services. The Secure Hash Algorithms have been developed over time by the National Institute of Standards and Technology (NIST) for security, optimal performance, and robustness. The best-known hash standards are SHA-1, SHA-2, and SHA-3. A secure hash algorithm is considered weak once its security requirements have been broken. The main attacks that threaten the secure hash standards are collision and length extension attacks. A collision attack works by finding two different messages that lead to the same hash. A length extension attack extends the message payload to produce a valid hash digest. Both attacks have already broken some hash standards that follow the Merkle-Damgård construction. This dissertation proposes methodologies to improve and strengthen weak hash standards against collision and length extension attacks. We propose collision-detection approaches that help to detect a collision attack before it takes place. In addition, we propose a suitable replacement supported by a suitable construction. The collision detection methodology helps to protect weak primitives from possible collision attacks using two approaches. The first approach employs a near-collision detection mechanism that was proposed by Marc Stevens. The second approach is our own proposal. Moreover, this dissertation proposes a model that protects the secure hash functions from collision and length extension attacks. The model employs the sponge structure to construct a hash function. The resulting function is strong against collision and length extension attacks. 
Furthermore, to keep the general structure of the Merkle-Damgård functions, we propose a model that replaces the SHA-1 and SHA-2 hash standards using the Merkle-Damgård construction. This model employs the compression function of SHA-1, the function manipulators of SHA-2, and the $10*1$ padding method. For big data in the cloud, this dissertation presents several schemes to ensure data security and authenticity. The schemes include secure storage, anonymous privacy preservation, and auditing of big data in the cloud.

25

Belghache, Elhadi. "AMAS4BigData : analyse dynamique de grandes masses de données par systèmes multi-agents adaptatifs". Thesis, Toulouse 3, 2019. http://www.theses.fr/2019TOU30149.

Abstract:

The era of big data has confronted us with new problems in data management and processing. Today's conventional analysis tools now come close to addressing current problems and delivering satisfactory results at a reasonable cost. But the speed at which new data is generated, and the need to manage changes to that data in both content and structure, give rise to new emerging problems. The AMAS (Adaptive Multi-Agent Systems) theory proposes to solve, through self-organization, complex problems for which no algorithmic solution is known. The cooperative behavior of the agents allows the system to adapt to a dynamic environment so as to keep the system in an adequate functioning state. Ambient systems are a typical example of complex systems that require this kind of approach, and were therefore chosen as the application domain for our work. This thesis aims to explore and describe how the theory of Adaptive Multi-Agent Systems can be applied to big data by providing dynamic analysis capabilities, using a new analytical tool that measures in real time the similarity of the evolution of data. This research presents promising results and is currently being applied in the neOCampus operation, the ambient campus of Université Toulouse III.
Understanding data is the main purpose of data science, and how to achieve it is one of its challenges, especially when dealing with big data. The big data era has brought us new data processing and data management challenges. Existing state-of-the-art analytics tools now come close to handling ongoing challenges and providing satisfactory results at reasonable cost. But the speed at which new data is generated and the need to manage changes in data, in both content and structure, lead to new challenges. This is especially true in the context of complex systems with strong dynamics, for instance large-scale ambient systems. One existing technology that has been shown to be particularly relevant for modeling, simulating and solving problems in complex systems is Multi-Agent Systems. The AMAS (Adaptive Multi-Agent Systems) theory proposes to solve, by self-organization, complex problems for which there is no known algorithmic solution. The cooperative behavior of the agents enables the system to self-adapt to a dynamic environment so as to maintain the system in a functionally adequate state. In this thesis, we apply this theory to Big Data Analytics. In order to find meaning and relevant information drowned in the data flood, while overcoming big data challenges, a novel analytic tool is needed, one able to continuously find relations between data, evaluate them, and detect their changes and evolution over time. The aim of this thesis is to present the AMAS4BigData analytics framework, based on Adaptive Multi-Agent Systems technology, which uses a new data similarity metric, the Dynamics Correlation, for dynamic data relations discovery and dynamic display. This framework is currently being applied in the neOCampus operation, the ambient campus of the University Toulouse III - Paul Sabatier.

26

Lindström, Maja. "Food Industry Sales Prediction : A Big Data Analysis & Sales Forecast of Bake-off Products". Thesis, Umeå universitet, Institutionen för fysik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-184184.

Abstract:

In this thesis, the sales of bread and coffee bread at Coop Värmland AB have been studied. The aim was to find which factors are important for sales and then to predict what sales will look like in the future, in order to reduce waste and increase profits. Big data analysis and data exploration were used to get to know the data and find the factors that affect sales the most. Time series forecasting and supervised machine learning models were used to predict future sales. The main focus was on five different models that were compared and analysed: decision tree regression, random forest regression, artificial neural networks, recurrent neural networks, and a time series model called Prophet. Comparing the observed values to the predictions made by the models indicated that models based on the time series, that is, Prophet and the recurrent neural network, are to be preferred. These two models gave the lowest errors and thus the most accurate results. Prophet yielded mean absolute percentage errors of 8.295% for bread and 9.156% for coffee bread. The recurrent neural network gave mean absolute percentage errors of 7.938% for bread and 13.12% for coffee bread. That is about twice as accurate as the models currently used at Coop, which are based on the mean value of previous sales.
In this thesis, the sales of bread and coffee bread at Coop Värmland AB have been studied. The goal was to find which factors are important for sales and then to predict what sales will look like in the future, in order to reduce waste and increase profits. Big data analysis and exploratory data analysis were used to get to know the data and find the factors that affect sales the most. Time series prediction and various machine learning models were used to forecast future sales. The main focus was on five different models that were compared and analysed. They were decision tree regression, random forest regression, artificial neural networks, recurrent neural networks, and a time series model called Prophet. Comparison between the observed values and the values predicted by the models indicated that the models based on the time series, that is, Prophet and recurrent neural networks, are to be preferred. These two models gave the lowest errors and therefore the most accurate results. Prophet gave mean absolute percentage errors of 8.295% for bread and 9.156% for coffee bread. The recurrent neural network gave mean absolute percentage errors of 7.938% for bread and 13.12% for coffee bread. That is roughly twice as accurate as the models used at Coop today, which are based on the mean of previous sales.
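
The mean absolute percentage error quoted in this abstract can be illustrated with a minimal sketch. This is not the author's code: the function names `mape` and `mean_baseline` and the sales figures are hypothetical, and the baseline is only a guess at what a forecast "based on the mean value of previous sales" might look like.

```python
def mape(observed, predicted):
    """Mean absolute percentage error, returned in percent."""
    if not observed or len(observed) != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equally long")
    if any(o == 0 for o in observed):
        raise ValueError("MAPE is undefined when an observed value is zero")
    total = sum(abs((o - p) / o) for o, p in zip(observed, predicted))
    return 100.0 * total / len(observed)


def mean_baseline(history, horizon):
    """Assumed baseline: forecast every future period as the mean of past sales."""
    mean = sum(history) / len(history)
    return [mean] * horizon


# Hypothetical daily sales counts, for illustration only.
history = [100, 120, 80, 100]
observed = [110, 90]
baseline = mean_baseline(history, len(observed))
print(f"baseline MAPE: {mape(observed, baseline):.2f}%")
```

A lower value means the forecast is closer to the observed sales, which is the sense in which the abstract compares Prophet and the recurrent neural network against the existing approach.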

27

Leonardelli, Lorena. "Grapevine acidity: SVM tool development and NGS data analyses". Doctoral thesis, University of Trento, 2014. http://eprints-phd.biblio.unitn.it/1350/1/PhD-Thesis.pdf.

Abstract:

Single Nucleotide Polymorphisms (SNPs) represent the most abundant type of genetic variation and are a valuable tool for several biological applications, such as linkage mapping, integration of genetic and physical maps, population genetics, and evolutionary and protein structure-function studies. SNP genotyping by mapping DNA reads produced via Next Generation Sequencing (NGS) technologies onto a reference genome is nowadays a very common and convenient approach, but it is still prone to a significant error rate. The need to define true genetic variants in silico in genomic and transcriptomic sequences is prompted by the high cost of experimental validation through re-sequencing or SNP arrays, not only in terms of money but also of time and sample availability. Several open-source tools have recently been developed to identify small variants in whole-genome data, but the candidate variants, provided in the VCF output format, still present a high false positive calling rate. The goal of this thesis is the development of a bioinformatic method that classifies variant calling outputs in order to reduce the number of false positive calls. With the aim of dissecting the molecular bases of grape acidity (Vitis vinifera L.), this tool was then used to select SNPs in two grapevine varieties that show very different organic acid content in the berry. The VCF parameters were used to train a Support Vector Machine (SVM) that classifies the VCF records into true and false positive variants, cleaning the output of the most likely false positive results. The SVM approach was implemented in a new software tool, called VerySNP, and applied to model and non-model organisms. In both cases, the machine learning method efficiently distinguished true positive from false positive variants in both genomic and transcriptomic sequences. 
In the second part of the thesis, VerySNP was applied to identify true SNPs in RNA-seq data of the grapevine variety Gora Chirine, characterized by low acidity, and Sultanine, a normal-acidity variety closely related to Gora. The comparative transcriptomic analysis, crossed with the SNP information, led to the discovery of non-synonymous polymorphisms inside coding regions and thus provided a list of candidate genes potentially affecting acidity in grapevine.

28

Kozas, Anastasios. "OLAP-Analyse von Propagationsprozessen". [S.l. : s.n.], 2005. http://www.bsz-bw.de/cgi-bin/xvms.cgi?SWB12168115.

29

Leonardelli, Lorena. "Grapevine acidity: SVM tool development and NGS data analyses". Doctoral thesis, Università degli studi di Trento, 2014. https://hdl.handle.net/11572/368613.

30

Mansiaux, Yohann. "Analyse d'un grand jeu de données en épidémiologie : problématiques et perspectives méthodologiques". Thesis, Paris 6, 2014. http://www.theses.fr/2014PA066272/document.

Abstract:

The growing size of datasets is an increasingly important issue in epidemiology. The CoPanFlu-France cohort (1450 subjects), which studies the risk of infection with H1N1pdm influenza as a combination of very diverse factors, is one example. The usual statistical methods (e.g. regressions) for exploring associations are limited in this context. We compare the contribution of data-driven exploratory methods with that of hypothesis-driven methods. A first, data-driven approach was used, evaluating the ability to detect determinants of infection of two data mining methods, random forests and boosted regression trees, of the "univariate regressions/multivariate regression" methodology, and of LASSO logistic regression, which performs a selection of the important variables. A simulation approach made it possible to evaluate the true and false positive rates of these methods. We then carried out a hypothesis-driven causal study of infection risk, using a structural equation model (SEM) with latent variables, to study very diverse factors, their relative impact on infection, and their possible relationships. This thesis shows the need to consider new statistical approaches for analysing large datasets in epidemiology. Data mining and the LASSO are credible alternatives to conventional tools for finding associations. SEMs allow the integration of variables describing different dimensions and the explicit modelling of their relationships, and are therefore of major interest in a multidisciplinary study such as CoPanFlu.
The increasing size of datasets is a growing issue in epidemiology. The CoPanFlu-France cohort (1450 subjects), intended to study H1N1 pandemic influenza infection risk as a combination of biological, environmental, socio-demographic and behavioral factors, and in which hundreds of covariates are collected for each patient, is a good example. The statistical methods usually employed to explore associations have many limits in this context. We compare the contribution of data-driven exploratory methods, assuming the absence of a priori hypotheses, to hypothesis-driven methods, requiring the development of preliminary hypotheses.
Firstly, a data-driven study is presented, assessing the ability to detect influenza infection determinants of two data mining methods, the random forests (RF) and the boosted regression trees (BRT), of the conventional logistic regression framework (Univariate Followed by Multivariate Logistic Regression - UFMLR) and of the Least Absolute Shrinkage and Selection Operator (LASSO), with penalty in multivariate logistic regression to achieve a sparse selection of covariates. A simulation approach was used to estimate the True (TPR) and False (FPR) Positive Rates associated with these methods. Between three and twenty-four determinants of infection were identified, the pre-epidemic antibody titer being the only covariate selected with all methods. The mean TPR were the highest for RF (85%) and BRT (80%), followed by the LASSO (up to 78%), while the UFMLR methodology was inefficient (below 50%). A slight increase of alpha risk (mean FPR up to 9%) was observed for logistic regression-based models, LASSO included, while the mean FPR was 4% for the data-mining methods.
Secondly, we propose a hypothesis-driven causal analysis of the infection risk, with a structural-equation model (SEM). We exploited the SEM specificity of modeling latent variables to study very diverse factors, their relative impact on the infection, as well as their possible relationships. Only the latent variables describing host susceptibility (modeled by the pre-epidemic antibody titer) and compliance with preventive behaviors were directly associated with infection. The behavioral factors describing risk perception and preventive measures perception positively influenced compliance with preventive behaviors. The intensity (number and duration) of social contacts was not associated with the infection.
This thesis shows the necessity of considering novel statistical approaches for the analysis of large datasets in epidemiology. Data mining and LASSO are credible alternatives to the tools generally used to explore associations with a high number of variables. SEM allows the integration of variables describing diverse dimensions and the explicit modeling of their relationships; these models are therefore of major interest in a multidisciplinary study such as CoPanFlu.
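
The true and false positive rates mentioned in this abstract can be sketched for a single simulated dataset as follows. This is an illustrative assumption about how such rates are typically computed for variable selection, not the thesis's procedure; the function name `selection_rates` and the covariate names are hypothetical.

```python
def selection_rates(selected, true_determinants, all_covariates):
    """Return (TPR, FPR) of a variable-selection result.

    TPR: share of the simulated true determinants that the method selected.
    FPR: share of the remaining (null) covariates that the method selected.
    """
    selected = set(selected)
    truth = set(true_determinants)
    nulls = set(all_covariates) - truth
    if not truth or not nulls:
        raise ValueError("need at least one true determinant and one null covariate")
    tpr = len(selected & truth) / len(truth)
    fpr = len(selected & nulls) / len(nulls)
    return tpr, fpr


# Hypothetical simulation: 10 covariates, 4 of which truly drive the outcome.
covariates = [f"x{i}" for i in range(1, 11)]
truth = ["x1", "x2", "x3", "x4"]
picked = ["x1", "x2", "x3", "x7"]  # what some selection method returned

tpr, fpr = selection_rates(picked, truth, covariates)
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")
```

Averaging these two numbers over many simulated datasets would give mean rates of the kind the abstract reports for each method.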

31

Scholz, Matthias. "Approaches to analyse and interpret biological profile data". Phd thesis, [S.l.] : [s.n.], 2006. http://deposit.ddb.de/cgi-bin/dokserv?idn=980988799.

32

El, Ouazzani Saïd. "Analyse des politiques publiques en matière d’adoption du cloud computing et du big data : une approche comparative des modèles français et marocain". Thesis, Université Paris-Saclay (ComUE), 2016. http://www.theses.fr/2016SACLE009/document.

Abstract:

Our research is based on an analysis of French and Moroccan public policies on the adoption of Cloud Computing and Big Data technologies. We analysed what the French and Moroccan States do, or fail to do, to address the challenges of digital technology, challenges for which the State must now provide political and technical responses. Indeed, the State, in the Weberian sense, is seeing its ideal-typical representation change into a cyber-State whose mission is to:
- ensure sovereignty by developing national Cloud Computing platforms capable of providing the same services as foreign platforms;
- develop Big Data digital tools coupled with Cloud Computing solutions in order to improve public services;
- develop and ensure the presence of the State and its administrations in cyberspace;
- put Cloud Computing tools at the service of national security in order to counter foreign cyber-intelligence systems.
In a context of profound transformations of society brought about by digital technology, the State must reassert its rights over its own territory. Indeed, the Net offers individuals growing possibilities for sociability through a "digital life" that is a facet, an extension, of real life. This individual digital life evolves along with the transformations of technology, which enhance online sociability, bring constraints related to the processing of personal data, and give rise to debates about privacy. To address security risks, both the French State and the Moroccan State have equipped themselves with legal and technical instruments that rely precisely on Cloud Computing and Big Data technologies.
The French legal arsenal was recently strengthened by the successive and accelerated adoption (without a national debate) of the Military Programming Law (2014-2019), then of the anti-terrorism law (2014) and the Intelligence law (2015). These texts stirred political debate by instilling growing concern about the deployment of digital surveillance systems. This surveillance, or cyber-surveillance, finds its legitimacy in the fight against terrorism by referring, each time, to the notion of national security, a concept whose content is legally vague and dependent on the public authorities. Our work covers four main areas:
1- The evolution of the very conception of the State, which implies the establishment of public cyber-policies as well as the development of a cyber public sector, a cyber public service, and an evolution of the civil service itself.
2- Security issues in the era of the Cyber-State. We were thus able to address notions such as cyber-security, cyber-sovereignty and cyber-surveillance within the Cyber-State.
3- The issues related to the processing of personal data within the Cyber-State, data produced by the daily activities of the cyber-citizen.
4- The technical foundations of the Cyber-State: Cloud Computing and Big Data. These two technologies were thus analysed technically.
It is thanks to collaboration with French and North American partners, the Mairie de Boulogne Billancourt and Engaged Public and CausesLabs, that we were able to show, through a case study, the concrete contribution of Cloud Computing in the context of a French local authority. This is an experiment that should be followed, if not developed further, in the future.
Our research concerns the analysis of public policy on how Cloud Computing and Big Data are adopted by the French and Moroccan States, with a comparative approach between the two models. We covered these main areas: the impact of digital technology on the organization of States and governments; digital public policy in both France and Morocco; the concepts related to data protection and data privacy; the limits between security, in particular national security, and civil liberties; the future and governance of the Internet; and a use case on how the Cloud could change the daily work of a public administration. Our research aims to analyze how the public sector could be affected by the current digital (r)evolution and how States could be changed by the emergence of a new model in the digital arena, called the Cyber-State. This term is a new concept and a new representation of the State in cyberspace. We tried to analyze the digital transformation by looking at how public authorities treat the new economic, security and social issues and challenges, with Cloud Computing and Big Data as the key elements of the digital transformation. We also tried to understand how the States, France and Morocco, face the new security challenges and how they fight terrorism, in particular in cyberspace. We studied the recent adoption of new laws and legislation that aim to regulate digital activities. We analyzed the limits between security risks and civil liberties in the context of terrorist attacks. We analyzed the concepts related to data privacy and data protection. Finally, we also focused on the future of the Internet, the impacts on the current Internet architecture, and the challenges of keeping it free and available as it is today.

33

Leonardelli, Lorena. "Grapevine acidity: SVM tool development and NGS data analyses". Doctoral thesis, country:IT, 2014. http://hdl.handle.net/10449/24467.

34

Carel, Léna. "Analyse de données volumineuses dans le domaine du transport". Thesis, Université Paris-Saclay (ComUE), 2019. http://www.theses.fr/2019SACLG001/document.

Abstract:

The aim of this thesis is to propose new methodologies to apply to public transport data. Indeed, we are surrounded by more and more sensors and computers that generate huge amounts of data. In public transport, contactless cards generate data every time we use them, whether for top-ups or for our journeys. In this thesis, we use these data for two distinct purposes. First, we wanted to be able to detect groups of passengers with similar temporal habits. To do so, we began by using non-negative matrix factorization as a pre-processing tool for clustering. We then introduced the NMF-EM algorithm, which performs dimension reduction and clustering simultaneously for a mixture model of multinomial distributions. Second, we applied regression methods to these data in order to provide a likely range for the number of card validations. Likewise, we applied this methodology to anomaly detection on the network.
The aim of this thesis is to apply new methodologies to public transportation data. Indeed, we are increasingly surrounded by sensors and computers generating huge amounts of data. In the field of public transportation, smart cards generate data about our purchases and our travels every time we use them. In this thesis, we used this data for two purposes. First of all, we wanted to be able to detect groups of passengers with similar temporal habits. To that end, we began by using Non-negative Matrix Factorization as a pre-processing tool for clustering. Then, we introduced the NMF-EM algorithm, which allows simultaneous dimension reduction and clustering on a multinomial mixture model. The second purpose of this thesis is to apply regression methods to these data in order to forecast the number of check-ins on a network and give a range of likely check-ins. We also used this methodology to detect anomalies on the network.


35

Ren, Zheng. "Case Studies on Fractal and Topological Analyses of Geographic Features Regarding Scale Issues". Thesis, Högskolan i Gävle, Samhällsbyggnad, GIS, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:hig:diva-23996.

Abstract:

Scale is an essential notion in geography and geographic information science (GIScience). However, the complex concepts of scale and traditional Euclidean geometric thinking have created tremendous confusion and uncertainty. Traditional Euclidean geometry uses absolute size, regular shape and direction to describe the geographic features around us. In this context, different measuring scales affect the results of geospatial analysis. For example, the measured length of a coastline differs depending on the measuring scale used. Fractal geometry indicates that most geographic features are not measurable because of their fractal nature. To deal with such scale issues, topological and scaling analyses are introduced. They focus on the relationships between geographic features instead of geometric measurements such as length, area and slope. A change of scale affects geometric measurements such as length and area but does not affect topological measurements such as connectivity. This study uses three case studies to demonstrate the scale issues of geographic features through fractal analyses. The first case illustrates that the length of the British coastline is fractal and scale-dependent: the measured length increases as the measuring scale decreases. The yardstick fractal dimension of the British coastline was also calculated. The second case demonstrates that areal geographic features such as the British island are also scale-dependent in terms of area. The box-counting fractal dimension, an important parameter in fractal analysis, was also calculated. The third case focuses on the effects of scale on elevation and on the slope of the terrain surface. The relationship between slope value and resolution in this case is not as simple as in the other two cases: flat and undulating areas generate different results.
These three cases all show the fractal nature of geographic features and indicate the fallacies of scale that exist in geography. Accordingly, the fourth case exemplifies how topological and scaling analyses can be used to deal with such unsolvable scale issues. It analyzes the London OpenStreetMap (OSM) streets with a topological approach to reveal the scaling or fractal property of street networks, and further investigates the ability of the topological metric to predict the presence of Twitter users. The correlation between the number of tweets and the connectivity of London named natural streets is relatively high, with a coefficient of determination r² of 0.5083. Regarding scale issues in geography, the specific technology or method used to handle the scale issues arising from the fractal essence of geographic features does not matter. Instead, shifting the mindset from traditional Euclidean thinking to novel fractal thinking in GIScience is more important. The first three cases revealed the scale issues of geographic features under Euclidean thinking. The fourth case showed that topological analysis can deal with such scale issues under a fractal way of thinking. With the development of data acquisition technologies, the data itself has become more complex than ever before. Fractal thinking effectively describes the characteristics of geographic big data across all scales. It also overcomes the drawbacks of traditional Euclidean thinking and provides deeper insights for GIScience research in the big data era.
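The yardstick fractal dimension mentioned in the abstract can be illustrated with a small sketch. It assumes the usual power-law relation between measured length and yardstick size, L(ε) ∝ ε^(1 − D), and fits the slope on a log-log scale. The function name and the input data are hypothetical: the data are synthetic values for a Koch curve, not measurements of the British coastline from the thesis.

```python
import math

def yardstick_dimension(yardsticks, lengths):
    """Estimate the yardstick (divider) fractal dimension D.

    Fits log(length) = c + (1 - D) * log(yardstick) by ordinary
    least squares and returns D = 1 - slope.
    """
    xs = [math.log(e) for e in yardsticks]
    ys = [math.log(l) for l in lengths]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return 1.0 - slope

# Synthetic data (an assumption, not from the thesis): a Koch curve
# measured with a yardstick of 3**-k has length (4/3)**k.
levels = range(1, 7)
yardsticks = [3.0 ** -k for k in levels]
lengths = [(4.0 / 3.0) ** k for k in levels]

d = yardstick_dimension(yardsticks, lengths)
print(f"estimated yardstick dimension: {d:.4f}")
```

For this synthetic curve the measured length grows as the yardstick shrinks, which is the scale dependence the first case study describes, and the estimate should match the theoretical Koch value log 4 / log 3.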


36

Asadi, Abduljabbar [Verfasser], and Peter [Akademischer Betreuer] Dietrich. "Advanced Data Mining and Machine Learning Algorithms for Integrated Computer-Based Analyses of Big Environmental Databases / Abduljabbar Asadi ; Betreuer: Peter Dietrich". Tübingen : Universitätsbibliothek Tübingen, 2017. http://d-nb.info/1199392979/34.


37

Grimaudo, Luigi. "Data Mining Algorithms for Internet Data: from Transport to Application Layer". Doctoral thesis, Politecnico di Torino, 2014. http://hdl.handle.net/11583/2537089.

Abstract:

Nowadays we live in a data-driven world. Advances in data generation, collection and storage technology have enabled organizations to gather data sets of massive size. Data mining is a discipline that blends traditional data analysis methods with sophisticated algorithms to handle the challenges posed by these new types of data sets. The Internet is a complex and dynamic system, with new protocols and applications arising at a constant pace. All these characteristics make the Internet a valuable and challenging data source and application domain for research, both at the Transport layer, analyzing network traffic flows, and up at the Application layer, focusing on the ever-growing next-generation web services: blogs, micro-blogs, on-line social networks, photo sharing services and many other applications (e.g., Twitter, Facebook, Flickr, etc.). In this thesis we focus on the study, design and development of novel algorithms and frameworks to support large-scale data mining over huge and heterogeneous data volumes, with a particular focus on Internet data as the data source, targeting network traffic classification, on-line social network analysis, recommendation systems, cloud services and Big Data.


38

Coquidé, Célestin. "Analyse de réseaux complexes réels via des méthodes issues de la matrice de Google". Thesis, Bourgogne Franche-Comté, 2020. http://www.theses.fr/2020UBFCD038.

Abstract:

At a time when the Internet is used more and more and people are increasingly connected worldwide, our daily lives have become much easier. Network science, a recent scientific field that originates in mathematics, and more precisely in graph theory, studies such connected complex systems. A network is a mathematical object made of nodes and of links connecting those nodes. Many natural phenomena can be viewed this way, for example mycelium, an underground network able to reach organic nutrients at short and medium range, or the blood circulatory system. Networks also exist at the human scale, with humans as their nodes. In this thesis we are interested in real networks, that is, networks constructed from databases, in order to analyze them and extract information that is otherwise hard to access, since such a network may contain millions of nodes and a hundred times more links. The networks studied are also directed, meaning that links have a direction. A random walk through such a network can be represented by a stochastic matrix called the Google matrix. Its leading eigenvector, the PageRank vector, measures the importance of the nodes. From the Google matrix we can also build a Google matrix of reduced size, the reduced Google matrix, which describes a subnetwork of interest. It captures all direct links between the nodes of the subnetwork as well as the indirect connections between them that arise from diffusion through the rest of the network.
Besides considerably reducing the size of the network and of the associated Google matrix, the reduced Google matrix extracts non-trivial indirect links between the nodes of interest, called hidden links. Using tools built from the Google matrix, in particular the reduced Google matrix, we use the Wikipedia network to identify interactions between universities and their influence on the world, and we use Wikipedia user behaviour data to measure current cultural trends. From economic networks, we measure the economic resistance of the European Union to a rise in the prices of external petroleum and gas, and we establish the interdependence of production sectors for economic powers such as the United States and China. Finally, we build a crisis contagion model and apply it to the World Trade Network and to the Bitcoin transaction network.
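The construction of the Google matrix and its PageRank vector can be sketched as follows. This is a minimal illustration, not the author's code: the damping factor `alpha = 0.85` is the conventional choice rather than a value taken from the abstract, the function names are hypothetical, and the four-node network is invented for the example.

```python
def google_matrix(links, n, alpha=0.85):
    """Build a column-stochastic Google matrix as a list of rows.

    links: iterable of (source, target) pairs of a directed network.
    g[i][j] is the probability of moving from node j to node i.
    alpha is the damping factor (assumed value, see lead-in).
    """
    out = [[] for _ in range(n)]
    for src, dst in links:
        out[src].append(dst)
    # Teleportation term: uniform jump with probability 1 - alpha.
    g = [[(1.0 - alpha) / n] * n for _ in range(n)]
    for j in range(n):
        if out[j]:
            weight = alpha / len(out[j])
            for i in out[j]:
                g[i][j] += weight
        else:
            # Dangling node (no outgoing links): jump uniformly.
            for i in range(n):
                g[i][j] += alpha / n
    return g

def pagerank(g, tol=1e-12, max_iter=1000):
    """Leading eigenvector of g by power iteration, normalized to sum 1."""
    n = len(g)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(g[i][j] * p[j] for j in range(n)) for i in range(n)]
        if sum(abs(a - b) for a, b in zip(new, p)) < tol:
            return new
        p = new
    return p

# Toy directed network (invented): 0 -> 1 -> 2 -> 0, and 3 -> 2.
links = [(0, 1), (1, 2), (2, 0), (3, 2)]
G = google_matrix(links, 4)
P = pagerank(G)
print([round(x, 4) for x in P])
```

In this toy network node 3 has no incoming links, so it only receives the teleportation probability, while node 2, which is pointed to by two nodes, ends up with the largest PageRank.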


39

Walczak, Nathalie. "La protection des données personnelles sur l’internet.- Analyse des discours et des enjeux sociopolitiques". Thesis, Lyon 2, 2014. http://www.theses.fr/2014LYO20052/document.

Abstract:

This thesis, in Information and Communication Sciences, addresses the protection of personal data on the Internet through an analysis of the discourses of four actors concerned with the subject: Internet companies, regulatory authorities, the French population and the national press. The objective is to understand how the blurring of the private and public spheres on the Internet takes shape through the discourses of each of these actors. The question is gaining importance with the development of the Internet, in particular with the proliferation of digital social networks, which offer Internet users various ways to display their extimacy (extimité). The proliferation of tools for interpersonal connection is accompanied by a new contemporary dialectic between private and public, which the people concerned do not always control. This interaction between public and private shifts the border separating the two spheres and can lead to abuses by specialized companies, such as Google or Facebook, in the aggregation of users' personal data. Indeed, databases are at the heart of the economic system of these companies and have acquired a commercial value. However, the commercial use of these data is not necessarily known to the user and can take place without the user's consent, at least without explicit consent. This twofold questioning linked to the blurring of the private and public spheres, namely, first, the individual aspect, where Internet users are encouraged to reveal more and more personal information, and, second, the commodification of data by Internet companies, raises the question of data confidentiality and individual freedoms.
Regulatory authorities, whether in France or at the level of the European Union, try to provide answers to protect Internet users by taking action on the right to be forgotten or by taking legal action against Google, for example, when the company does not comply with the laws in force in the territory concerned. The different angles of approach and the diversity of the actors studied required building a multidimensional corpus in order to compare the different representations. This corpus includes written texts, such as political discourses, the discourses of regulatory authorities, the discourses of Internet companies (specifically Google and Facebook), and press discourses, which occupy a meta-discursive position since they echo the discourses of the other actors. It also includes oral discourses, consisting of interviews conducted specifically for this research with individuals chosen at random from the French population. A quantitative analysis of discourses between 2010 and 2013, the period contemporary with the thesis, made it possible to carry out a first sorting and to select only the discourses most relevant to our hypotheses. The qualitative analysis that followed was based on the previously developed theoretical framework, in order to cross-reference the actors' representations of personal data and to highlight the different visions inherent in this question.


40

El Zant, Samer. "Google matrix analysis of Wikipedia networks". Thesis, Toulouse, INPT, 2018. http://www.theses.fr/2018INPT0046/document.

Abstract:

This thesis concentrates on the analysis of the large directed network representation of Wikipedia. Wikipedia stores valuable fine-grained dependencies among articles by linking webpages together for diverse types of interactions. Our focus is to capture fine-grained and realistic interactions between a subset of webpages in this Wikipedia network. Therefore, we propose to leverage a novel Google matrix representation of the network called the reduced Google matrix. This reduced Google matrix (GR) is derived for the subset of webpages of interest (i.e. the reduced network). As for the regular Google matrix, one component of GR captures the probability of two nodes of the reduced network being directly connected in the full network. But unique to GR, another component accounts for the probability of the two nodes being indirectly connected through all possible paths in the full network. In this thesis, we demonstrate with several case studies that GR offers a reliable and meaningful representation of direct and indirect (hidden) links of the reduced network. We show that GR analysis is complementary to the well-known PageRank analysis and can be leveraged to study the influence of a link variation on the rest of the network structure. Case studies are based on Wikipedia networks originating from different language editions. Interactions between several groups of interest are studied in detail: painters, countries and terrorist groups. For each study, a reduced network is built, and direct and indirect interactions are analyzed and compared with historical, geopolitical or scientific facts. A sensitivity analysis is conducted to understand the influence of the ties in each group on other nodes (e.g. countries in our case). From our analysis, we show that it is possible to extract valuable interactions between painters, countries or terrorist groups. The network of painters obtained from GR captures art-historical facts, such as the grouping of artists by major painting movement.
Well-known interactions between major EU countries or worldwide are underlined as well in our results. Similarly, networks of terrorist groups show relevant ties in line with their ideology or their historical or geopolitical relationships. We conclude this study by showing that reduced Google matrix analysis is a novel, powerful analysis method for large directed networks. We argue that this approach can also find useful application to other types of datasets, such as data represented as dynamic graphs. This approach offers new possibilities to analyze effective interactions in a group of nodes embedded in a large directed network.
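The two components described in the abstract can be made concrete with the formulation commonly used in the reduced Google matrix literature. The abstract itself gives no formula, so the block notation below is an assumption: the nodes are ordered so that the selected nodes of the reduced network (index r) come first and all remaining nodes (index s) come last, and the abstract's GR is written as G_R.

```latex
G = \begin{pmatrix} G_{rr} & G_{rs} \\ G_{sr} & G_{ss} \end{pmatrix},
\qquad
G_R = G_{rr} + G_{rs}\,(\mathbb{1} - G_{ss})^{-1}\,G_{sr}
```

Under this reading, the first term G_rr collects direct transitions between the selected nodes, while the second term sums over all paths that leave the subset, travel through the rest of the network and return, which is where the indirect (hidden) links come from.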


41

Corné, Josefine, and Amanda Ullvin. "Prediktiv analys i vården : Hur kan maskininlärningstekniker användas för att prognostisera vårdflöden?" Thesis, KTH, Skolan för teknik och hälsa (STH), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-211286.

Abstract:

This project was carried out in cooperation with Siemens Healthineers. It aimed to investigate the possibilities for forecasting healthcare flows by examining how big data and machine learning can be used for predictive analytics. The project consisted of two separate case studies. Based on data from previous MRI examinations, the aims were to predict the duration of upcoming MRI examinations and to identify patients at risk of missing a booked examination (no-shows). The case studies were performed with the programming language R, and three different built-in machine learning functions were used to develop predictive models for each case study. The results indicate that, with a larger amount of data of better quality, it would be possible to predict the duration of MRI examinations and potential no-show patients. The conclusion is that these types of predictive models can be used to forecast healthcare flows, which could contribute to increased efficiency and reduced waiting times in healthcare.


42

Barosen, Alexander, and Sadok Dalin. "Analysis and comparison of interfacing, data generation and workload implementation in BigDataBench 4.0 and Intel HiBench 7.0". Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-254332.

Abstract:

One of the major challenges in Big Data is the accurate and meaningful assessment of system performance. Unlike in other systems, minor differences in efficiency can escalate into large differences in cost and power consumption. While there are several tools on the market for measuring the performance of Big Data systems, few of them have been explored in depth. This report investigated the interfacing, data generation and workload implementations of two Big Data benchmarking suites, BigDataBench and HiBench. The purpose of the study was to establish the capabilities of each tool with regard to these three criteria. An exploratory and qualitative approach was used to gather information and analyze each benchmarking tool. Source code, documentation, and reports published by the developers were used as information sources. The results showed that BigDataBench and HiBench were designed similarly with regard to interfacing and data flow during the execution of a workload, with the exception of streaming workloads. BigDataBench provided more realistic data generation, while data generation in HiBench was easier to control. With regard to workload design, the workloads in BigDataBench were designed to be applicable to multiple frameworks, while the workloads in HiBench focused on the Hadoop family. In conclusion, neither benchmarking suite was superior to the other. They were designed for different purposes and should be applied on a case-by-case basis.


43

Inacio, Eduardo Camilo. "Caracterização e modelagem multivariada do desempenho de sistemas de arquivos paralelos". Repositório Institucional da UFSC, 2015. https://repositorio.ufsc.br/xmlui/handle/123456789/132478.

Abstract:

Dissertation (master's) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2015.
The amount of digital data generated daily has increased significantly. Consequently, applications need to handle ever larger volumes of data, in a wide variety of formats and from many sources, at high velocity; this set of challenges is known as Big Data. Since storage devices have not kept pace with the performance evolution observed in processors and main memories, they have become the bottleneck of these applications. Parallel file systems are software solutions that have been widely adopted to mitigate the input and output (I/O) limitations found in current computing platforms. However, the efficient use of these storage solutions depends on understanding their behavior under different conditions of use. This is a particularly challenging task because of the multivariate nature of the problem: the overall performance of the system depends on the relationships among, and the influence of, a large set of variables. This dissertation proposes a multivariate analytical model to represent storage performance behavior in parallel file systems for different configurations and workloads. An extensive set of experiments, executed in four real computing environments, was conducted to identify a significant number of relevant variables, to characterize the influence of these variables on overall system performance, and to build and evaluate the proposed model. As a result of the characterization effort, the effect of three factors not explored in previous work is presented. The evaluation, which compared the behavior and values estimated by the model with those measured in the real environments for different usage scenarios, showed that the proposed model successfully represents system performance. Although some deviations were found in the values estimated by the model, the accuracy of the predictions was considered acceptable given the significantly larger number of usage scenarios evaluated in this research compared with previous proposals in the literature.


44

Britto, Fernando Perez de. "Perspectivas organizacional e tecnológica da aplicação de analytics nas organizações". Pontifícia Universidade Católica de São Paulo, 2016. https://tede2.pucsp.br/handle/handle/19282.


Abstract:

Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
The use of Analytics technologies is gaining prominence in organizations exposed to pressures for greater profitability and efficiency, and to a highly globalized and competitive environment in which cycles of economic growth and recession, and cycles of liberalism and interventionism, whether short or long, are more frequent. However, the use of these technologies is complex and influenced by conceptual, human, organizational and technological aspects, the latter especially with regard to the manipulation and analysis of large volumes of data (Big Data). Based on bibliographic research on the organizational and technological perspectives, this work first deals with the concepts and technologies relevant to the use of Analytics in organizations. It then explores issues related to the alignment between business processes and data and information, the assessment of the potential for using Analytics, the use of Analytics in performance management, in process optimization and as decision support, and the establishment of a continuous improvement process. It closes with a reflection on the directions, approaches, referrals, opportunities and challenges related to the use of Analytics in organizations.


45

Ledieu, Thibault. "Analyse et visualisation de trajectoires de soins par l’exploitation de données massives hospitalières pour la pharmacovigilance". Thesis, Rennes 1, 2018. http://www.theses.fr/2018REN1B032/document.


Abstract:

The massification of health data is an opportunity to answer questions about vigilance and quality of care. In this thesis, we present approaches for exploiting the richness and volume of intra-hospital data for pharmacovigilance use cases and for monitoring the proper use of drugs. The approach is based on modelling intra-hospital care trajectories adapted to the specific needs of pharmacovigilance. Using data from a hospital data warehouse, the aim is to characterize events of interest and identify a link between the administration of these health products and the occurrence of adverse reactions, or to look for cases of drug misuse. The hypothesis put forward in this thesis is that an interactive visual approach is suitable for exploiting these heterogeneous, multi-domain biomedical data in the field of pharmacovigilance. We developed two prototypes for visualizing and analysing care trajectories. The first prototype is a tool for visualizing the patient record as a timeline. The second is a tool for visualizing and mining a cohort of event sequences. This latter tool relies on sequence analysis algorithms (Smith-Waterman, Apriori, GSP) to search for similarity or for patterns of recurring events. These human-machine interfaces were the subject of usability studies on use cases drawn from actual practice, which demonstrated their potential for routine use.


46

Huguet, Thibault. "La société connectée : contribution aux analyses sociologiques des liens entre technique et société à travers l'exemple des outils médiatiques numériques". Thesis, Montpellier 3, 2017. http://www.theses.fr/2017MON30002/document.


Abstract:

Under way for several decades, the development of digital technologies has left a deep mark on the minds and bodies of our contemporary societies. More than a simple social phenomenon, it seems to be generally agreed that we are witnessing a true "anthropological mutation". However, while analyses of the links between technology and society have long been marked by deterministic perspectives, this thesis explores the close dynamic relations that make a technology eminently social and a society intrinsically technical. Adopting a resolutely comprehensive approach, this research seeks to highlight the meanings and systems of meaning surrounding the use of digital media tools, at both the macro-social and micro-social scales, in order to explain causally the place we give to this specific category of objects. The dynamics at work, at both the individual and the collective level, are examined in a socio-logical way, in turn from historical, philosophical, economic, political, social and cultural perspectives. As artefact-symbols of our present-day societies (total social objects), digital media are the technical tools through which we organize the contemporaneity of our relationship with the world: we therefore regard them as a sociological prism through which the connected society can be grasped.


47

Loose, Tobias Sebastian. "Konzept für eine modellgestützte Diagnostik mittels Data Mining am Beispiel der Bewegungsanalyse". Karlsruhe : Univ.-Verl, 2004. http://deposit.d-nb.de/cgi-bin/dokserv?idn=973140607.


48

Maraun, Douglas. "What can we learn from climate data? : Methods for fluctuation, time/scale and phase analysis". PhD thesis, [S.l.] : [s.n.], 2006. http://deposit.ddb.de/cgi-bin/dokserv?idn=981698980.


49

Schwarz, Holger. "Integration von Data-Mining und online analytical processing : eine Analyse von Datenschemata, Systemarchitekturen und Optimierungsstrategien". [S.l. : s.n.], 2003. http://www.bsz-bw.de/cgi-bin/xvms.cgi?SWB10720634.


50

Wong, Shing-tat, and 黃承達. "Disaggregate analyses of stated preference data for capturing parking choice behavior". Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2006. http://hub.hku.hk/bib/B36393678.
