Dissertations / Theses on the topic 'Privacy preserving machine learning'

Consult the top 50 dissertations / theses for your research on the topic 'Privacy preserving machine learning.'

1

Bozdemir, Beyza. "Privacy-preserving machine learning techniques." Electronic Thesis or Diss., Sorbonne université, 2021. http://www.theses.fr/2021SORUS323.

Full text
Abstract:
Machine Learning as a Service (MLaaS) refers to a service that enables companies to delegate their machine learning tasks to one or more untrusted but powerful third parties, namely cloud servers. Thanks to MLaaS, the need for computational resources and domain expertise required to execute machine learning techniques is significantly reduced. Nevertheless, companies face increasing challenges in ensuring data privacy guarantees and compliance with data protection regulations. Executing machine learning tasks over sensitive data requires the design of privacy-preserving protocols for machine learning techniques. In this thesis, we aim to design such protocols for MLaaS and study three machine learning techniques under privacy protection: neural network classification, trajectory clustering, and data aggregation. In our solutions, our goal is to guarantee data privacy while keeping an acceptable level of performance and accuracy/quality when executing the privacy-preserving variants of these machine learning techniques. In order to ensure data privacy, we employ several advanced cryptographic techniques: secure two-party computation, homomorphic encryption, homomorphic proxy re-encryption, multi-key homomorphic encryption, and threshold homomorphic encryption. We have implemented our privacy-preserving protocols and studied the trade-off between privacy, efficiency, and accuracy/quality for each of them.
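As a reading aid, the sketch below illustrates additive secret sharing, one of the simplest building blocks behind the secure two-party computation mentioned in the abstract; the modulus, variable names, and two-party sum are illustrative assumptions, not details taken from the thesis.

```python
import secrets

MODULUS = 2**64  # illustrative finite ring for the shares

def share(value: int) -> tuple[int, int]:
    """Split an integer into two additive shares modulo MODULUS."""
    r = secrets.randbelow(MODULUS)
    return r, (value - r) % MODULUS

def reconstruct(share_a: int, share_b: int) -> int:
    return (share_a + share_b) % MODULUS

# Two parties sum their inputs without revealing them to each other:
x, y = 41, 17
x0, x1 = share(x)                   # party A keeps x0 and sends x1 to party B
y0, y1 = share(y)                   # party B keeps y1 and sends y0 to party A
sum_share_a = (x0 + y0) % MODULUS   # computed locally by party A
sum_share_b = (x1 + y1) % MODULUS   # computed locally by party B
assert reconstruct(sum_share_a, sum_share_b) == x + y
```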
APA, Harvard, Vancouver, ISO, and other styles
2

Hesamifard, Ehsan. "Privacy Preserving Machine Learning as a Service." Thesis, University of North Texas, 2020. https://digital.library.unt.edu/ark:/67531/metadc1703277/.

Full text
Abstract:
Machine learning algorithms based on neural networks have achieved remarkable results and are being extensively used in different domains. However, these algorithms require access to raw data, which is often privacy sensitive. To address this issue, we develop new techniques for running deep neural networks over encrypted data, adapting them to the practical limitations of current homomorphic encryption schemes. We focus on training and classification with neural networks and convolutional neural networks (CNNs). First, we design methods for approximating the activation functions commonly used in CNNs (i.e., ReLU, Sigmoid, and Tanh) with low-degree polynomials, which is essential for efficient evaluation under homomorphic encryption schemes. Then, we train neural networks with the approximating polynomials instead of the original activation functions and analyze the performance of the models. Finally, we implement neural networks and convolutional neural networks over encrypted data and measure the performance of the models.
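The polynomial-approximation step described in the abstract can be illustrated with a minimal NumPy sketch; the degree, fitting interval, and least-squares fit below are illustrative assumptions rather than the exact procedure used in the thesis.

```python
import numpy as np

def poly_approx(f, degree=3, interval=(-4.0, 4.0), n_points=200):
    """Least-squares fit of a low-degree polynomial to an activation function."""
    x = np.linspace(*interval, n_points)
    coeffs = np.polyfit(x, f(x), degree)   # highest-degree coefficient first
    return np.poly1d(coeffs)

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
relu = lambda x: np.maximum(x, 0.0)

p_sigmoid = poly_approx(sigmoid, degree=3)
p_relu = poly_approx(relu, degree=2)

x = np.linspace(-4, 4, 9)
print("sigmoid   ", np.round(sigmoid(x), 3))
print("poly(deg3)", np.round(p_sigmoid(x), 3))
# The polynomial uses only additions and multiplications, so it can be
# evaluated on ciphertexts by a (leveled) homomorphic encryption scheme.
```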
APA, Harvard, Vancouver, ISO, and other styles
3

Grivet Sébert, Arnaud. "Combining differential privacy and homomorphic encryption for privacy-preserving collaborative machine learning." Electronic Thesis or Diss., université Paris-Saclay, 2023. http://www.theses.fr/2023UPASG037.

Full text
Abstract:
The purpose of this PhD is to design protocols to collaboratively train machine learning models while keeping the training data private. To do so, we focused on two privacy tools, namely differential privacy and homomorphic encryption. While differential privacy makes it possible to deliver a functional model that is immune to attacks by end-users on the privacy of the training data, homomorphic encryption allows a server to be used as a totally blind intermediary between the data owners, providing computational resources without any access to information in the clear. Yet, these two techniques are of totally different natures and both entail their own constraints, which may interfere: differential privacy generally requires the use of continuous and unbounded noise, whereas homomorphic encryption can only deal with numbers encoded with a quite limited number of bits. The presented contributions make these two privacy tools work together by coping with their interferences and even leveraging them so that the two techniques may benefit from each other. In our first work, SPEED, we build on the Private Aggregation of Teacher Ensembles (PATE) framework and extend the threat model to deal with an honest-but-curious server by covering the server's computations with a homomorphic layer. We carefully define which operations are realised homomorphically so as to perform as little computation as possible in the costly encrypted domain, while revealing little enough information in the clear for it to be easily protected by differential privacy. This trade-off forced us to realise an argmax operation in the encrypted domain which, even if reasonable, remained expensive. That is why, in another contribution, we propose SHIELD, an argmax operator made inaccurate on purpose, both to satisfy differential privacy and to lighten the homomorphic computation. The last presented contribution combines differential privacy and homomorphic encryption to secure a federated learning protocol. The main challenge of this combination comes from the necessary quantisation of the noise induced by encryption, which complicates the differential privacy analysis and justifies the design and use of a novel quantisation operator that commutes with the aggregation.
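A toy sketch of the kind of interplay discussed above, in which updates must be integer-encoded and perturbed with discrete noise before being aggregated under encryption; the fixed-point scale, the binomial noise standing in for a discrete differential-privacy mechanism, and the plain integer sum standing in for the homomorphic aggregation are all illustrative assumptions, not the protocols of the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
SCALE = 2**10          # fixed-point scale: HE schemes operate on integers
NOISE_HALF_WIDTH = 40  # width of the centered binomial noise, illustrative

def client_update(gradient: np.ndarray) -> np.ndarray:
    """Encode a real-valued update as integers and add discrete noise."""
    encoded = np.round(gradient * SCALE).astype(np.int64)
    noise = rng.binomial(2 * NOISE_HALF_WIDTH, 0.5, size=gradient.shape) - NOISE_HALF_WIDTH
    return encoded + noise          # this integer vector would then be encrypted

# Three clients contribute noisy, integer-encoded updates.
updates = [client_update(np.array([0.50, -0.25, 1.00])) for _ in range(3)]
aggregate = sum(updates)            # what the server would compute under encryption
print("decoded average:", aggregate / (3 * SCALE))
```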
APA, Harvard, Vancouver, ISO, and other styles
4

Cyphers, Bennett James. "A system for privacy-preserving machine learning on personal data." Thesis, Massachusetts Institute of Technology, 2017. http://hdl.handle.net/1721.1/119518.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2017.
This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.
Cataloged from student-submitted PDF version of thesis.
Includes bibliographical references (pages 81-85).
This thesis describes the design and implementation of a system which allows users to generate machine learning models with their own data while preserving privacy. We approach the problem in two steps. First, we present a framework with which a user can collate personal data from a variety of sources in order to generate machine learning models for problems of the user's choosing. Second, we describe AnonML, a system which allows a group of users to share data privately in order to build models for classification. We analyze AnonML under differential privacy and test its performance on real-world datasets. In tandem, these two systems will help democratize machine learning, allowing people to make the most of their own data without relying on trusted third parties.
by Bennett James Cyphers.
M. Eng.
APA, Harvard, Vancouver, ISO, and other styles
5

Esperança, Pedro M. "Privacy-preserving statistical and machine learning methods under fully homomorphic encryption." Thesis, University of Oxford, 2016. https://ora.ox.ac.uk/objects/uuid:a081311c-b25c-462e-a66b-1e4ac4de5fc2.

Full text
Abstract:
Advances in technology have now made it possible to monitor heart rate, body temperature and sleep patterns; continuously track movement; record brain activity; and sequence DNA in the jungle --- all using devices that fit in the palm of a hand. These and other recent developments have sparked interest in privacy-preserving methods: computational approaches which are able to utilise the data without leaking subjects' personal information. Classical encryption techniques have been used very successfully to protect data in transit and in storage. However, the process of encrypting data also renders it unusable in computation. Recently developed fully homomorphic encryption (FHE) techniques improve on this substantially. Unlike classical methods, which require the data to be decrypted prior to computation, homomorphic methods allow data to be simultaneously stored or transferred securely, and used in computation. However, FHE imposes serious constraints on computation, both arithmetic (e.g., no divisions can be performed) and computational (e.g., multiplications become much slower), rendering traditional statistical algorithms inadequate. In this thesis we develop statistical and machine learning methods for outsourced, privacy-preserving analysis of sensitive information under FHE. Specifically, we tackle two problems: (i) classification, using a semiparametric approach based on the naive Bayes assumption and modeling the class decision boundary directly using an approximation to univariate logistic regression; (ii) regression, using two approaches: an accelerated method for least squares estimation based on gradient descent, and a cooperative framework for Bayesian regression based on recursive Bayesian updating in a multi-party setting. Taking into account the constraints imposed by FHE, we analyse the potential of different algorithmic approaches to provide tractable solutions to these problems and give details on the computational costs and performance trade-offs.
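One way to picture "least squares under FHE constraints" as described above is a gradient-descent loop that uses only additions and multiplications, with any division done in the clear before encryption; the step-size choice and iteration count below are illustrative assumptions, not the thesis's actual algorithm.

```python
import numpy as np

def gd_least_squares(X, y, n_iter=200, step=None):
    """Gradient descent for least squares using only +, -, * in the loop,
    i.e. the kind of arithmetic an FHE scheme can evaluate."""
    n, d = X.shape
    if step is None:
        # Step size chosen in the clear, before encryption, so this division
        # never has to happen inside the encrypted computation.
        step = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(d)
    for _ in range(n_iter):
        gradient = X.T @ (X @ beta - y)   # additions and multiplications only
        beta = beta - step * gradient
    return beta

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
true_beta = np.array([1.0, -2.0, 0.5])
y = X @ true_beta + 0.01 * rng.normal(size=100)
print(np.round(gd_least_squares(X, y), 3))  # close to [1.0, -2.0, 0.5]
```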
APA, Harvard, Vancouver, ISO, and other styles
6

Zhang, Kevin M. Eng Massachusetts Institute of Technology. "Tiresias : a peer-to-peer platform for privacy preserving machine learning." Thesis, Massachusetts Institute of Technology, 2020. https://hdl.handle.net/1721.1/129840.

Full text
Abstract:
Thesis: M. Eng., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, February, 2020
Cataloged from student-submitted PDF of thesis.
Includes bibliographical references (pages 81-84).
Big technology firms have a monopoly over user data. To remediate this, we propose a data science platform which allows users to collect their personal data and offer computations on them in a differentially private manner. This platform provides a mechanism for contributors to offer computations on their data in a privacy-preserving way and for requesters -- i.e. anyone who can benefit from applying machine learning to the users' data -- to request computations on user data they would otherwise not be able to collect. Through carefully designed differential privacy mechanisms, we can create a platform which gives people control over their data and enables new types of applications.
by Kevin Zhang.
M. Eng.
M.Eng. Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science
APA, Harvard, Vancouver, ISO, and other styles
7

Langelaar, Johannes, and Mattsson Adam Strömme. "Federated Neural Collaborative Filtering for privacy-preserving recommender systems." Thesis, Uppsala universitet, Avdelningen för systemteknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-446913.

Full text
Abstract:
In this thesis a number of models for recommender systems are explored, all using collaborative filtering to produce their recommendations. Extra focus is put on two models: Matrix Factorization, which is a linear model, and the Multi-Layer Perceptron, which is a non-linear model. To additionally allow training the models without collecting any sensitive data from the users, both models were implemented with federated learning, a technique that does not require the server to know the users' data. The federated version of Matrix Factorization is already well researched and has been shown not to protect the users' data at all: the data can be derived from the information that users must communicate to the server for the model to be trained. However, no prior research could be found on a federated Multi-Layer Perceptron model, so such a model is designed and presented in this thesis. Arguments are put forth in support of the model's privacy preservation, along with a proof that the user data is not analytically derivable by the central server. In addition, new ways to further put the protection of the users' data to the test are discussed. All models are evaluated on two different data sets. The first, MovieLens 1M, contains movie ratings. The second consists of anonymized fund transactions provided by the Swedish bank SEB for this thesis. Test results suggest that the federated versions of the models can achieve recommendation performance similar to that of their non-federated counterparts.
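The federated training loop described above can be summarised by a minimal federated-averaging sketch; the single-weight-vector "model", the number of clients, and the quadratic local objectives are illustrative simplifications, not the Multi-Layer Perceptron setup of the thesis.

```python
import numpy as np

rng = np.random.default_rng(42)
n_clients, dim = 5, 4
global_w = np.zeros(dim)
# Each client holds private data; here we stand it in with a private local optimum.
local_optima = [rng.normal(size=dim) for _ in range(n_clients)]

def local_update(w_global, w_opt, lr=0.5, local_steps=3):
    """A client refines the global weights on its own data and returns
    only the updated weights, never the raw data."""
    w = w_global.copy()
    for _ in range(local_steps):
        w -= lr * (w - w_opt)   # gradient of 0.5 * ||w - w_opt||^2
    return w

for _ in range(20):
    client_weights = [local_update(global_w, opt) for opt in local_optima]
    global_w = np.mean(client_weights, axis=0)   # server-side aggregation

print("global model:        ", np.round(global_w, 3))
print("mean of local optima:", np.round(np.mean(local_optima, axis=0), 3))
```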
APA, Harvard, Vancouver, ISO, and other styles
8

Dou, Yanzhi. "Toward Privacy-Preserving and Secure Dynamic Spectrum Access." Diss., Virginia Tech, 2018. http://hdl.handle.net/10919/81882.

Full text
Abstract:
The dynamic spectrum access (DSA) technique has been widely accepted as a crucial solution to mitigate the potential spectrum scarcity problem. Spectrum sharing between government incumbents and commercial wireless broadband operators/users is one of the key forms of DSA. Two categories of spectrum management methods for shared use between incumbent users (IUs) and secondary users (SUs) have been proposed, i.e., the server-driven method and the sensing-based method. The server-driven method employs a central server to allocate spectrum resources while considering incumbent protection. The central server has access to the detailed IU operating information, and based on some accurate radio propagation model, it is able to allocate spectrum following a particular access enforcement method. Two types of access enforcement methods -- exclusion zone and protection zone -- have been adopted for server-driven DSA systems in the current literature. The sensing-based method is based on recent advances in cognitive radio (CR) technology. A CR can dynamically identify white spaces through various incumbent detection techniques and reconfigure its radio parameters in response to changes of spectrum availability. The focus of this dissertation is to address critical privacy and security issues in the existing DSA systems that may severely hinder the progress of DSA's deployment in the real world. Firstly, we identify serious threats to users' privacy in existing server-driven DSA designs and propose a privacy-preserving design named P2-SAS to address the issue. P2-SAS realizes the complex spectrum allocation process of protection-zone-based DSA in a privacy-preserving way through Homomorphic Encryption (HE), so that none of the IU or SU operation data is exposed to any snooping party, including the central server itself. Secondly, we develop a privacy-preserving design named IP-SAS for the exclusion-zone-based server-driven DSA system. We extend the basic design that only considers semi-honest adversaries to include malicious adversaries, in order to defend against the more practical and complex attack scenarios that can happen in the real world. Thirdly, we redesign our privacy-preserving SAS systems entirely to remove the somewhat-trusted third party (TTP) named Key Distributor, which in essence provides a weak proxy re-encryption online service in P2-SAS and IP-SAS. Instead, in this new system, RE-SAS, we leverage a new crypto system that supports both a strong proxy re-encryption notion and MPC to realize privacy-preserving spectrum allocation. The advantages of RE-SAS are that it can prevent a single point of vulnerability due to the TTP and also increase SAS's service performance dramatically. Finally, we identify the potentially crucial threat of compromised CR devices to the ambient wireless infrastructures and propose a scalable and accurate zero-day malware detection system called GuardCR to enhance CR network security at the device level. GuardCR leverages a host-based anomaly detection technique driven by machine learning, which makes it autonomous in malicious behavior recognition. We boost the performance of GuardCR in terms of accuracy and efficiency by integrating proper domain knowledge of CR software.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
9

García Recuero, Álvaro. "Discouraging abusive behavior in privacy-preserving decentralized online social networks." Thesis, Rennes 1, 2017. http://www.theses.fr/2017REN1S010/document.

Full text
Abstract:
The main goal of this thesis is to evaluate privacy-preserving protocols to detect abuse in future decentralised online social platforms or microblogging services, where often only a limited amount of metadata is available to perform data analytics. Taking such data minimization into account, we obtain acceptable results compared to machine learning techniques that use all available metadata. We draw a series of conclusions and recommendations that will aid in the design and development of a privacy-preserving decentralised social network that discourages abusive behavior.
APA, Harvard, Vancouver, ISO, and other styles
10

Ligier, Damien. "Functional encryption applied to privacy-preserving classification : practical use, performances and security." Thesis, Ecole nationale supérieure Mines-Télécom Atlantique Bretagne Pays de la Loire, 2018. http://www.theses.fr/2018IMTA0040/document.

Full text
Abstract:
Machine Learning (ML) algorithms have proven themselves very powerful. Classification in particular makes it possible to efficiently identify information in large datasets. However, this raises concerns about the privacy of the data, and has brought to the forefront the challenge of designing machine learning algorithms able to preserve confidentiality. This thesis proposes a way to combine certain cryptographic systems with classification algorithms to obtain a privacy-preserving classifier. The cryptographic systems in question belong to the functional encryption family, a generalization of traditional public-key encryption in which decryption keys are associated with functions. We ran experiments on this combination in a realistic scenario using the MNIST dataset of handwritten digit images. In this use case, our system is able to determine which digit is written in an image while seeing only an encryption of the image. We also study its security in this realistic setting, which raises concerns about the use of functional encryption schemes in general, not just in our use case. We then introduce a way to balance, in our construction, classification performance against the risks incurred.
APA, Harvard, Vancouver, ISO, and other styles
11

Sarmadi, Soheil. "On the Feasibility of Profiling, Forecasting and Authenticating Internet Usage Based on Privacy Preserving NetFlow Logs." Scholar Commons, 2018. https://scholarcommons.usf.edu/etd/7568.

Full text
Abstract:
Understanding Internet user behavior and Internet usage patterns is fundamental in developing future access networks and services that meet technical as well as Internet user needs. User behavior is routinely studied and measured, but with different methods depending on the research discipline of the investigator, and these disciplines rarely cross. We tackle this challenge by developing frameworks that use Internet usage statistics as the main features for understanding Internet user behavior, with the purpose of obtaining a complete picture of user behavior and working towards a unified analysis methodology. In this dissertation, Internet usage statistics were collected via privacy-preserving NetFlow logs of 66 student subjects on a college campus, recorded over a month-long period. Once the data was cleaned and split into groups based on different time windows, we applied statistical analysis and found that each user's Internet usage exhibits statistically strong correlation with the same user's Internet usage for the same day over multiple weeks, while it is statistically different from that of other Internet users. We also used time series forecasting to predict future Internet usage based on previous statistics. Subsequently, using state-of-the-art machine learning algorithms, we demonstrate the feasibility of profiling Internet users by looking at their Internet traffic. Specifically, when profiled over a 227-second time window, subjects can be classified with 93.21% precision. We conclude that understanding Internet usage behavior is valuable and can help in developing future access networks and services.
APA, Harvard, Vancouver, ISO, and other styles
12

Chatalic, Antoine. "Efficient and privacy-preserving compressive learning." Thesis, Rennes 1, 2020. http://www.theses.fr/2020REN1S030.

Full text
Abstract:
The topic of this Ph.D. thesis lies on the borderline between signal processing, statistics and computer science. It mainly focuses on compressive learning, a paradigm for large-scale machine learning in which the whole dataset is compressed down to a single vector of randomized generalized moments, called the sketch. An approximate solution of the learning task at hand is then estimated from this sketch, without using the initial data. This framework is by nature suited for learning from distributed collections or data streams, and has already been instantiated with success on several unsupervised learning tasks such as k-means clustering, density fitting using Gaussian mixture models, or principal component analysis. We improve this framework in multiple directions. First, it is shown that perturbing the sketch with additive noise is sufficient to derive (differential) privacy guarantees. Sharp bounds on the noise level required to obtain a given privacy level are provided, and the proposed method is shown empirically to compare favourably with state-of-the-art techniques. Then, the compression scheme is modified to leverage structured random matrices, which reduce the computational cost of the framework and make it possible to learn on high-dimensional data. Lastly, we introduce a new algorithm based on message passing techniques to learn from the sketch for the k-means clustering problem. These contributions open the way for a broader application of the framework.
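The sketching step and the noise-for-privacy idea summarised above can be pictured as follows; the random Fourier features, the sketch size, and the Gaussian noise level are illustrative assumptions rather than the exact operators analysed in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10_000, 2, 64          # n samples in dimension d, sketch of size m
X = rng.normal(size=(n, d)) + np.array([3.0, -1.0])   # toy dataset

# Random generalized moments: averaged random Fourier features of the data.
Omega = rng.normal(size=(d, m))                    # random frequencies
phi = lambda x: np.exp(1j * x @ Omega)             # feature map per sample
sketch = phi(X).mean(axis=0)                       # one m-dimensional vector

# Additive noise on the sketch is what yields differential privacy guarantees.
noise_std = 0.01                                    # illustrative level
noisy_sketch = sketch + noise_std * (rng.normal(size=m) + 1j * rng.normal(size=m))

# The learning task is then solved from the (noisy) sketch alone,
# without ever going back to the raw samples.
print("sketch size:", noisy_sketch.shape, "vs dataset size:", X.shape)
```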
APA, Harvard, Vancouver, ISO, and other styles
13

Nan, Lihao. "Privacy Preserving Representation Learning For Complex Data." Thesis, The University of Sydney, 2019. http://hdl.handle.net/2123/20662.

Full text
Abstract:
Here we consider a common data encryption problem encountered by users who want to disclose some data to gain utility but preserve their private information. Specifically, we consider the inference attack, in which an adversary conducts inference on the disclosed data to gain information about users' private data. Following the privacy funnel \cite{makhdoumi2014information}, assuming that the original data $X$ is transformed into $Z$ before disclosing and the log loss is used for both privacy and utility metrics, the problem can be modeled as finding a mapping $X \rightarrow Z$ that maximizes the mutual information between $X$ and $Z$ subject to a constraint that the mutual information between $Z$ and the private data $S$ is smaller than a predefined threshold $\epsilon$. In contrast to the original study \cite{makhdoumi2014information}, which only focused on discrete data, we consider the more general and practical setting of continuous and high-dimensional disclosed data (e.g., image data). Most previous work on privacy-preserving representation learning is based on adversarial learning or generative adversarial networks, which has been shown to suffer from the vanishing gradient problem, and it is experimentally difficult to eliminate the relationship with the private data $S$ when $Z$ is constrained to retain more information about $X$. Here we propose a simple but effective variational approach that does not rely on adversarial training. Our experimental results show that our approach is stable and outperforms previous methods in terms of both downstream task accuracy and mutual information estimation.
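For readability, the constrained problem paraphrased in the abstract can be written out explicitly; this is the standard privacy-funnel-style formulation together with its Lagrangian relaxation, stated here as a reading aid rather than quoted from the thesis:

```latex
\max_{p(z \mid x)} \; I(X;Z)
\quad \text{subject to} \quad I(S;Z) \le \epsilon ,
\qquad
\mathcal{L} \;=\; I(X;Z) \;-\; \lambda \, I(S;Z), \quad \lambda \ge 0 .
```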
APA, Harvard, Vancouver, ISO, and other styles
14

Ma, Jianjie. "Learning from perturbed data for privacy-preserving data mining." Online access for everyone, 2006. http://www.dissertations.wsu.edu/Dissertations/Summer2006/j%5Fma%5F080406.pdf.

Full text
APA, Harvard, Vancouver, ISO, and other styles
15

Torfi, Amirsina. "Privacy-Preserving Synthetic Medical Data Generation with Deep Learning." Diss., Virginia Tech, 2020. http://hdl.handle.net/10919/99856.

Full text
Abstract:
Deep learning models demonstrated good performance in various domains such as Computer Vision and Natural Language Processing. However, the utilization of data-driven methods in healthcare raises privacy concerns, which creates limitations for collaborative research. A remedy to this problem is to generate and employ synthetic data to address privacy concerns. Existing methods for artificial data generation suffer from different limitations, such as being bound to particular use cases. Furthermore, their generalizability to real-world problems is controversial regarding the uncertainties in defining and measuring key realistic characteristics. Hence, there is a need to establish insightful metrics (and to measure the validity of synthetic data), as well as quantitative criteria regarding privacy restrictions. We propose the use of Generative Adversarial Networks to help satisfy requirements for realistic characteristics and acceptable values of privacy metrics, simultaneously. The present study makes several unique contributions to synthetic data generation in the healthcare domain. First, we propose a novel domain-agnostic metric to evaluate the quality of synthetic data. Second, by utilizing 1-D Convolutional Neural Networks, we devise a new approach to capturing the correlation between adjacent diagnosis records. Third, we employ Convolutional Autoencoders for creating a robust and compact feature space to handle the mixture of discrete and continuous data. Finally, we devise a privacy-preserving framework that enforces Rényi differential privacy as a new notion of differential privacy.
Doctor of Philosophy
Computer programs have been widely used for clinical diagnosis but are often designed with assumptions limiting their scalability and interoperability. The recent proliferation of abundant health data, significant increases in computer processing power, and superior performance of data-driven methods enable a trending paradigm shift in healthcare technology. This involves the adoption of artificial intelligence methods, such as deep learning, to improve healthcare knowledge and practice. Despite the success in using deep learning in many different domains, in the healthcare field, privacy challenges make collaborative research difficult, as working with data-driven methods may jeopardize patients' privacy. To overcome these challenges, researchers propose to generate and utilize realistic synthetic data that can be used instead of real private data. Existing methods for artificial data generation are limited by being bound to special use cases. Furthermore, their generalizability to real-world problems is questionable. There is a need to establish valid synthetic data that overcomes privacy restrictions and functions as a real-world analog for healthcare deep learning data training. We propose the use of Generative Adversarial Networks to simultaneously overcome the realism and privacy challenges associated with healthcare data.
APA, Harvard, Vancouver, ISO, and other styles
16

Chen, Xuhui. "Secure and Privacy-Aware Machine Learning." Case Western Reserve University School of Graduate Studies / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=case1563196765900275.

Full text
APA, Harvard, Vancouver, ISO, and other styles
17

Zhang, Sixiao. "Classifier Privacy in Machine Learning Markets." Case Western Reserve University School of Graduate Studies / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=case1586460332748024.

Full text
APA, Harvard, Vancouver, ISO, and other styles
18

Liu, Menghan. "PULMONARY FUNCTION MONITORING USING PORTABLE ULTRASONOGRAPHY AND PRIVACY-PRESERVING LEARNING." Case Western Reserve University School of Graduate Studies / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=case1481034164747838.

Full text
APA, Harvard, Vancouver, ISO, and other styles
19

Nguyen, Trang Pham Ngoc. "A privacy preserving online learning framework for medical diagnosis applications." Thesis, Edith Cowan University, Research Online, Perth, Western Australia, 2022. https://ro.ecu.edu.au/theses/2503.

Full text
Abstract:
Electronic health records are an important part of a digital healthcare system. Due to their significance, electronic health records have become a major target for hackers, and hospitals/clinics prefer to keep the records at local sites protected by adequate security measures. This introduces challenges in sharing health records. Sharing health records, however, is critical in building an accurate online diagnosis framework. Most local sites have small data sets, and machine learning models developed locally based on small data sets do not have knowledge about other data sets and learning models used at other sites. The work in this thesis coordinates blockchain technology with an online training mechanism in order to address the concerns of privacy and security in a methodical manner. Specifically, it integrates online learning with a permissioned blockchain network, using transaction metadata to broadcast part of the models while keeping patient health information private. This framework can treat different types of machine learning models using the same distributed dataset. The study also outlines the advantages and drawbacks of using blockchain technology to tackle the privacy-preserving predictive modeling problem and to improve interoperability amongst institutions. This study implements the proposed solutions for skin cancer diagnosis as a representative case and shows promising results in preserving security and providing high detection accuracy. The experimentation was done on the ISIC dataset, and the results were 98.57, 99.13, 99.17 and 97.18 in terms of precision, accuracy, F1-score and recall, respectively.
APA, Harvard, Vancouver, ISO, and other styles
20

Sitta, Alessandro. "Privacy-Preserving Distributed Optimization via Obfuscated Gradient Tracking." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2021.

Find full text
Abstract:
As the modern world becomes increasingly digitized and interconnected, distributed systems have proven to be effective in the processing of large volumes of data. In this context, optimization techniques have become essential in an extensive range of domains. However, a major concern regarding privacy in the handling of sensitive data has recently emerged. To address this privacy issue we propose a novel consensus-based privacy-preserving distributed optimization algorithm called Obfuscated Gradient Tracking. The algorithm is characterized by a balanced noise insertion method which protects private data from being revealed to others, while not affecting the result's accuracy. Indeed, we theoretically prove that the introduced perturbations do not affect the convergence properties of the algorithm, which is proven to reach the optimal solution without compromise. Moreover, security against the widely used honest-but-curious adversary model is shown. Furthermore, numerical tests are performed to show the effectiveness of the novel algorithm, both in terms of privacy and convergence properties. Numerical results highlight the attractiveness of Obfuscated Gradient Tracking over standard distributed algorithms when privacy issues are involved. Finally, we present a privacy-preserving distributed Deep Learning application developed using our novel algorithm, with the aim of demonstrating its general applicability.
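The "balanced noise insertion" idea can be conveyed by a toy sketch in which each agent perturbs what it shares with correlated noise that sums to zero across the network, leaving the aggregate, and hence convergence, untouched; this captures only the general flavour with hypothetical variable names, and is not the Obfuscated Gradient Tracking algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(7)
n_agents = 4
private_gradients = rng.normal(size=(n_agents, 3))   # each row is private

# Balanced perturbations: random noise per agent, shifted so each column sums to zero.
noise = rng.normal(scale=5.0, size=(n_agents, 3))
noise -= noise.mean(axis=0)

shared = private_gradients + noise          # what agents reveal to their neighbours

# Any single shared vector hides the private gradient behind large noise,
# but the network-wide average is unchanged, so the optimization is unaffected.
print(np.allclose(shared.mean(axis=0), private_gradients.mean(axis=0)))  # True
```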
APA, Harvard, Vancouver, ISO, and other styles
21

Aryasomayajula, Naga Srinivasa Baradwaj. "Machine Learning Models for Categorizing Privacy Policy Text." University of Cincinnati / OhioLINK, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1535633397362514.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Mhanna, Maggie. "Privacy-Preserving Quantization Learning for Distributed Detection with Applications to Smart Meters." Thesis, Université Paris-Saclay (ComUE), 2017. http://www.theses.fr/2017SACLS047/document.

Full text
Abstract:
This thesis investigates source coding problems in which some secrecy should be ensured with respect to eavesdroppers. In the first part, we provide new fundamental results on both detection-oriented and secrecy-oriented source coding in the presence of side information at the receiving terminals. We provide several new results of optimality and a single-letter characterization of the achievable rate-error-equivocation region, and propose practical algorithms to obtain solutions that are as close as possible to the optimal, which requires the design of optimal quantization in the presence of an eavesdropper. In the second part, we study the problem of secure estimation in a utility-privacy framework where the user is looking either to extract relevant aspects of complex data or to hide them from a potential eavesdropper. The objective is mainly centered on the development of a general framework that combines information theory with communication theory, aiming to provide a novel and powerful tool for security in Smart Grids. From a theoretical perspective, this research was able to quantify fundamental limits and thus the trade-off between security and performance (estimation/detection).
APA, Harvard, Vancouver, ISO, and other styles
23

Rodríguez Hoyos, Ana Fernanda. "Contribution to privacy-enhancing technologies for machine learning applications." Doctoral thesis, Universitat Politècnica de Catalunya, 2020. http://hdl.handle.net/10803/669919.

Full text
Abstract:
For some time now, big data applications have been enabling revolutionary innovation in every aspect of our daily life by taking advantage of lots of data generated from the interactions of users with technology. Supported by machine learning and unprecedented computation capabilities, different entities are capable of efficiently exploiting such data to obtain significant utility. However, since personal information is involved, these practices raise serious privacy concerns. Although multiple privacy protection mechanisms have been proposed, there are some challenges that need to be addressed for these mechanisms to be adopted in practice, i.e., to be “usable” beyond the privacy guarantee offered. To start, the real impact of privacy protection mechanisms on data utility is not clear, thus an empirical evaluation of such impact is crucial. Moreover, since privacy is commonly obtained through the perturbation of large data sets, usable privacy technologies may require not only preservation of data utility but also efficient algorithms in terms of computation speed. Satisfying both requirements is key to encourage the adoption of privacy initiatives. Although considerable effort has been devoted to design less “destructive” privacy mechanisms, the utility metrics employed may not be appropriate, thus the wellness of such mechanisms would be incorrectly measured. On the other hand, despite the advent of big data, more efficient approaches are not being considered. Not complying with the requirements of current applications may hinder the adoption of privacy technologies. In the first part of this thesis, we address the problem of measuring the effect of k-anonymous microaggregation on the empirical utility of microdata. We quantify utility accordingly as the accuracy of classification models learned from microaggregated data, evaluated over original test data. Our experiments show that the impact of the de facto microaggregation standard on the performance of machine-learning algorithms is often minor for a variety of data sets. Furthermore, experimental evidence suggests that the traditional measure of distortion in the community of microdata anonymization may be inappropriate for evaluating the utility of microaggregated data. Secondly, we address the problem of preserving the empirical utility of data. By transforming the original data records to a different data space, our approach, based on linear discriminant analysis, enables k-anonymous microaggregation to be adapted to the application domain of data. To do this, first, data is rotated (projected) towards the direction of maximum discrimination and, second, scaled in this direction, penalizing distortion across the classification threshold. As a result, data utility is preserved in terms of the accuracy of machine learned models for a number of standardized data sets. Afterwards, we propose a mechanism to reduce the running time for the k-anonymous microaggregation algorithm. This is obtained by simplifying the internal operations of the original algorithm. Through extensive experimentation over multiple data sets, we show that the new algorithm gets significantly faster. Interestingly, this remarkable speedup factor is achieved with no additional loss of data utility.
Finally, in a more applied effort, a tool is proposed for protecting the privacy of individuals and organizations by anonymizing sensitive data contained in security logs. Several anonymization mechanisms are designed and implemented based on the definition of a privacy policy, in the context of a European project whose goal is to build a unified security system.
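A minimal sketch of k-anonymous microaggregation, the mechanism whose empirical utility the thesis studies: records are grouped into clusters of at least k similar records and each record is replaced by its group centroid. The projection-and-sort grouping below is an illustrative simplification, not the MDAV-style algorithm evaluated in the thesis.

```python
import numpy as np

def microaggregate(X: np.ndarray, k: int) -> np.ndarray:
    """Replace each record by the centroid of a group of >= k similar records."""
    n = X.shape[0]
    # Illustrative grouping: order records along their first principal direction.
    centered = X - X.mean(axis=0)
    direction = np.linalg.svd(centered, full_matrices=False)[2][0]
    order = np.argsort(centered @ direction)
    X_anon = np.empty_like(X, dtype=float)
    for start in range(0, n, k):
        group = order[start:start + k]
        if len(group) < k:              # fold a too-small tail into the last group
            group = order[start - k:]
        X_anon[group] = X[group].mean(axis=0)
    return X_anon

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 2))
print(np.round(microaggregate(X, k=3), 2))   # every row appears at least 3 times
```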
APA, Harvard, Vancouver, ISO, and other styles
24

Panfilo, Daniele. "Generating Privacy-Compliant, Utility-Preserving Synthetic Tabular and Relational Datasets Through Deep Learning." Doctoral thesis, Università degli Studi di Trieste, 2022. http://hdl.handle.net/11368/3030920.

Full text
Abstract:
Two trends have rapidly been redefining the artificial intelligence (AI) landscape over the past several decades. The first of these is the rapid technological developments that make increasingly sophisticated AI feasible. From a hardware point of view, this includes increased computational power and efficient data storage. From a conceptual and algorithmic viewpoint, fields such as machine learning have undergone a surge and synergies between AI and other disciplines have resulted in considerable developments. The second trend is the growing societal awareness around AI. While institutions are becoming increasingly aware that they have to adopt AI technology to stay competitive, issues such as data privacy and explainability have become part of public discourse. Combined, these developments result in a conundrum: AI can improve all aspects of our lives, from healthcare to environmental policy to business opportunities, but invoking it requires the use of sensitive data. Unfortunately, traditional anonymization techniques do not provide a reliable solution to this conundrum. They are insufficient in protecting personal data, but also reduce the analytic value of data through distortion. However, the emerging study of deep-learning generative models (DLGM) may form a more refined alternative to traditional anonymization. Originally conceived for image processing, these models capture probability distributions underlying datasets. Such distributions can subsequently be sampled, giving new data points not present in the original dataset. However, the overall distribution of synthetic datasets, consisting of data sampled in this manner, is equivalent to that of the original dataset. In our research activity, we study the use of DLGM as an enabling technology for wider AI adoption. To do so, we first study legislation around data privacy with an emphasis on the European Union. In doing so, we also provide an outline of traditional data anonymization technology. We then provide an introduction to AI and deep-learning. Two case studies are discussed to illustrate the field’s merits, namely image segmentation and cancer diagnosis. We then introduce DLGM, with an emphasis on variational autoencoders. The application of such methods to tabular and relational data is novel and involves innovative preprocessing techniques. Finally, we assess the developed methodology in reproducible experiments, evaluating both the analytic utility and the degree of privacy protection through statistical metrics.
APA, Harvard, Vancouver, ISO, and other styles
25

Anderberg, Jesper, and Nazdar Fathullah. "A machine learning approach to enhance the privacy of customers." Thesis, Malmö universitet, Fakulteten för teknik och samhälle (TS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:mau:diva-20629.

Full text
Abstract:
Under ett telefonsamtal mellan en kund och en representant för ett företag utbyts en mängd information. Allt från en kunds namn, identifikationsnummer, hemadress till väderkonversationer och mer vardagliga ämnen. Kunskap om sin kundbas är en viktig del av ett företags verksamhet. Det finns därför ett behov av att analysera samtalet mellan kund och företag, för att utveckla och förbättra den övergripande kundservicen och kundkännedomen. Med nya lagstiftningar som GDPR måste dock särskild hänsyn tas vid lagring av personlig information.I detta arbete, undersöker vi möjligheterna att klassificera data från ett transkriberat röstsamtal med hjälp av två maskininlärnings algoritmer, för att utelämna känslig information.En maskininlärningsmodell implementeras med hjälp av en iterativ systemutvecklingsmetod.Genom att tillämpa Naive Bayes och Support Vector Machine algoritmer klassificeraskänslig data såsom en persons namn och plats. Utvärderingsmetoderna 10-fold crossvalidation, learning curve, classification rapport, och ROC kurva används för att utvärdera systemet. Resultaten visar hur algoritmen når en hög noggrannhet när datasetet innehåller fler datapunkter jämfört med ett dataset med färre antal datapunkter. Slutligen, genom att pre-processera datan ökar algoritmernas noggrannhet.
During a phone call between a customer and a representative for a company, a large amount of information is exchanged: everything from a customer's name, identification number, and home address, to conversations about the weather and more generic subjects. A company's knowledge about its customers is a vital part of its business. Therefore, analyzing the conversation in the form of transcripts might be necessary to develop and improve the overall customer service within a company. However, with new legislation like GDPR, special considerations must be taken into account when storing personal information. In this paper we examine, by using two machine learning algorithms, the possibilities of classifying data from a transcribed phone call in order to leave out sensitive information. The machine learning model is built by following an iterative system development method. By using the Naive Bayes and Support Vector Machine algorithms, classification of sensitive data, such as a person's name and location, is conducted. Evaluation methods such as 10-fold cross-validation, learning curves, the classification report, and the ROC curve are used to evaluate the system. The results show that the algorithms achieve a higher accuracy when the dataset contains more data samples, compared to a dataset with fewer data samples. Furthermore, pre-processing the data increases the accuracy of the machine learning models.
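To make the classification step described in this abstract more concrete, the following rough Python sketch (not the thesis code; the snippets, labels and parameters are invented) trains Naive Bayes and SVM classifiers on labelled transcript snippets and scores them with cross-validation, the same evaluation idea mentioned above.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Hypothetical labelled snippets: 1 = contains sensitive information, 0 = generic talk.
snippets = ["my name is Anna Svensson", "the weather is nice today",
            "I live on Storgatan 5 in Malmö", "can you repeat the question"]
labels = [1, 0, 1, 0]

for clf in (MultinomialNB(), LinearSVC()):
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    # cv=2 only because the toy set is tiny; the thesis uses 10-fold cross-validation.
    scores = cross_val_score(model, snippets, labels, cv=2)
    print(type(clf).__name__, scores.mean())
```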
APA, Harvard, Vancouver, ISO, and other styles
26

Lundmark, Magnus, and Carl-Johan Dahlman. "Differential privacy and machine learning: Calculating sensitivity with generated data sets." Thesis, KTH, Data- och elektroteknik, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-209481.

Full text
Abstract:
Privacy has never been more important to maintain in today's information society. Companies and organizations collect large amounts of data about their users. This information is considered valuable due to its statistical usage, which provides insight into areas such as medicine, economics, or behavioural patterns among individuals. A technique called differential privacy has been developed to ensure that the privacy of individuals is maintained. It enables the creation of useful statistics while the privacy of the individual is preserved. However, the disadvantage of differential privacy is the magnitude of the randomized noise applied to the data in order to hide the individual. This research examined whether it is possible to improve the usability of the privatized result by using machine learning to generate a data set that the noise can be based on. The purpose of the generated data set is to provide a local representation of the underlying data set that is safe to use when calculating the magnitude of the randomized noise. The results of this research show that this approach is currently not a feasible solution, but they point to possible directions for further research aimed at improving the usability of differential privacy. The research indicates that limiting the noise to a lower bound calculated from the underlying data set might be enough to meet all privacy requirements. Furthermore, the accuracy of the machine learning algorithm and its impact on the usability of the noise was not fully investigated and could be of interest in future studies.
Aldrig tidigare har integritet varit viktigare att upprätthålla än i dagens informationssamhälle, där företag och organisationer samlar stora mängder data om sina användare. Merparten av denna information är sedd som värdefull och kan användas för att skapa statistik som i sin tur kan ge insikt inom områden som medicin, ekonomi eller beteendemönster bland individer. För att säkerställa att en enskild individs integritet upprätthålls har en teknik som heter differential privacy utvecklats. Denna möjliggör framtagandet av användbar statistik samtidigt som individens integritet upprätthålls. Differential privacy har dock en nackdel, och det är storleken på det randomiserade bruset som används för att dölja individen i en fråga om data. Denna undersökning undersökte huruvida detta brus kunde förbättras genom att använda maskininlärning för att generera ett data set som bruset kunde baseras på. Tanken var att den genererade datasetet skulle kunna ge en lokal representation av det underliggande datasetet som skulle vara säker att använda vid beräkning av det randomiserade brusets storlek. Forskningen visar att detta tillvägagångssätt för närvarande inte stöds av resultaten. Storleken på det beräknade bruset var inte tillräckligt stort och resulterade därmed i en oacceptabel mängd läckt information. Forskningen visar emellertid att genom att begränsa bruset till en lägsta nivå som är beräknad från det lokala datasetet möjligtvis kan räcka för att uppfylla alla sekretesskrav. Ytterligare forskning behövs för att säkerställa att detta ger den nödvändiga nivån av integritet. Vidare undersöktes inte noggrannheten hos maskininlärningsalgoritmen och dess inverkan på brusets användbarhet vilket kan vara en inriktning för vidare studier.
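As a minimal illustration of the mechanism being discussed (standard Laplace-based differential privacy, not the authors' implementation), the sketch below shows how the noise magnitude follows from a sensitivity value; the thesis asks whether that sensitivity can safely be derived from a generated surrogate dataset rather than from a worst-case bound. All numbers are invented.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=np.random.default_rng(0)):
    """Release true_value with Laplace noise of scale sensitivity / epsilon."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

ages = np.array([34, 27, 45, 61, 52])
# Worst-case (global) sensitivity of a bounded mean query: range / n.
global_sens = (100 - 0) / len(ages)
# Data-dependent sensitivity computed from the records themselves: smaller noise,
# but the value itself can leak information, which is the core tension in the thesis.
local_sens = (ages.max() - ages.min()) / len(ages)
print(laplace_mechanism(ages.mean(), global_sens, epsilon=1.0))
print(laplace_mechanism(ages.mean(), local_sens, epsilon=1.0))
```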
APA, Harvard, Vancouver, ISO, and other styles
27

Vu, Xuan-Son. "Privacy-awareness in the era of Big Data and machine learning." Licentiate thesis, Umeå universitet, Institutionen för datavetenskap, 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-162182.

Full text
Abstract:
Social Network Sites (SNS) such as Facebook and Twitter have been playing a great role in our lives. On the one hand, they help connect people who would not otherwise be connected. Many recent breakthroughs in AI, such as facial recognition [49], were achieved thanks to the amount of data available on the Internet via SNS (hereafter Big Data). On the other hand, due to privacy concerns, many people have tried to avoid SNS to protect their privacy. Similar to the security issue of the Internet protocol, Machine Learning (ML), as the core of AI, was not designed with privacy in mind. For instance, Support Vector Machines (SVMs) try to solve a quadratic optimization problem by deciding which instances of the training dataset are support vectors. This means that the data of people involved in the training process will also be published within the SVM models. Thus, privacy guarantees must be applied to the worst-case outliers, and meanwhile data utility has to be guaranteed. For the above reasons, this thesis studies: (1) how to construct a data federation infrastructure with privacy guarantees in the big data era; (2) how to protect privacy while learning ML models with a good trade-off between data utility and privacy. Regarding the first point, we proposed different frameworks empowered by privacy-aware algorithms that satisfy the definition of differential privacy, the state-of-the-art privacy guarantee by definition. Regarding (2), we proposed different neural network architectures to capture the sensitivities of user data, from which the algorithm itself decides how much it should learn from user data to protect privacy while achieving good performance on a downstream task. The current outcomes of the thesis are: (1) a privacy-guaranteeing data federation infrastructure for analysis of sensitive data; (2) privacy-guaranteeing algorithms for data sharing; (3) privacy-aware data analysis on social network data. The research methods used in this thesis include experiments on real-life social network datasets to evaluate aspects of the proposed approaches. Insights and outcomes from this thesis can be used by both academia and industry to guarantee privacy for data analysis and data sharing of personal data. They also have the potential to facilitate relevant research in privacy-aware representation learning and related evaluation methods.
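For readers unfamiliar with how a learning algorithm can "decide how much to learn" from sensitive data, the following generic DP-SGD-style step is a common technique from this family (it is not necessarily the architecture proposed in the thesis; all values are invented): each user's gradient is clipped and calibrated noise is added before the model update.

```python
import numpy as np

def dp_gradient_step(w, per_example_grads, lr=0.1, clip=1.0, noise_mult=1.1,
                     rng=np.random.default_rng(0)):
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip / (norm + 1e-12)))  # bound each user's influence
    g_sum = np.sum(clipped, axis=0)
    g_noisy = g_sum + rng.normal(0.0, noise_mult * clip, size=g_sum.shape)  # hide individuals
    return w - lr * g_noisy / len(per_example_grads)

w = np.zeros(3)
grads = [np.array([0.5, -1.0, 2.0]), np.array([3.0, 0.1, -0.2])]
print(dp_gradient_step(w, grads))
```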
APA, Harvard, Vancouver, ISO, and other styles
28

Tania, Zannatun Nayem. "Machine Learning with Reconfigurable Privacy on Resource-Limited Edge Computing Devices." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-292105.

Full text
Abstract:
Distributed computing allows effective data storage, processing and retrieval, but it poses security and privacy issues. Sensors are the cornerstone of IoT-based pipelines, since they constantly capture data until it can be analyzed at the central cloud resources. However, these sensor nodes are often constrained by limited resources. Ideally, it is desirable to make all the collected data features private but, due to resource limitations, it may not always be possible. Making all the features private may cause overutilization of resources, which would in turn affect the performance of the whole system. In this thesis, we design and implement a system that is capable of finding the optimal set of data features to make private, given the device's maximum resource constraints and the desired performance or accuracy of the system. Using generalization techniques for data anonymization, we create user-defined injective privacy encoder functions to make each feature of the dataset private. Regardless of the resource availability, some data features are defined by the user as essential features to make private. All other data features that may pose a privacy threat are termed the non-essential features. We propose Dynamic Iterative Greedy Search (DIGS), a greedy search algorithm that takes the resource consumption for each non-essential feature as input and returns the optimal set of non-essential features that can be made private given the available resources. The optimal set contains the features which consume the least resources. We evaluate our system on a Fitbit dataset containing 17 data features, 4 of which are essential private features for a given classification application. Our results show that we can provide 9 additional private features apart from the 4 essential features of the Fitbit dataset containing 1663 records. Furthermore, we can save 26.21% memory as compared to making all the features private. We also test our method on a larger dataset generated with a Generative Adversarial Network (GAN). However, the chosen edge device, a Raspberry Pi, is unable to cater to the scale of the large dataset due to insufficient resources. Our evaluations using 1/8th of the GAN dataset result in 3 extra private features with up to 62.74% memory savings as compared to making all data features private. Maintaining privacy not only requires additional resources, but also has consequences on the performance of the designed applications. However, we discover that privacy encoding has a positive impact on the accuracy of the classification model for our chosen classification application.
Distribuerad databehandling möjliggör effektiv datalagring, bearbetning och hämtning men det medför säkerhets- och sekretessproblem. Sensorer är hörnstenen i de IoT-baserade rörledningarna, eftersom de ständigt samlar in data tills de kan analyseras på de centrala molnresurserna. Dessa sensornoder begränsas dock ofta av begränsade resurser. Helst är det önskvärt att göra alla insamlade datafunktioner privata, men på grund av resursbegränsningar kanske det inte alltid är möjligt. Att göra alla funktioner privata kan orsaka överutnyttjande av resurser, vilket i sin tur skulle påverka prestanda för hela systemet. I denna avhandling designar och implementerar vi ett system som kan hitta den optimala uppsättningen datafunktioner för att göra privata, med tanke på begränsningar av enhetsresurserna och systemets önskade prestanda eller noggrannhet. Med hjälp av generaliseringsteknikerna för data-anonymisering skapar vi användardefinierade injicerbara sekretess-kodningsfunktioner för att göra varje funktion i datasetet privat. Oavsett resurstillgänglighet definieras vissa datafunktioner av användaren som viktiga funktioner för att göra privat. Alla andra datafunktioner som kan utgöra ett integritetshot kallas de icke-väsentliga funktionerna. Vi föreslår Dynamic Iterative Greedy Search (DIGS), en girig sökalgoritm som tar resursförbrukningen för varje icke-väsentlig funktion som inmatning och ger den mest optimala uppsättningen icke-väsentliga funktioner som kan vara privata med tanke på tillgängliga resurser. Den mest optimala uppsättningen innehåller de funktioner som förbrukar minst resurser. Vi utvärderar vårt system på en Fitbit-dataset som innehåller 17 datafunktioner, varav 4 är viktiga privata funktioner för en viss klassificeringsapplikation. Våra resultat visar att vi kan erbjuda ytterligare 9 privata funktioner förutom de 4 viktiga funktionerna i Fitbit-datasetet som innehåller 1663 poster. Dessutom kan vi spara 26; 21% minne jämfört med att göra alla funktioner privata. Vi testar också vår metod på en större dataset som genereras med Generative Adversarial Network (GAN). Den valda kantenheten, Raspberry Pi, kan dock inte tillgodose storleken på den stora datasetet på grund av otillräckliga resurser. Våra utvärderingar med 1=8th av GAN-datasetet resulterar i 3 extra privata funktioner med upp till 62; 74% minnesbesparingar jämfört med alla privata datafunktioner. Att upprätthålla integritet kräver inte bara ytterligare resurser utan har också konsekvenser för de designade applikationernas prestanda. Vi upptäcker dock att integritetskodning har en positiv inverkan på noggrannheten i klassificeringsmodellen för vår valda klassificeringsapplikation.
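A heavily simplified sketch of the greedy idea behind DIGS as described in the abstract is given below (feature names, costs and the budget are invented, and the dynamic accuracy checks of the real algorithm are omitted): given per-feature resource costs for the non-essential features and the budget left after the essential ones, cheaper features are added to the private set first.

```python
def greedy_private_features(costs, budget):
    """costs: dict feature -> resource cost; budget: resources left after essential features."""
    chosen, used = [], 0.0
    for feature, cost in sorted(costs.items(), key=lambda kv: kv[1]):
        if used + cost <= budget:          # take the cheapest features first
            chosen.append(feature)
            used += cost
    return chosen, used

# Hypothetical non-essential features of a wearable dataset and their costs.
non_essential_costs = {"steps": 1.2, "calories": 0.8, "sleep_minutes": 2.5, "floors": 0.6}
print(greedy_private_features(non_essential_costs, budget=3.0))  # floors, calories and steps fit
```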
APA, Harvard, Vancouver, ISO, and other styles
29

Shaham, Sina. "Location Privacy in the Era of Big Data and Machine Learning." Thesis, The University of Sydney, 2019. https://hdl.handle.net/2123/21689.

Full text
Abstract:
Location data of individuals is one of the most sensitive sources of information that, once revealed to ill-intended individuals or service providers, can cause severe privacy concerns. In this thesis, we aim at preserving the privacy of users in telecommunication networks against untrusted service providers as well as improving their privacy in the publication of location datasets. For improving the location privacy of users in telecommunication networks, we consider the movement of users in trajectories and investigate the threats that the query history may pose to location privacy. We develop an attack model based on the Viterbi algorithm, termed the Viterbi attack, which represents a realistic privacy threat in trajectories. Next, we propose a metric called transition entropy that helps to evaluate the performance of dummy generation algorithms, followed by developing a robust dummy generation algorithm that can defend users against the Viterbi attack. We compare and evaluate our proposed algorithm and metric on a publicly available dataset published by Microsoft, i.e., the Geolife dataset. For privacy-preserving data publishing, an enhanced framework for anonymization of spatio-temporal trajectory datasets, termed machine learning based anonymization (MLA), is proposed. The framework consists of a robust alignment technique and a machine learning approach for clustering datasets. The framework and all the proposed algorithms are applied to the Geolife dataset, which includes GPS logs of over 180 users in Beijing, China.
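To illustrate the kind of inference behind the Viterbi attack mentioned above, here is a toy Viterbi decoder over a two-state location model; the states, transition and emission probabilities are invented, whereas the thesis applies this idea to real trajectory and query-history data.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # Dynamic programming over the most likely hidden-state sequence.
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    path = {s: [s] for s in states}
    for o in obs[1:]:
        V.append({})
        new_path = {}
        for s in states:
            prob, prev = max((V[-2][p] * trans_p[p][s] * emit_p[s][o], p) for p in states)
            V[-1][s] = prob
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(states, key=lambda s: V[-1][s])
    return path[best]

states = ["home", "work"]
obs = ["cell_A", "cell_B", "cell_B"]          # observed (possibly obfuscated) reports
start_p = {"home": 0.6, "work": 0.4}
trans_p = {"home": {"home": 0.7, "work": 0.3}, "work": {"home": 0.4, "work": 0.6}}
emit_p = {"home": {"cell_A": 0.9, "cell_B": 0.1}, "work": {"cell_A": 0.2, "cell_B": 0.8}}
print(viterbi(obs, states, start_p, trans_p, emit_p))   # most likely true trajectory
```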
APA, Harvard, Vancouver, ISO, and other styles
30

Romanelli, Marco. "Machine Learning methods for privacy protection: leakage measurement and mechanism design." Doctoral thesis, Università di Siena, 2020. http://hdl.handle.net/11365/1118314.

Full text
Abstract:
In recent years, there has been an increasing involvement of artificial intelligence and machine learning (ML) in countless aspects of our daily lives. In this PhD thesis, we study how notions of information theory and ML can be used to better measure and understand the information leaked by data and / or models, and to design solutions to protect the privacy of the shared information. We first explore the application of ML to estimate the information leakage of a system. We consider a black-box scenario where the system’s internals are either unknown, or too complicated to analyze, and the only available information are pairs of input-output data samples. Previous works focused on counting the frequencies to estimate the input-output conditional probabilities (frequentist approach), however this method is not accurate when the domain of possible outputs is large. To overcome this difficulty, the estimation of the Bayes error of the ideal classifier was recently investigated using ML models and it has been shown to be more accurate thanks to the ability of those models to learn the input-output correspondence. However, the Bayes vulnerability is only suitable to describe one-try attacks. A more general and flexible measure of leakage is the g-vulnerability, which encompasses several different types of adversaries, with different goals and capabilities. We therefore propose a novel ML based approach, that relies on data preprocessing, to perform black-box estimation of the g-vulnerability, formally studying the learnability for all data distributions and evaluating performances in various experimental settings. In the second part of this thesis, we address the problem of obfuscating sensitive information while preserving utility, and we propose a ML approach inspired by the generative adversarial networks paradigm. The idea is to set up two nets: the generator, that tries to produce an optimal obfuscation mechanism to protect the data, and the classifier, that tries to de-obfuscate the data. By letting the two nets compete against each other, the mechanism improves its degree of protection, until an equilibrium is reached. We apply our method to the case of location privacy, and we perform experiments on synthetic data and on real data from the Gowalla dataset. The performance of the obtained obfuscation mechanism is evaluated in terms of the Bayes error, which represents the strongest possible adversary. Finally, we consider that, in classification problems, we try to predict classes observing the values of the features that represent the input samples. Classes and features’ values can be considered respectively as secret input and observable outputs of a system. Therefore, measuring the leakage of such a system is a strategy to tell the most and least informative features apart. Information theory can be considered a useful concept for this task, as the prediction power stems from the correlation, i.e., the mutual information, between features and labels. We compare the Shannon entropy based mutual information to the Rényi min-entropy based one, both from the theoretical and experimental point of view showing that, in general, the two approaches are incomparable, in the sense that, depending on the considered dataset, sometimes the Shannon entropy based method outperforms the Rényi min-entropy based one and sometimes the opposite occurs.
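The frequentist baseline that the thesis improves upon can be sketched in a few lines: estimate the one-try Bayes vulnerability of a system from input-output samples by letting the adversary guess the most frequent secret for each observed output. The samples below are invented, and the g-vulnerability and ML-based estimators of the thesis generalize this considerably.

```python
from collections import Counter

def bayes_vulnerability(samples):
    """samples: list of (secret_input, observable_output) pairs."""
    joint = Counter(samples)
    n = len(samples)
    outputs = {y for _, y in samples}
    # For each output, the adversary guesses the most likely secret: sum_y max_x P(x, y).
    return sum(max(c for (x, y2), c in joint.items() if y2 == y) for y in outputs) / n

samples = [("s0", "o0"), ("s0", "o0"), ("s1", "o0"),
           ("s1", "o1"), ("s1", "o1"), ("s0", "o1")]
print(bayes_vulnerability(samples))  # 4/6: guess s0 on o0 and s1 on o1
```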
APA, Harvard, Vancouver, ISO, and other styles
31

Carlsson, Robert. "Privacy-Preserved Federated Learning : A survey of applicable machine learning algorithms in a federated environment." Thesis, Uppsala universitet, Institutionen för informationsteknologi, 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-424383.

Full text
Abstract:
There is potential for collaborative machine learning in the fields of medicine and finance. These areas gather data which can be used to develop machine learning models that could predict everything from sickness in patients to acts of economic crime such as fraud. The problem is that the collected data is mostly of a confidential nature and should be handled with precaution. This makes the standard way of doing machine learning - gathering data at one centralized server - undesirable. The safety of the data has to be taken into account. In this project we explore the Federated Learning approach of "bringing the code to the data, instead of the data to the code". It is a decentralized way of doing machine learning where models are trained on connected devices and data is never shared, keeping the data privacy-preserved.
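A minimal sketch of the federated averaging idea ("bringing the code to the data") is shown below; the local objective, client data and hyperparameters are placeholders rather than anything from the thesis. Each client refines the model on its own data, and only the weights are averaged centrally.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    w = weights.copy()
    for _ in range(epochs):                       # plain least-squares gradient steps
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg_round(global_w, clients):
    sizes = np.array([len(y) for _, y in clients])
    locals_ = [local_update(global_w, X, y) for X, y in clients]   # raw data never leaves the client
    return np.average(locals_, axis=0, weights=sizes)              # weight by local dataset size

rng = np.random.default_rng(0)
clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(4)]
w = np.zeros(3)
for _ in range(10):
    w = fedavg_round(w, clients)
print(w)
```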
APA, Harvard, Vancouver, ISO, and other styles
32

Zheng, Yao. "Privacy Preservation for Cloud-Based Data Sharing and Data Analytics." Diss., Virginia Tech, 2016. http://hdl.handle.net/10919/73796.

Full text
Abstract:
Data privacy is a globally recognized human right for individuals to control the access to their personal information, and bar the negative consequences from the use of this information. As communication technologies progress, the means to protect data privacy must also evolve to address new challenges that come into view. Our research goal in this dissertation is to develop privacy protection frameworks and techniques suitable for the emerging cloud-based data services, in particular privacy-preserving algorithms and protocols for the cloud-based data sharing and data analytics services. Cloud computing has enabled users to store, process, and communicate their personal information through third-party services. It has also raised privacy issues regarding losing control over data, mass harvesting of information, and un-consented disclosure of personal content. Above all, the main concern is the lack of understanding about data privacy in cloud environments. Currently, the cloud service providers either advocate the principle of third-party doctrine and deny users' rights to protect their data stored in the cloud; or rely on the notice-and-choice framework and present users with ambiguous, incomprehensible privacy statements without any meaningful privacy guarantee. In this regard, our research has three main contributions. First, to capture users' privacy expectations in cloud environments, we conceptually divide personal data into two categories, i.e., visible data and invisible data. The visible data refer to information users intentionally create, upload to, and share through the cloud; the invisible data refer to users' information retained in the cloud that is aggregated, analyzed, and repurposed without their knowledge or understanding. Second, to address users' privacy concerns raised by cloud computing, we propose two privacy protection frameworks, namely individual control and use limitation. The individual control framework emphasizes users' capability to govern the access to the visible data stored in the cloud. The use limitation framework emphasizes users' expectation to remain anonymous when the invisible data are aggregated and analyzed by cloud-based data services. Finally, we investigate various techniques to accommodate the new privacy protection frameworks, in the context of four cloud-based data services: personal health record sharing, location-based proximity test, link recommendation for social networks, and face tagging in photo management applications. For the first case, we develop a key-based protection technique to enforce fine-grained access control to users' digital health records. For the second case, we develop a key-less protection technique to achieve location-specific user selection. For the latter two cases, we develop distributed learning algorithms to prevent large-scale data harvesting. We further combine these algorithms with query regulation techniques to achieve user anonymity. The picture that is emerging from the above works is a bleak one. Regarding personal data, the reality is we can no longer control them all. As communication technologies evolve, the scope of personal data has expanded beyond local, discrete silos, and integrated into the Internet. The traditional understanding of privacy must be updated to reflect these changes. In addition, because privacy is a particularly nuanced problem that is governed by context, there is no one-size-fits-all solution.
While some cases can be salvaged either by cryptography or by other means, in others a rethinking of the trade-offs between utility and privacy appears to be necessary.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
33

DEMETRIO, LUCA. "Formalizing evasion attacks against machine learning security detectors." Doctoral thesis, Università degli studi di Genova, 2021. http://hdl.handle.net/11567/1035018.

Full text
Abstract:
Recent work has shown that adversarial examples can bypass machine learning-based threat detectors relying on static analysis by applying minimal perturbations. To preserve malicious functionality, previous attacks either apply trivial manipulations (e.g. padding), potentially limiting their effectiveness, or require running computationally-demanding validation steps to discard adversarial variants that do not correctly execute in sandbox environments. While machine learning systems for detecting SQL injections have been proposed in the literature, no attacks have been tested against the proposed solutions to assess the effectiveness and robustness of these methods. In this thesis, we overcome these limitations by developing RAMEn, a unifying framework that (i) can express attacks for different domains, (ii) generalizes previous attacks against machine learning models, and (iii) uses functions that preserve the functionality of manipulated objects. We provide new attacks for both Windows malware and SQL injection detection scenarios by exploiting the format used for representing these objects. To show the efficacy of RAMEn, we provide experimental results of our strategies in both white-box and black-box settings. The white-box attacks against Windows malware detectors show that it takes only 2% of the input size of the target to evade detection with ease. To further speed up the black-box attacks, we overcome the issues mentioned before by presenting a novel family of black-box attacks that are both query-efficient and functionality-preserving, as they rely on the injection of benign content, which will never be executed, either at the end of the malicious file or within some newly-created sections, encoded in an algorithm called GAMMA. We also evaluate whether GAMMA transfers to other commercial antivirus solutions, and surprisingly find that it can evade many commercial antivirus engines. For evading SQLi detectors, we create WAF-A-MoLE, a mutational fuzzer that exploits random mutations of the input samples, keeping alive only the most promising ones. WAF-A-MoLE is capable of defeating detectors built with different architectures by using the novel practical manipulations we have proposed. To facilitate reproducibility and future work, we open-source our framework and corresponding attack implementations. We conclude by discussing the limitations of current machine learning-based malware detectors, along with potential mitigation strategies based on embedding domain knowledge coming from subject-matter experts naturally into the learning process.
APA, Harvard, Vancouver, ISO, and other styles
34

Mivule, Kato. "An investigation of data privacy and utility using machine learning as a gauge." Thesis, Bowie State University, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=3619387.

Full text
Abstract:

The purpose of this investigation is to study and pursue a user-defined approach in preserving data privacy while maintaining an acceptable level of data utility using machine learning classification techniques as a gauge in the generation of synthetic data sets. This dissertation will deal with data privacy, data utility, machine learning classification, and the generation of synthetic data sets. Hence, data privacy and utility preservation using machine learning classification as a gauge is the central focus of this study. Many organizations that transact in large amounts of data have to comply with state, federal, and international laws to guarantee that the privacy of individuals and other sensitive data is not compromised. Yet at some point during the data privacy process, data loses its utility - a measure of how useful a privatized dataset is to the user of that dataset. Data privacy researchers have documented that attaining an optimal balance between data privacy and utility is an NP-hard challenge, thus an intractable problem. Therefore we propose the classification error gauge (x-CEG) approach, a data utility quantification concept that employs machine learning classification techniques to gauge data utility based on the classification error. In the initial phase of this proposed approach, a data privacy algorithm such as differential privacy, Gaussian noise addition, generalization, and or k-anonymity is applied on a dataset for confidentiality, generating a privatized synthetic data set. The privatized synthetic data set is then passed through a machine learning classifier, after which the classification error is measured. If the classification error is lower or equal to a set threshold, then better utility might be achieved, otherwise, adjustment to the data privacy parameters is made and then the refined synthetic data set is sent to the machine learning classifier; the process repeats until the error threshold is reached. Additionally, this study presents the Comparative x-CEG concept, in which a privatized synthetic data set is passed through a series of classifiers, each of which returns a classification error, and the classifier with the lowest classification error is chosen after parameter adjustments, an indication of better data utility. Preliminary results from this investigation show that fine-tuning parameters in data privacy procedures, for example in the case of differential privacy, and increasing weak learners in the ensemble classifier for instance, might lead to lower classification error, thus better utility. Furthermore, this study explores the application of this approach by employing signal processing techniques in the generation of privatized synthetic data sets and improving data utility. This dissertation presents theoretical and empirical work examining various data privacy and utility methodologies using machine learning classification as a gauge. Similarly this study presents a resourceful approach in the generation of privatized synthetic data sets, and an innovative conceptual framework for the data privacy engineering process.
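The x-CEG loop described in this abstract can be sketched as follows (the privatization step is simplified to Gaussian noise addition and the classifier to logistic regression; the threshold, parameters and data are invented): privatize, train, measure the classification error, and adjust the privacy parameter until the error threshold is met.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def x_ceg(X, y, error_threshold=0.2, sigma=2.0, step=0.75, max_iter=10):
    for _ in range(max_iter):
        X_priv = X + np.random.default_rng(0).normal(0, sigma, X.shape)  # privatize the data
        Xtr, Xte, ytr, yte = train_test_split(X_priv, y, random_state=0)
        err = 1 - LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)
        if err <= error_threshold:
            return sigma, err            # acceptable utility reached
        sigma *= step                    # otherwise relax the privacy parameter and retry
    return sigma, err

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(x_ceg(X, y))
```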

APA, Harvard, Vancouver, ISO, and other styles
35

Sharma, Sagar. "Towards Data and Model Confidentiality in Outsourced Machine Learning." Wright State University / OhioLINK, 2019. http://rave.ohiolink.edu/etdc/view?acc_num=wright1567529092809275.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Petrucci, Edoardo. "A Personalized Privacy Management Framework for Android Applications." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2016.

Find full text
Abstract:
Every day, smartphone and tablet users, often without realizing it, share an enormous amount of information through various applications. In current operating systems, the absence of mechanisms that adequately safeguard the user motivated this research work towards the development of a novel framework. An in-depth study of the state of the art of solutions with the same objectives was necessary. Both theoretical and practical models were examined, with careful analysis of the related code. The work, carried out in close contact with colleagues at the University of Central Florida and through shared knowledge with them, led to important results. This work produced a personalized framework for managing privacy in mobile applications which, specifically, was developed for Android OS and requires root permissions in order to operate. The framework exploits the functionality offered by the Xposed Framework, with the result of implementing modifications to the operating system without having to change the code of Android or of the applications running on it. The developed framework controls the access of the various running applications to the user's sensitive information and estimates the importance that this information has for the user. The information collected by the framework about the user's preferences and assessments is used to build a decision model that is exploited by a machine-learning algorithm to improve the system's interaction with the user and to predict the user's own decisions about their privacy. This thesis achieves the objectives mentioned above and pays particular attention to limiting the pervasiveness of the privacy management system in the user's everyday experience with mobile devices.
APA, Harvard, Vancouver, ISO, and other styles
37

Haupt, Johannes Sebastian. "Machine Learning for Marketing Decision Support." Doctoral thesis, Humboldt-Universität zu Berlin, 2020. http://dx.doi.org/10.18452/21554.

Full text
Abstract:
Die Digitalisierung der Wirtschaft macht das Customer Targeting zu einer wichtigen Schnittmenge von Marketing und Wirtschaftsinformatik. Marketingtreibende können auf Basis von soziodemografischen und Verhaltensdaten gezielt einzelne Kunden mit personalisierten Botschaften ansprechen. Diese Arbeit erweitert die Perspektive der Forschung im Bereich der modellbasierten Vorhersage von Kundenverhalten durch 1) die Entwicklung und Validierung neuer Methoden des maschinellen Lernens, die explizit darauf ausgelegt sind, die Profitabilität des Customer Targeting im Direktmarketing und im Kundenbindungsmanagement zu optimieren, und 2) die Untersuchung der Datenerfassung mit Ziel des Customer Targeting aus Unternehmens- und Kundensicht. Die Arbeit entwickelt Methoden welche den vollen Umfang von E-Commerce-Daten nutzbar machen und die Rahmenbedingungen der Marketingentscheidung während der Modellbildung berücksichtigen. Die zugrundeliegenden Modelle des maschinellen Lernens skalieren auf hochdimensionale Kundendaten und ermöglichen die Anwendung in der Praxis. Die vorgeschlagenen Methoden basieren zudem auf dem Verständnis des Customer Targeting als einem Problem der Identifikation von Kausalzusammenhängen. Die Modellschätzung sind für die Umsetzung profitoptimierter Zielkampagnen unter komplexen Kostenstrukturen ausgelegt. Die Arbeit adressiert weiterhin die Quantifizierung des Einsparpotenzials effizienter Versuchsplanung bei der Datensammlung und der monetären Kosten der Umsetzung des Prinzips der Datensparsamkeit. Eine Analyse der Datensammlungspraktiken im E-Mail-Direktmarketing zeigt zudem, dass eine Überwachung des Leseverhaltens in der Marketingkommunikation von E-Commerce-Unternehmen ohne explizite Kundenzustimmung weit verbreitet ist. Diese Erkenntnis bildet die Grundlage für ein auf maschinellem Lernen basierendes System zur Erkennung und Löschung von Tracking-Elementen in E-Mails.
The digitization of the economy has fundamentally changed the way in which companies interact with customers and made customer targeting a key intersection of marketing and information systems. Building models of customer behavior at scale requires development of tools at the intersection of data management and statistical knowledge discovery. This dissertation widens the scope of research on predictive modeling by focusing on the intersections of model building with data collection and decision support. Its goals are 1) to develop and validate new machine learning methods explicitly designed to optimize customer targeting decisions in direct marketing and customer retention management and 2) to study the implications of data collection for customer targeting from the perspective of the company and its customers. First, the thesis proposes methods that utilize the richness of e-commerce data, reduce the cost of data collection through efficient experiment design and address the targeting decision setting during model building. The underlying state-of-the-art machine learning models scale to high-dimensional customer data and can be conveniently applied by practitioners. These models further address the problem of causal inference that arises when the causal attribution of customer behavior to a marketing incentive is difficult. Marketers can directly apply the model estimates to identify profitable targeting policies under complex cost structures. Second, the thesis quantifies the savings potential of efficient experiment design and the monetary cost of an internal principle of data privacy. An analysis of data collection practices in direct marketing emails reveals the ubiquity of tracking mechanisms without user consent in e-commerce communication. These results form the basis for a machine-learning-based system for the detection and deletion of tracking elements from emails.
APA, Harvard, Vancouver, ISO, and other styles
38

Darwish, Roba N. Darwish. "A Detailed Study of User Privacy Behavior in Social Media." Kent State University / OhioLINK, 2017. http://rave.ohiolink.edu/etdc/view?acc_num=kent1510704797892479.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

Dinh, The Canh. "Distributed Algorithms for Fast and Personalized Federated Learning." Thesis, The University of Sydney, 2023. https://hdl.handle.net/2123/30019.

Full text
Abstract:
The significant increase in the number of cutting-edge user equipment (UE) devices results in phenomenal growth of the data volume generated at the edge. This shift fuels the booming trend of an emerging technique named Federated Learning (FL). In contrast to traditional methods in which data is collected and processed centrally, FL builds a global model from contributions of the UEs' models without sending private data, and thus effectively ensures data privacy. However, FL faces challenges in non-identically distributed (non-IID) data, communication cost, and convergence rate. Firstly, we propose first-order optimization FL algorithms named FedApprox and FEDL to improve the convergence rate. We propose FedApprox exploiting proximal stochastic variance-reduced gradient methods and extract insights from convergence conditions via the algorithm's parameter control. We then propose FEDL to handle heterogeneous UE data and characterize the trade-off between local computation and global communication. Experimentally, FedApprox outperforms vanilla FedAvg while FEDL outperforms FedApprox and FedAvg. Secondly, we consider the communication between edges to be more costly than local computational overhead. We propose DONE, a distributed approximate Newton-type algorithm for communication-efficient federated edge learning. DONE approximates the Newton direction using classical Richardson iteration on each edge. Experimentally, DONE attains a comparable performance to Newton's method and outperforms first-order algorithms. Finally, we address the non-IID issue by proposing pFedMe, a personalized FL algorithm using Moreau envelopes. pFedMe achieves quadratic speedup for strongly convex and sublinear speedup of order 2/3 for smooth nonconvex objectives. We then propose FedU, a Federated Multitask Learning algorithm using Laplacian regularization to leverage the relationships among the users' models. Experimentally, pFedMe outperforms FedAvg and Per-FedAvg while FedU outperforms pFedMe and MOCHA.
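A rough sketch of the Moreau-envelope personalization step behind methods such as pFedMe is given below (toy quadratic client losses and invented step sizes; the actual algorithm and its convergence analysis are in the thesis): each client approximately solves a proximal subproblem around the global model, and the server moves the global model toward the personalized solutions.

```python
import numpy as np

def personalize(w_global, grad_fi, lam=1.0, lr=0.05, steps=50):
    """Approximately solve min_theta f_i(theta) + lam/2 * ||theta - w_global||^2."""
    theta = w_global.copy()
    for _ in range(steps):
        theta -= lr * (grad_fi(theta) + lam * (theta - w_global))
    return theta

# Hypothetical clients with quadratic losses f_i(theta) = 0.5 * ||theta - c_i||^2.
centers = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([-1.0, 1.0])]
w = np.zeros(2)
for _ in range(30):
    thetas = [personalize(w, lambda t, c=c: t - c) for c in centers]
    w = w + 0.5 * (np.mean(thetas, axis=0) - w)   # server step toward the personal models
print(w)   # drifts toward the average of the client optima while each client keeps its own theta
```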
APA, Harvard, Vancouver, ISO, and other styles
40

Bahrak, Behnam. "Ex Ante Approaches for Security, Privacy, and Enforcement in Spectrum Sharing." Diss., Virginia Tech, 2013. http://hdl.handle.net/10919/24720.

Full text
Abstract:
Cognitive radios (CRs) are devices that are capable of sensing the spectrum and using its free portions in an opportunistic manner. The free spectrum portions are referred to as white spaces or spectrum holes. It is widely believed that CRs are one of the key enabling technologies for realizing a new regulatory spectrum management paradigm, viz. dynamic spectrum access (DSA). CRs often employ software-defined radio (SDR) platforms that are capable of executing artificial intelligence (AI) algorithms to reconfigure their transmission/reception (TX/RX) parameters to communicate efficiently while avoiding interference with licensed (a.k.a. primary or incumbent) users and unlicensed (a.k.a. secondary or cognitive) users. When different stakeholders share a common resource, such as the case in spectrum sharing, security, privacy, and enforcement become critical considerations that affect the welfare of all stakeholders. Recent advances in radio spectrum access technologies, such as CRs, have made spectrum sharing a viable option for significantly improving spectrum utilization efficiency. However, those technologies have also contributed to exacerbating the difficult problems of security, privacy and enforcement. In this dissertation, we review some of the critical security and privacy threats that impact spectrum sharing. We also discuss ex ante (preventive) approaches which mitigate the security and privacy threats and help spectrum enforcement.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
41

Minelli, Michele. "Fully homomorphic encryption for machine learning." Thesis, Paris Sciences et Lettres (ComUE), 2018. http://www.theses.fr/2018PSLEE056/document.

Full text
Abstract:
Le chiffrement totalement homomorphe permet d’effectuer des calculs sur des données chiffrées sans fuite d’information sur celles-ci. Pour résumer, un utilisateur peut chiffrer des données, tandis qu’un serveur, qui n’a pas accès à la clé de déchiffrement, peut appliquer à l’aveugle un algorithme sur ces entrées. Le résultat final est lui aussi chiffré, et il ne peut être lu que par l’utilisateur qui possède la clé secrète. Dans cette thèse, nous présentons des nouvelles techniques et constructions pour le chiffrement totalement homomorphe qui sont motivées par des applications en apprentissage automatique, en portant une attention particulière au problème de l’inférence homomorphe, c’est-à-dire l’évaluation de modèles cognitifs déjà entrainé sur des données chiffrées. Premièrement, nous proposons un nouveau schéma de chiffrement totalement homomorphe adapté à l’évaluation de réseaux de neurones artificiels sur des données chiffrées. Notre schéma atteint une complexité qui est essentiellement indépendante du nombre de couches dans le réseau, alors que l’efficacité des schéma proposés précédemment dépend fortement de la topologie du réseau. Ensuite, nous présentons une nouvelle technique pour préserver la confidentialité du circuit pour le chiffrement totalement homomorphe. Ceci permet de cacher l’algorithme qui a été exécuté sur les données chiffrées, comme nécessaire pour protéger les modèles propriétaires d’apprentissage automatique. Notre mécanisme rajoute un coût supplémentaire très faible pour un niveau de sécurité égal. Ensemble, ces résultats renforcent les fondations du chiffrement totalement homomorphe efficace pour l’apprentissage automatique, et représentent un pas en avant vers l’apprentissage profond pratique préservant la confidentialité. Enfin, nous présentons et implémentons un protocole basé sur le chiffrement totalement homomorphe pour le problème de recherche d’information confidentielle, c’est-à-dire un scénario où un utilisateur envoie une requête à une base de donnée tenue par un serveur sans révéler cette requête
Fully homomorphic encryption enables computation on encrypted data without leaking any information about the underlying data. In short, a party can encrypt some input data, while another party, that does not have access to the decryption key, can blindly perform some computation on this encrypted input. The final result is also encrypted, and it can be recovered only by the party that possesses the secret key. In this thesis, we present new techniques/designs for FHE that are motivated by applications to machine learning, with a particular attention to the problem of homomorphic inference, i.e., the evaluation of already trained cognitive models on encrypted data. First, we propose a novel FHE scheme that is tailored to evaluating neural networks on encrypted inputs. Our scheme achieves complexity that is essentially independent of the number of layers in the network, whereas the efficiency of previously proposed schemes strongly depends on the topology of the network. Second, we present a new technique for achieving circuit privacy for FHE. This allows us to hide the computation that is performed on the encrypted data, as is necessary to protect proprietary machine learning algorithms. Our mechanism incurs very small computational overhead while keeping the same security parameters. Together, these results strengthen the foundations of efficient FHE for machine learning, and pave the way towards practical privacy-preserving deep learning. Finally, we present and implement a protocol based on homomorphic encryption for the problem of private information retrieval, i.e., the scenario where a party wants to query a database held by another party without revealing the query itself
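To give a feel for computing on ciphertexts, here is a toy additively homomorphic (Paillier) example; it is far weaker than the fully homomorphic schemes designed in the thesis and uses insecure toy primes, but it shows a server adding encrypted values without ever holding the decryption key.

```python
import math, random

p, q = 293, 433                      # insecure toy primes, demo only
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)   # inverse of L(g^lambda mod n^2) modulo n

def encrypt(m):
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n * mu) % n

c1, c2 = encrypt(20), encrypt(22)
c_sum = (c1 * c2) % n2               # homomorphic addition performed on ciphertexts
print(decrypt(c_sum))                # -> 42, computed without decrypting c1 or c2
```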
APA, Harvard, Vancouver, ISO, and other styles
42

Sperandio, Ricardo Carlini. "Time series retrieval using DTW-preserving shapelets." Thesis, Rennes 1, 2019. http://www.theses.fr/2019REN1S061.

Full text
Abstract:
L'établissement de la similarité entre séries temporelles est au cœur de nombreuses tâches d'analyse de données. Les mesures permettant d'établir des similitudes entre les séries temporelles sont spécifiques en ce sens qu'elles doivent pouvoir prendre en compte les différences entre les valeurs constituant la série, ainsi que les distorsions selon l'axe du temps. La mesure de similarité la plus répandue est la mesure Dynamic Time Warping (DTW). Cependant, son calcul est coûteux et son application à des séries temporelles nombreuses et/ou très longues est difficile en pratique. Malgré de nombreuses contributions visant l'accélération de la DTW, réussir son passage à l'échelle de la DTW reste une difficulté majeure. Le travail présenté dans cette thèse s'appuie sur l'idée de transformer les séries temporelles à l'aide de shapelets. Il montre comment des shapelets préservant les mesures DTW peuvent être utilisées dans le contexte spécifique de la recherches de séries temporelles similaires à une série utilisée comme requête, et cela dans un contexte grande échelle. Il s’agit de plonger les séries temporelles dans un espace euclidien construit de telle manière que les distances entre les séries selon la métrique DTW s’y trouvent préservées. Ce manuscrit apporte des contributions majeures : (1) il explique comment les shapelets préservant la DTW peuvent être utilisées dans le contexte spécifique de la recherche de séries temporelles similaires ; (2) il propose des stratégies de sélection de ces shapelets pour faire face à l’échelle, c’est-à-dire pour traiter une collection extrêmement vaste de séries temporelles ; (3) il explique en détail comment gérer les séries temporelles univariées et multivariées, couvrant ainsi tout le spectre des problèmes de recherches et facilitant la moise au point d'applications très diverses. Le coeur de la contribution présentée dans ce manuscrit permet de compenser facilement la complexité du processus de plongement par un jeu sur la précision de la recherche. Des expérimentations utilisant les jeux de données UCR et UEA démontrent l’amélioration considérable des performances par rapport aux techniques de pointe
Establishing the similarity of time series is at the core of many data mining tasks such as time series classification, time series clustering, and time series retrieval, among others. Metrics to establish similarities between time series are specific in the sense that they must be able to take into account the differences in the values making up the series as well as distortions along the timelines. The most popular similarity metric is the Dynamic Time Warping (DTW) measure. However, it is costly to compute, and using it against numerous and/or very long time series is difficult in practice. There have been numerous attempts to accelerate the DTW, yet scaling DTW remains a major difficulty. An elegant research direction proposes to change the representation of time series such that it is much cheaper to establish similarities. This typically relies on an embedding process where vectorial representations of time series are constructed, which then allows their similarity to be estimated using e.g. L2 distances, much faster to compute than DTW. Naturally, the quality of this representation largely depends on the embedding process, and the family of contributions relying on the concept of shapelets proves to work particularly well. Shapelets, and the transform operation materializing the embedding process, were originally proposed for time series classification. Shapelets are independent subsequences extracted or learned from time series to form discriminatory features. Shapelets are used to transform time series into high-dimensional (Euclidean) vectors. Recently, it was proposed to embed time series into a Euclidean space such that the distance in this embedded space well approximates the true DTW. This contribution targets time series clustering. The work presented in this Ph.D. manuscript builds on the idea of transforming time series using shapelets. It shows how shapelets that preserve DTW measures can be used in the specific context of large-scale time series retrieval. This manuscript makes major contributions: (1) it explains how DTW-preserving shapelets can be used in the specific context of time series retrieval; (2) it proposes some shapelet selection strategies in order to cope with scale, that is, in order to deal with extremely large collections of time series; (3) it details how to handle both univariate and multivariate time series, hence covering the whole spectrum of time series retrieval problems. The core of the contribution presented in this manuscript allows us to easily trade off the complexity of the transformation against the accuracy of the retrieval. Experiments using the UCR and the UEA datasets demonstrate the vast performance improvements compared to state-of-the-art techniques.
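The following illustrative snippet (not the thesis pipeline) contrasts a plain DTW computation with comparison in a shapelet-transformed space, where each series is represented by its minimal sliding-window distance to a few subsequences; here the shapelets are hand-picked, whereas the thesis learns DTW-preserving ones.

```python
import numpy as np

def dtw(a, b):
    # Textbook quadratic-time DTW with squared point-wise cost.
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[-1, -1])

def shapelet_transform(series, shapelets):
    # One feature per shapelet: its minimal Euclidean distance over all sliding windows.
    feats = []
    for s in shapelets:
        L = len(s)
        dists = [np.linalg.norm(series[i:i + L] - s) for i in range(len(series) - L + 1)]
        feats.append(min(dists))
    return np.array(feats)

t = np.linspace(0, 2 * np.pi, 50)
query, candidate = np.sin(t), np.sin(t + 0.3)
shapelets = [np.sin(t[:10]), np.cos(t[:10])]            # hand-picked, purely illustrative
print("DTW:", dtw(query, candidate))
print("L2 in shapelet space:", np.linalg.norm(shapelet_transform(query, shapelets)
                                              - shapelet_transform(candidate, shapelets)))
```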
APA, Harvard, Vancouver, ISO, and other styles
43

Babina, Chiara. "Privacy nel contesto location-based services." Bachelor's thesis, Alma Mater Studiorum - Università di Bologna, 2018. http://amslaurea.unibo.it/16199/.

Full text
Abstract:
Nowadays, the privacy of online users is at risk: websites, smartphone applications, and social networks store data that, once assembled, can describe a user's habits, peculiarities, and movements; in short, their identity. The goal of my thesis was to demonstrate how much can be inferred about a user starting from simple data, such as the location, the activity of moving from one place to another, and the time, collected in the Google Maps history. In this work, a model was created through the use of Machine Learning techniques to analyze the patterns, in terms of movements and activities, that characterize a user in their daily life.
APA, Harvard, Vancouver, ISO, and other styles
44

Spolaor, Riccardo. "Security and Privacy Threats on Mobile Devices through Side-Channels Analysis." Doctoral thesis, Università degli studi di Padova, 2018. http://hdl.handle.net/11577/3426796.

Full text
Abstract:
In recent years, mobile devices (such as smartphones and tablets) have become essential tools in everyday life for billions of people all around the world. Users continuously carry such devices with them and use them for daily communication activities and social network interactions. Hence, such devices contain a huge amount of private and sensitive information. For this reason, mobile devices have become popular targets of attacks. In most attack settings, the adversary aims to take local or remote control of a device to access user sensitive information. However, such violations are not easy to carry out since they need to leverage a vulnerability of the system or a careless user (e.g., one who installs a malware app from an unreliable source). A different approach that does not have these shortcomings is side-channel analysis. In fact, side-channels are physical phenomena that can be measured from both inside or outside a device. They are mostly due to the user interaction with a mobile device, but also to the context in which the device is used; hence they can reveal sensitive user information such as identity and habits, environment, and the operating system itself. This approach thus consists of inferring private information that is leaked by a mobile device through a side-channel. Besides, side-channel information is also extremely valuable for enforcing security mechanisms such as user authentication, intrusion detection, and information leak detection. This dissertation investigates novel security and privacy challenges in the analysis of side-channels of mobile devices. The thesis is composed of three parts, each focused on a different side-channel: (i) the usage of network traffic analysis to infer user private information; (ii) the energy consumption of mobile devices during battery recharge as a way to identify a user and as a covert channel to exfiltrate data; and (iii) the possible security application of data collected from built-in sensors in mobile devices to authenticate the user and to evade sandbox detection by malware. In the first part of this dissertation, we consider an adversary who is able to eavesdrop on the network traffic of the device on the network side (e.g., controlling a WiFi access point). The fact that the network traffic is often encrypted makes the attack even more challenging. Our work proves that it is possible to leverage machine learning techniques to identify user activity and apps installed on mobile devices by analyzing the encrypted network traffic they produce. Such insights are becoming a very attractive data gathering technique for adversaries, network administrators, investigators and marketing agencies. In the second part of this thesis, we investigate the analysis of electric energy consumption. In this case, an adversary is able to measure with a power monitor the amount of energy supplied to a mobile device. In fact, we observed that the usage of mobile device resources (e.g., CPU, network capabilities) directly impacts the amount of energy retrieved from the supplier, i.e., the USB port for smartphones and the wall socket for laptops. Leveraging energy traces, we are able to recognize a specific laptop user within a group and detect intruders (i.e., users not belonging to the group). Moreover, we show the feasibility of a covert channel to exfiltrate user data which relies on timed energy consumption bursts. In the last part of this dissertation, we present a side-channel that can be measured within the mobile device itself. Such a channel consists of data collected from the sensors a mobile device is equipped with (e.g., accelerometer, gyroscope). First, we present DELTA, a novel tool that collects data from such sensors and logs user and operating system events. Then, we develop MIRAGE, a framework that relies on sensor data to enhance sandboxes against malware analysis evasion.
In recent years, mobile devices (such as smartphones and tablets) have become essential tools in everyday life for billions of people around the world. Users continuously rely on these devices for daily communication activities and social network interactions. Consequently, such devices hold an enormous amount of private and sensitive information, which makes them popular targets of attacks. In most attacks on mobile devices, the adversary aims to take local or remote control of the device in order to access the user's sensitive information. However, such breaches are not easy to carry out, since they must exploit a system vulnerability or a careless user (e.g., one who installs a malicious app from an untrusted source). A different approach that does not suffer from these shortcomings is side-channel analysis. Side channels are physical phenomena that can be measured from inside or outside a device. They are mainly caused by the user's interaction with a mobile device, but also by the context in which the device is used, and can therefore reveal private information such as the user's identity and habits, the surrounding environment, and the operating system itself. This approach thus consists of inferring private information that leaks from a mobile device through a side channel. Moreover, side-channel information is extremely valuable for strengthening security mechanisms such as user authentication, intrusion detection, and information-theft detection. This thesis studies new security and privacy challenges in the side-channel analysis of mobile devices. It is composed of three parts, each focused on a different side channel: (i) the use of network traffic analysis to infer a user's private information; (ii) the energy consumption of mobile devices while charging the battery as a means to identify a user and as a covert channel to exfiltrate data; and (iii) the possible security applications of data collected from the sensors built into mobile devices, to authenticate the user and to prevent malware from detecting a sandbox. In the first part of this thesis, we consider an adversary able to eavesdrop on the device's network traffic from the network side (e.g., by controlling a WiFi access point). The fact that network traffic is often encrypted makes the attack even more challenging. Our work shows that machine learning techniques can be exploited to identify user activities and the apps installed on mobile devices by analyzing the encrypted network traffic they produce. This kind of inference is becoming a very attractive data-gathering technique for adversaries, network administrators, investigators, and marketing agencies. In the second part of this thesis, we examine the analysis of electrical energy consumption. In this case, an adversary is able to measure, with a power monitor, the amount of energy supplied to a mobile device. Indeed, we observed that the usage of a mobile device's resources (e.g., CPU, network capacity) directly affects the amount of energy drawn, whether from the USB port for smartphones or from the wall socket for laptops. By exploiting energy traces, we are able to recognize a specific laptop user within a group and to detect potential intruders (i.e., users who do not belong to the group). Moreover, we show the feasibility of a covert channel for exfiltrating user data that relies on timed spikes in energy consumption. In the last part of this thesis, we present a side channel that can be measured from within the mobile device itself. This channel consists of data collected from the sensors a mobile device is equipped with (e.g., accelerometer, gyroscope). First, we present DELTA, a novel tool that collects data from such sensors and logs user and operating system events. Then, we present MIRAGE, a framework that relies on sensor data to harden sandboxes against malware analysis evasion.
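The network-traffic part of this work relies on a standard supervised learning pipeline: summarize each encrypted flow by side-channel features such as packet sizes, directions, and inter-arrival times, then train a classifier that maps those features to an app or user activity. The sketch below illustrates that general recipe only; the feature set, the flow representation, and the choice of a random forest are illustrative assumptions rather than the tooling used in the thesis.

# Sketch: classify user activities from encrypted-traffic metadata.
# Feature names and the flow representation are illustrative assumptions,
# not the feature set used in the thesis.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def flow_features(packet_sizes, directions, inter_arrival):
    """Summarize one encrypted flow by simple side-channel statistics."""
    sizes = np.asarray(packet_sizes, dtype=float)
    iat = np.asarray(inter_arrival, dtype=float)
    up = sizes[np.asarray(directions) == 1]      # device -> network
    down = sizes[np.asarray(directions) == 0]    # network -> device
    return [
        sizes.sum(), sizes.mean(), sizes.std(),
        up.sum() if up.size else 0.0,
        down.sum() if down.size else 0.0,
        iat.mean() if iat.size else 0.0,
        iat.std() if iat.size else 0.0,
        len(sizes),
    ]

def train_activity_classifier(flows, labels):
    """flows: (packet_sizes, directions, inter_arrival) per captured flow;
    labels: the activity or app observed during that flow (from a labelled capture)."""
    X = np.array([flow_features(*f) for f in flows])
    y = np.array(labels)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
    return clf.fit(X, y)

Metadata-only features of this kind are exactly what remains observable when payloads are encrypted, which is why such traffic analysis works despite encryption.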
APA, Harvard, Vancouver, ISO, and other styles
45

Rekanar, Kaavya. "Text Classification of Legitimate and Rogue online Privacy Policies : Manual Analysis and a Machine Learning Experimental Approach." Thesis, Blekinge Tekniska Högskola, Institutionen för datalogi och datorsystemteknik, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-13363.

Full text
APA, Harvard, Vancouver, ISO, and other styles
46

Alisic, Rijad. "Privacy of Sudden Events in Cyber-Physical Systems." Licentiate thesis, KTH, Reglerteknik, 2021. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-299845.

Full text
Abstract:
Cyberattacks against critical infrastructures have been a growing problem for the past couple of years. These infrastructures are a particularly desirable target for adversaries, due to their vital importance in society. For instance, a stop in the operation of a critical infrastructure could result in a crippling effect on a nation's economy, security or public health. The reason behind this increase is that critical infrastructures have become more complex, often being integrated with a large network of various cyber components. It is through these cyber components that an adversary is able to access the system and conduct their attacks. In this thesis, we consider methods which can be used as a first line of defence against such attacks for Cyber-Physical Systems (CPS). Specifically, we start by studying how information leaks about a system's dynamics help an adversary to generate attacks that are difficult to detect. In many cases, such attacks can be detrimental to a CPS since they can drive the system to a breaking point without being detected by the operator tasked with securing the system. We show that an adversary can use small amounts of data procured from information leaks to generate these undetectable attacks. In particular, we provide the minimal amount of information that is needed in order to keep the attack hidden even if the operator tries to probe the system for attacks. We design defence mechanisms against such information leaks using the Hammersley-Chapman-Robbins lower bound. With it, we study how information leakage can be mitigated through corruption of the data by injection of measurement noise. Specifically, we investigate how information about structured input sequences, which we call events, can be obtained through the output of a dynamical system and how this leakage depends on the system dynamics. For example, it is shown that a system with fast dynamical modes tends to disclose more information about an event compared to a system with slower modes. However, a slower system leaks information over a longer time horizon, which means that an adversary who starts to collect information long after the event has occurred might still be able to estimate it. Additionally, we show how sensor placements can affect the information leak. These results are then used to aid the operator in detecting privacy vulnerabilities in the design of a CPS. Based on the Hammersley-Chapman-Robbins lower bound, we provide additional defensive mechanisms that can be deployed by an operator online to minimize information leakage. For instance, we propose a method to modify the structured inputs in order to maximize the usage of the existing noise in the system. This mechanism allows us to explicitly deal with the privacy-utility trade-off, which is of interest when optimal control problems are considered. Finally, we show how the adversary's certainty of the event increases as a function of the number of samples they collect. For instance, we provide sufficient conditions for when their estimation variance starts to converge to its final value. This information can be used by an operator to estimate when possible attacks from an adversary could occur, and change the CPS before that, rendering the adversary's collected information useless.
In recent years, cyberattacks against critical infrastructures have been a growing problem. These infrastructures are particularly exposed to cyberattacks because they fulfil functions that are necessary for society to work, which makes them desirable targets for an attacker. If a critical infrastructure is prevented from fulfilling its function, the consequences can be devastating for, e.g., a nation's economy, security, or public health. The reason the number of attacks has increased is that critical infrastructures have become ever more complex, since they are now part of large networks containing many kinds of cyber components. It is precisely through these cyber components that an attacker can gain access to the system and stage cyberattacks. In this thesis, we develop methods that can be used as a first line of defence against cyberattacks on cyber-physical systems (CPS). We begin by investigating how information leaks about the system dynamics can help an attacker create attacks that are difficult to detect. Such attacks are often devastating for a CPS, since an attacker can push the system to its breaking point without being detected by the operator whose task is to keep the system running. We prove that an attacker can use relatively small amounts of data to generate these hard-to-detect attacks. More specifically, we derive an expression for the minimum amount of information required for an attack to remain hard to detect, even when the operator adopts methods to probe the system for attacks. In the thesis, we construct defence methods against information leaks using the Hammersley-Chapman-Robbins inequality. With this inequality, we can study how the information leak can be attenuated by injecting noise into the data. Specifically, we investigate how much information about structured inputs to a dynamical system, which we call events, an attacker can extract from its outputs, and how this amount of information depends on the system dynamics. For example, we show that a system with fast dynamics leaks more information than a slower system. On the other hand, for slower systems the information is spread out over a longer time interval, so an attacker who starts eavesdropping on the system long after the event has occurred may still be able to estimate it. We also show how the sensor placement in a CPS affects the information leak. These results can be used to help an operator analyze the privacy of a CPS. We also use the Hammersley-Chapman-Robbins inequality to develop defences against information leaks that can be deployed online. We propose modifications to the structured input so that the system's existing noise is better exploited to hide the event. If the operator has other objectives to fulfil with the control, this method can be used to manage the trade-off between privacy and those objectives. Finally, we show how an attacker's estimate of the event improves as a function of the amount of data they obtain. The operator can use this information to determine when the attacker might be ready to attack the system, and then change the system before that happens, rendering the attacker's information useless.
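The defence mechanisms described in this abstract are built around the Hammersley-Chapman-Robbins (HCR) lower bound, a Cramér-Rao-type bound that does not require a differentiable likelihood. As context only (notation chosen here for illustration, not taken from the thesis), its standard scalar form states that any unbiased estimator \hat{\tau}(Y) of a parameter \tau, computed from observations Y with density p_\tau, satisfies

\[
\operatorname{Var}_{\tau}\big(\hat{\tau}(Y)\big) \;\ge\; \sup_{\Delta \neq 0} \frac{\Delta^{2}}{\mathbb{E}_{\tau}\!\left[\left(\frac{p_{\tau+\Delta}(Y)}{p_{\tau}(Y)} - 1\right)^{2}\right]}.
\]

The denominator is the chi-squared divergence between the output distributions under \tau + \Delta and \tau, so injecting measurement noise that makes these distributions harder to distinguish raises the bound, and with it the estimation error that any adversary observing the outputs must incur.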


APA, Harvard, Vancouver, ISO, and other styles
47

Baier, Lucas [Verfasser], and G. [Akademischer Betreuer] Satzger. "Concept Drift Handling in Information Systems: Preserving the Validity of Deployed Machine Learning Models / Lucas Baier ; Betreuer: G. Satzger." Karlsruhe : KIT-Bibliothek, 2021. http://d-nb.info/1241189250/34.

Full text
APA, Harvard, Vancouver, ISO, and other styles
48

Wang, Yu-Xiang. "New Paradigms and Optimality Guarantees in Statistical Learning and Estimation." Research Showcase @ CMU, 2017. http://repository.cmu.edu/dissertations/1113.

Full text
Abstract:
Machine learning (ML) has become one of the most powerful classes of tools for artificial intelligence, personalized web services, and data science problems across fields. Within the field of machine learning itself, there have been quite a number of paradigm shifts caused by the explosion of data size, computing power, modeling tools, and the new ways people collect, share, and make use of data sets. Data privacy, for instance, was much less of a problem before the availability of personal information online that could be used to identify users in anonymized data sets. Images, videos, as well as observations generated over social networks, often have highly localized structures that cannot be captured by standard nonparametric models. Moreover, the "common task framework" adopted by many sub-disciplines of AI has made it possible for many people to collaboratively and repeatedly work on the same data set, leading to implicit overfitting on public benchmarks. In addition, data collected in many internet services, e.g., web search and targeted ads, are not iid, but rather feedback specific to the deployed algorithm. This thesis presents technical contributions under a number of new mathematical frameworks that are designed to partially address these new paradigms.
• Firstly, we consider the problem of statistical learning with privacy constraints. Under Vapnik's general learning setting and the formalism of differential privacy (DP), we establish simple conditions that characterize private learnability, which reveal a mixture of positive and negative insights. We then identify generic methods that reuse existing randomness to effectively solve private learning in practice, and we discuss weaker notions of privacy that allow for a more favorable privacy-utility trade-off.
• Secondly, we develop a few generalizations of trend filtering, a locally adaptive nonparametric regression technique that is minimax in 1D, to the multivariate setting and to graphs. We also study specific instances of the problem more closely, e.g., total variation denoising on d-dimensional grids, and the results reveal interesting statistical-computational trade-offs.
• Thirdly, we investigate two problems in sequential interactive learning: (a) off-policy evaluation in contextual bandits, which aims to use data collected from one algorithm to evaluate the performance of a different algorithm; (b) the problem of adaptive data analysis, which uses randomization to prevent adversarial data analysts from a form of "p-hacking" through multiple steps of sequential data access.
In the above problems, we provide not only performance guarantees of algorithms but also certain notions of optimality. Wherever applicable, careful empirical studies on synthetic and real data are also included.
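One of the sub-problems listed above, off-policy evaluation in contextual bandits, admits a compact illustration: the inverse propensity scoring (IPS) estimator reweights logged rewards by the ratio of target-policy to logging-policy action probabilities. The sketch below is a generic textbook-style illustration of that estimator, not the estimators studied in the thesis; the function and variable names are invented for the example.

# Sketch: inverse propensity scoring (IPS) for off-policy evaluation
# in contextual bandits. Logged data: (context, action, reward, propensity),
# where propensity is the logging policy's probability of the logged action.
import numpy as np

def ips_value(logged, target_policy):
    """Estimate the value of target_policy from data logged by another policy.

    logged: iterable of (context, action, reward, propensity) tuples.
    target_policy(context, action) -> probability the target policy
    would choose `action` in `context`.
    """
    weighted_rewards = []
    for context, action, reward, propensity in logged:
        w = target_policy(context, action) / propensity  # importance weight
        weighted_rewards.append(w * reward)
    # Plain IPS is unbiased; normalizing by the sum of weights instead of the
    # sample count gives the self-normalized variant with lower variance.
    return float(np.mean(weighted_rewards))

# Toy usage: a uniform logging policy over two actions, and a target policy
# that always picks action 1 (whose true expected reward is 1.0).
rng = np.random.default_rng(0)
logged = [(None, a, float(a == 1), 0.5) for a in rng.integers(0, 2, size=1000)]
print(ips_value(logged, lambda ctx, a: 1.0 if a == 1 else 0.0))  # close to 1.0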
APA, Harvard, Vancouver, ISO, and other styles
49

Hathurusinghe, Rajitha. "Building a Personally Identifiable Information Recognizer in a Privacy Preserved Manner Using Automated Annotation and Federated Learning." Thesis, Université d'Ottawa / University of Ottawa, 2020. http://hdl.handle.net/10393/41011.

Full text
Abstract:
This thesis explores the training of a deep neural network based named entity recognizer in an end-to-end privacy-preserved setting, where dataset creation and model training happen with minimal manual intervention. As the accuracy of deep learning models on practical tasks improves, a rising concern is satisfying their demand for training data amidst concerns about data privacy. Several data protection scenarios, and the legal guidelines to enforce them, have been proposed in the recent past in response to public concern. A promising new development is decentralized model training on isolated datasets, which eliminates the privacy compromise of handing data to a centralized entity. However, in this federated setting, curating the data source is still a privacy risk, particularly for unstructured data such as text. We explore the feasibility of automatic dataset annotation for a Named Entity Recognition (NER) task and of training a deep learning model with it in two federated learning settings. We explore the feasibility of using a dataset created in this manner to fine-tune a state-of-the-art deep learning language model for the downstream task of named entity recognition. We also examine how this novel combination of a deep learning NLP model and federated learning deviates from the classical centralized setting. We created an automatically annotated dataset containing around 80,000 sentences, a manually annotated human test set, and tools to extend the dataset with more manual annotations. We observed that the noise from automated annotation can be overcome to a degree by increasing the dataset size. We also contributed state-of-the-art NLP model developments to the federated learning framework. Overall, our NER model achieved an F1-score of around 0.80 for the recognition of entities in sentences.
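Federated settings of the kind described above typically follow the federated averaging pattern: each client fine-tunes a local copy of the model on its private annotated sentences, and only the resulting parameters are aggregated centrally. The sketch below shows that aggregation step in a generic form; it assumes plain NumPy parameter arrays and placeholder training functions, and is not the thesis's actual framework or model.

# Sketch: one round of federated averaging over client model weights.
# Weights are represented as lists of NumPy arrays (one per layer); in the NER
# use case each client would hold only its own annotated sentences.
import numpy as np

def local_update(global_weights, client_data, train_fn):
    """Client side: start from the global weights, train locally, return new weights."""
    local = [w.copy() for w in global_weights]
    return train_fn(local, client_data)  # e.g., a few epochs of SGD on local data

def federated_average(client_weights, client_sizes):
    """Server side: average each layer, weighting clients by their dataset size."""
    total = float(sum(client_sizes))
    averaged = []
    for layer in range(len(client_weights[0])):
        layer_avg = sum(
            (n / total) * weights[layer]
            for weights, n in zip(client_weights, client_sizes)
        )
        averaged.append(layer_avg)
    return averaged

# One communication round (train_fn and clients are placeholders):
# updates = [local_update(global_w, data, train_fn) for data in clients]
# global_w = federated_average(updates, [len(d) for d in clients])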
APA, Harvard, Vancouver, ISO, and other styles
50

Tout, Hicham Refaat. "Measuring the Impact of email Headers on the Predictive Accuracy of Machine Learning Techniques." NSUWorks, 2013. http://nsuworks.nova.edu/gscis_etd/325.

Full text
Abstract:
The majority of documented phishing attacks have been carried out by email, yet few studies have measured the impact of email headers on the predictive accuracy of machine learning techniques in detecting email phishing attacks. Research has shown that including a limited subset of email headers as features when training machine learning algorithms to detect phishing attacks did increase the predictive accuracy of these algorithms. The same research also recommended further investigation of the impact of including an expanded set of email headers on predictive accuracy. In addition, research has shown that the cost of misclassifying legitimate emails as phishing attacks (false positives) was far higher than that of misclassifying phishing emails as legitimate (false negatives), while the opposite was true in the case of fraud detection. Consequently, it was recommended that cost-sensitive measures be taken in order to further improve the weighted predictive accuracy of machine learning algorithms. Motivated by the potentially high impact of including email headers on the predictive accuracy of machine learning algorithms, and by the significance of enabling cost-sensitive measures as part of the learning process, the goal of this research was to quantify the impact of including an extended set of email headers and to investigate the impact of imposing a penalty, as part of the learning process, on the number of false positives. It was believed that if email headers were included and cost-sensitive measures were taken as part of the learning process, then the overall weighted predictive accuracy of the machine learning algorithm would improve. The results showed that adding email headers as features did improve the overall predictive accuracy of machine learning algorithms, and that cost-sensitive measures taken as part of the learning process did result in fewer false positives.
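The two ingredients in that abstract, header-derived features and a penalty on false positives, map directly onto standard tooling: header fields can be vectorized as text features and the asymmetric misclassification costs expressed as class weights. The sketch below is a generic illustration using scikit-learn with invented header strings and an assumed 10:1 cost ratio; it is not the feature set or cost model used in the study.

# Sketch: cost-sensitive phishing detection from email header text.
# Class weights penalize false positives (legitimate mail flagged as phishing)
# more heavily than false negatives; the 10:1 ratio is an assumption.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: concatenated header fields per message, label 1 = phishing.
headers = [
    "From: support@paypa1-security.example Reply-To: attacker@mail.example Received: unknown-host",
    "From: alice@company.example Received: mail.company.example X-Mailer: Outlook",
    "From: it-helpdesk@secure-login.example Reply-To: collector@mail.example Received: bulk-sender",
    "From: bob@university.example Received: smtp.university.example List-Unsubscribe: mailto:unsub@university.example",
]
labels = [1, 0, 1, 0]

# class_weight makes errors on class 0 (legitimate) ten times as costly as on
# class 1, pushing the classifier to avoid false positives.
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(class_weight={0: 10.0, 1: 1.0}, max_iter=1000),
)
model.fit(headers, labels)
print(model.predict(["From: admin@free-gift.example Reply-To: attacker@mail.example"]))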
APA, Harvard, Vancouver, ISO, and other styles