Dissertations / Theses on the topic 'Data detection'


Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 dissertations / theses for your research on the topic 'Data detection.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse dissertations / theses from a wide variety of disciplines and organise your bibliography correctly.

1

Weis, Melanie. "Duplicate detection in XML data." Duisburg Köln WiKu, 2007. http://d-nb.info/987676849/04.

2

Cao, Lei. "Outlier Detection In Big Data." Digital WPI, 2016. https://digitalcommons.wpi.edu/etd-dissertations/82.

Abstract:
The dissertation focuses on scaling outlier detection to work both on huge static datasets and on dynamic streaming datasets. Outliers are patterns in the data that do not conform to the expected behavior. Outlier detection techniques are broadly applied in applications ranging from credit fraud prevention and network intrusion detection to stock investment tactical planning. For such mission-critical applications, a timely response is often of paramount importance. Yet the processing of outlier detection requests is computationally complex and resource-intensive. In this dissertation we investigate the challenges of detecting outliers in big data -- in particular those caused by the high velocity of streaming data, the big volume of static data and the large cardinality of the input parameter space for tuning outlier mining algorithms. Effective optimization techniques are proposed to assure the responsiveness of outlier detection in big data. In this dissertation we first propose a novel optimization framework called LEAP to continuously detect outliers over data streams. The continuous discovery of outliers is critical for a large range of online applications that monitor high-volume, continuously evolving streaming data. LEAP encompasses two general optimization principles that utilize the rarity of the outliers and the temporal priority relationships among stream data points. Leveraging these two principles, LEAP not only is able to continuously deliver outliers with respect to a set of popular outlier models, but also provides near real-time support for processing powerful outlier analytics workloads composed of large numbers of outlier mining requests with various parameter settings. Second, we develop a distributed approach to efficiently detect outliers over massive-scale static data sets. In this big data era, as the volume of the data advances to new levels, the power of distributed compute clusters must be employed to detect outliers in a short turnaround time. In this research, our approach optimizes key factors determining the efficiency of distributed data analytics, namely, communication costs and load balancing. In particular we prove that the traditional frequency-based load balancing assumption is not effective. We thus design a novel cost-driven data partitioning strategy that achieves load balancing. Furthermore, we abandon the traditional one-detection-algorithm-for-all-compute-nodes approach and instead propose a novel multi-tactic methodology which adaptively selects the most appropriate algorithm for each node based on the characteristics of the data partition assigned to it. Third, traditional outlier detection systems process each individual outlier detection request, instantiated with a particular parameter setting, one at a time. This is not only prohibitively time-consuming for large datasets, but also tedious for analysts as they explore the data to home in on the most appropriate parameter setting or on the desired results. We thus design an interactive outlier exploration paradigm that is not only able to answer traditional outlier detection requests in near real-time, but also offers innovative outlier analytics tools to assist analysts to quickly extract, interpret and understand the outliers of interest. Our experimental studies, including performance evaluation and user studies conducted on real-world datasets including stock, sensor, moving object, and geolocation datasets, confirm both the effectiveness and efficiency of the proposed approaches.
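The distance-threshold outlier model that frameworks like LEAP evaluate over sliding windows can be stated compactly. The sketch below is our illustration of that underlying model (not the dissertation's code): a point is flagged when it has fewer than k neighbors within radius r in the current window snapshot.

```python
import numpy as np

def window_outliers(window: np.ndarray, r: float, k: int) -> np.ndarray:
    """Return a boolean outlier mask for an (n, d) sliding-window snapshot."""
    # Pairwise Euclidean distances between all points in the window.
    diff = window[:, None, :] - window[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Count neighbors within r, excluding the point itself (the zero diagonal).
    neighbor_counts = (dist <= r).sum(axis=1) - 1
    return neighbor_counts < k

rng = np.random.default_rng(0)
stream = rng.normal(size=(200, 3))
stream[42] += 8.0                                   # inject one obvious outlier
print(np.flatnonzero(window_outliers(stream, r=1.5, k=5)))  # expected to include 42
```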
3

Abghari, Shahrooz. "Data Modeling for Outlier Detection." Licentiate thesis, Blekinge Tekniska Högskola, Institutionen för datalogi och datorsystemteknik, 2018. http://urn.kb.se/resolve?urn=urn:nbn:se:bth-16580.

Abstract:
This thesis explores data modeling for outlier detection techniques in three different application domains: maritime surveillance, district heating, and online media and sequence datasets. The proposed models are evaluated and validated under different experimental scenarios, taking into account specific characteristics and setups of the different domains. Outlier detection has been studied and applied in many domains. Outliers arise due to different reasons such as fraudulent activities, structural defects, health problems, and mechanical issues. The detection of outliers is a challenging task that can reveal system faults and fraud, and save people's lives. Outlier detection techniques are often domain-specific. The main challenge in outlier detection relates to modeling the normal behavior in order to identify abnormalities. The choice of model is important; an incorrect choice of data model can lead to poor results. This requires a good understanding and interpretation of the data, the constraints, and the requirements of the problem domain. Outlier detection is largely an unsupervised problem because labeled data is often unavailable or expensive to obtain. We have studied and applied a combination of machine learning and data mining techniques to build data-driven and domain-oriented outlier detection models. We have shown the importance of data preprocessing as well as feature selection in building suitable methods for data modeling. We have taken advantage of both supervised and unsupervised techniques to create hybrid methods. For example, we have proposed a rule-based outlier detection system based on open data for the maritime surveillance domain. Furthermore, we have combined cluster analysis and regression to identify manual changes in the heating systems at the building level. Sequential pattern mining for identifying contextual and collective outliers in online media data has also been exploited. In addition, we have proposed a minimum spanning tree clustering technique for detecting groups of outliers in online media and sequence data, as sketched below. The proposed models have been shown to be capable of explaining the underlying properties of the detected outliers. This can help domain experts narrow down the scope of analysis and understand the reasons for such anomalous behaviors. We have also investigated the reproducibility of the proposed models in similar application domains.
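A minimal sketch of minimum-spanning-tree clustering for group outlier detection, as referenced above; this is our reading of the general technique (Euclidean distances, illustrative parameters), not the thesis implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial import distance_matrix

def mst_outlier_groups(X, n_cuts=2, min_size=5):
    """Cut the longest MST edges; tiny remaining components are outlier groups."""
    mst = minimum_spanning_tree(csr_matrix(distance_matrix(X, X))).toarray()
    rows, cols = np.nonzero(mst)
    order = np.argsort(mst[rows, cols])[::-1]        # edge indices, longest first
    for i in order[:n_cuts]:                         # remove the n_cuts longest edges
        mst[rows[i], cols[i]] = 0.0
    n_comp, labels = connected_components(csr_matrix(mst), directed=False)
    sizes = np.bincount(labels, minlength=n_comp)
    return [np.flatnonzero(labels == c) for c in range(n_comp) if sizes[c] < min_size]

X = np.vstack([np.random.default_rng(4).normal(size=(60, 2)),
               [[8.0, 8.0], [8.2, 8.1]]])            # a small far-away group
print(mst_outlier_groups(X, n_cuts=1, min_size=5))   # expected: [array([60, 61])]
```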
4

Payne, Timothy Myles. "Remote detection using fused data." Title page, abstract and table of contents only, 1994. http://web4.library.adelaide.edu.au/theses/09PH/09php3465.pdf.

5

Forstén, Andreas. "Unsupervised Anomaly Detection in Receipt Data." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-215161.

Abstract:
With the progress of data handling methods and computing power comes the possibility of automating tasks that need not necessarily be performed by humans. This study was done in cooperation with a company that digitizes receipts for businesses. We investigate the possibility of automating the task of finding anomalous receipt data, which could automate the work of receipt auditors. We study both anomalous user behaviour and individual receipts. The results indicate that automation is possible, which may reduce the necessity of human inspection of receipts.
6

Tian, Xuwen (田旭文). "Data-driven textile flaw detection methods." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 2012. http://hdl.handle.net/10722/196091.

Abstract:
This research develops three efficient textile flaw detection methods to facilitate automated textile inspection for the textile-related industries. Their novelty lies in detecting flaws with knowledge directly extracted from textile images, unlike existing methods, which detect flaws with empirically specified texture features. The first two methods treat textile flaw detection as a texture classification problem, and consider that defect-free images of a textile fabric normally possess common latent images, called basis-images. The inner product of a basis-image and an image acquired from this fabric is a feature value of this fabric image. As the defect-free images are similar, their feature values gather in a cluster, whose boundary can be determined by using the feature values of known defect-free images. A fabric image is considered defect-free if its feature values lie within this boundary. These methods extract the basis-images from known defect-free images in a training process, and require less consideration than existing methods of how well a textile matches the texture features specified for it. One method uses matrix singular value decomposition (SVD) to extract basis-images containing the spatial relationship of pixels in rows or in columns; a sketch of this variant is given below. The alternative method uses tensor decomposition to find the relationship of pixels in both rows and columns within each training image and the common relationship of these training images. Tensor decomposition is found to be superior to matrix SVD in finding the basis-images needed to represent these defect-free images, because extracting and decomposing the tri-lateral relationship usually generates better basis-images. The third method solves the textile flaw detection problem by means of texture segmentation, and is suitable for online detection because it does not require texture features specified by experience or found from known defect-free images. The method detects the presence of flaws by using the contrast between regions in the feature images of a textile image. These feature images are the output of a filter bank consisting of Gabor filters at multiple scales and orientations. This method selects the feature image with maximal image contrast, and partitions this image into regions with the morphological watershed transform to facilitate faster searching of defect-free regions and to remove isolated pixels with exceptional feature values. Regions with no flaws have similar statistics, e.g. similar means. Regions with significantly dissimilar statistics may contain flaws and are removed iteratively from the set which initially contains all regions. Removing regions uses thresholds determined by the Neyman-Pearson criterion and updated along with the remaining regions in the set. This procedure continues until the set contains only defect-free regions. The occurrence of the removed regions indicates the presence of flaws, whose extents are decided by pixel classification using the thresholds derived from the defect-free regions. A prototype textile inspection system is built to demonstrate the automatic textile inspection process. The developed methods are proved reliable and effective by testing them with a variety of defective textile images. These methods also have several advantages; e.g. less empirical knowledge of textiles is needed for selecting texture features.
Doctor of Philosophy, Industrial and Manufacturing Systems Engineering
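A simplified sketch of the matrix-SVD variant referenced above, under our own simplifying assumptions (defect-free images stacked as vectors, a fixed number of basis-images, and a simple padded bounding box instead of the thesis's exact cluster boundary):

```python
import numpy as np

def train_basis(defect_free_images, k=5):
    """Learn k basis-images and the feature-value bounds of defect-free fabric."""
    A = np.stack([im.ravel() for im in defect_free_images]).astype(float)
    mean = A.mean(axis=0)
    _, _, Vt = np.linalg.svd(A - mean, full_matrices=False)
    basis = Vt[:k]                           # each row is one basis-image
    feats = (A - mean) @ basis.T             # training feature values (inner products)
    return mean, basis, feats.min(axis=0), feats.max(axis=0)

def is_defect_free(image, mean, basis, lo, hi, margin=0.1):
    """Accept the image if its feature values stay inside the padded training box."""
    f = (image.ravel().astype(float) - mean) @ basis.T
    pad = margin * (hi - lo)
    return bool(np.all((f >= lo - pad) & (f <= hi + pad)))

rng = np.random.default_rng(0)
texture = rng.random((16, 16))
training = [texture + rng.normal(scale=0.02, size=(16, 16)) for _ in range(20)]
mean, basis, lo, hi = train_basis(training)
flawed = texture.copy()
flawed[4:8, 4:8] = 1.0                                  # simulated flaw patch
print(is_defect_free(texture, mean, basis, lo, hi),     # expected: True
      is_defect_free(flawed, mean, basis, lo, hi))      # expected: False
```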
7

Siddiqui, Muazzam. "Data mining methods for malware detection." Doctoral diss., University of Central Florida, 2008. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/2783.

Abstract:
This research investigates the use of data mining methods for malware (malicious program) detection and proposes a framework as an alternative to traditional signature-based detection methods. Traditional approaches that use signatures to detect malicious programs fail for new and unknown malware, for which signatures are not available. We present a data mining framework to detect malicious programs. We collected, analyzed and processed several thousand malicious and clean programs to find the best features and build models that can classify a given program into a malware or a clean class. Our research is closely related to information retrieval and classification techniques and borrows a number of ideas from the field. We used a vector space model to represent the programs in our collection (a toy illustration of this idea follows below). Our data mining framework includes two separate and distinct classes of experiments. The first are the supervised learning experiments that used a dataset consisting of several thousand malicious and clean program samples to train, validate and test an array of classifiers. In the second class of experiments, we proposed using sequential association analysis for feature selection and automatic signature extraction. With our experiments, we were able to achieve a detection rate as high as 98.4% and a false positive rate as low as 1.9% on novel malware.
Ph.D., Modeling and Simulation, Sciences
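A toy illustration of the vector space model idea referenced above (ours, not the dissertation's pipeline); the hex strings and labels below are fabricated placeholders, and a real system would extract features from actual binaries.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical data: hex dumps of programs and their labels (1 = malicious).
programs = ["4d5a9000 0003", "4d5a5045 e8ff", "cafebabe 0034", "feedface 00aa"]
labels = [1, 1, 0, 0]

model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 4)),  # character n-grams as features
    LogisticRegression(max_iter=1000),
)
model.fit(programs, labels)
print(model.predict(["4d5a9000 e8ff"]))   # classify an unseen sample
```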
8

Mohd, Ali Azliza. "Anomalous behaviour detection using heterogeneous data." Thesis, Lancaster University, 2018. http://eprints.lancs.ac.uk/125026/.

Abstract:
Anomaly detection is one of the most important methods to process and find abnormal data, as it can distinguish between normal and abnormal behaviour. Anomaly detection has been applied in many areas such as the medical sector, fraud detection in finance, fault detection in machines, intrusion detection in networks, surveillance systems for security, as well as forensic investigations. Abnormal behaviour can give information or answer questions when an investigator is performing an investigation. Anomaly detection is one way to simplify big data by focusing on data that have been grouped or clustered by the anomaly detection method. Forensic data usually consist of heterogeneous data which have several forms or types such as qualitative or quantitative, structured or unstructured, and primary or secondary. For example, when a crime takes place, the evidence can be in the form of various types of data. The combination of all the data types can produce rich information insights. Nowadays, data has become 'big' because it is generated every second of every day, and processing has become time-consuming and tedious. Therefore, in this study, a new method to detect abnormal behaviour is proposed using heterogeneous data, combining the data with a data fusion technique. VAST Challenge data and image data are applied to demonstrate the heterogeneous data. The first contribution of this study is applying the heterogeneous data to detect an anomaly. The recently introduced anomaly detection technique known as Empirical Data Analytics (EDA) is applied to detect the abnormal behaviour based on the data sets. Standardised eccentricity (a newly introduced EDA measure offering a new simplified form of the well-known Chebyshev inequality) can be applied to any data distribution; a small sketch follows below. Then, the second contribution is applying image data. The image data is processed using a pre-trained deep learning network, and classification is done using a support vector machine (SVM). After that, the last contribution is combining the anomaly results from heterogeneous data and image recognition using a new data fusion technique. There are five types of data with three different modalities and different dimensionalities. The data cannot simply be combined and integrated. Therefore, the new data fusion technique first analyses the abnormality in each data type separately, determines a degree of suspicion between 0 and 1, and sums up all the degrees of suspicion afterwards. This method is not intended to be a fully automatic system that resolves investigations, which would likely be unacceptable in any case. The aim is rather to simplify the role of the humans so that they can focus on a small number of cases to be looked at in more detail. The proposed approach does simplify the processing of such huge amounts of data. Later, this method can assist human experts in their investigations and in making final decisions.
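For Euclidean distance, the standardised eccentricity mentioned above reduces to a widely quoted simplified form; the sketch below is our paraphrase of that form (not the thesis code), with the Chebyshev-style rule flagging a point when its eccentricity exceeds m² + 1 for an m-sigma test.

```python
import numpy as np

def standardized_eccentricity(X):
    """Eccentricity in its simplified Euclidean form: 1 + ||x - mu||^2 / sigma^2."""
    mu = X.mean(axis=0)
    var = ((X - mu) ** 2).sum(axis=1).mean()   # mean squared distance to the mean
    return 1.0 + ((X - mu) ** 2).sum(axis=1) / var

X = np.random.default_rng(1).normal(size=(500, 2))
X[0] = [6.0, 6.0]                              # inject an anomaly
ecc = standardized_eccentricity(X)
m = 3.0
print(np.flatnonzero(ecc > m ** 2 + 1))        # 3-sigma-style anomalies; expected: [0]
```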
9

Pellissier, Muriel. "Anomaly detection technique for sequential data." Thesis, Grenoble, 2013. http://www.theses.fr/2013GRENM078/document.

Abstract:
Nowadays, huge quantities of data are easily accessible, but all these data are not useful if we do not know how to process them efficiently and how to easily extract relevant information from a large quantity of data. Anomaly detection techniques are used in many domains to help process data in an automated way. These techniques depend on the application domain, on the type of data, and on the type of anomaly. For this study we are interested only in sequential data. A sequence is an ordered list of items, also called events. Identifying irregularities in sequential data is essential for many application domains like DNA sequences, system calls, user commands, banking transactions etc. This thesis presents a new approach for identifying and analyzing irregularities in sequential data. This anomaly detection technique can detect anomalies in sequential data where the order of the items in the sequences is important. Moreover, our technique considers not only the order of the events, but also the position of the events within the sequences. A sequence is spotted as anomalous if it is quasi-identical to a usual behavior, which means the sequence is slightly different from a frequent (common) sequence. The differences between two sequences are based on the order of the events and their position in the sequence. In this thesis we applied this technique to maritime surveillance, but it can be used by any other domain that uses sequential data. For maritime surveillance, automated tools are needed in order to facilitate the targeting of suspicious containers performed by customs. Indeed, nowadays 90% of world trade is transported by containers, and only 1-2% of containers can be physically checked because of the high financial cost and the large human resources needed to control a container. As the number of containers travelling every day all around the world is very large, it is necessary to control the containers in order to avoid illegal activities like fraud, quota-related offences, illegal products, hidden activities, drug smuggling or arms smuggling. For the maritime domain, we can use this technique to identify suspicious containers by comparing the container trips from the data set with itineraries that are known to be normal (common). A container trip, also called an itinerary, is an ordered list of actions that are done on containers at specific geographical positions. The different actions are: loading, transshipment, and discharging. For each action that is done on a container, we know the container ID and its geographical position (port ID). This technique is divided into two parts. The first part detects the common (most frequent) sequences of the data set. The second part identifies those sequences that are slightly different from the common sequences, using a distance-based method to classify a given sequence as normal or suspicious. The distance is calculated with a method that combines quantitative and qualitative differences between two sequences, as in the sketch below.
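A condensed sketch of the two-stage idea (ours, not the thesis implementation): frequent itineraries define normality, and a trip is suspicious when it is within a small edit distance of a frequent itinerary without being frequent itself. The port/action names in the usage example are hypothetical.

```python
from collections import Counter

def edit_distance(a, b):
    """Dynamic-programming Levenshtein distance over event lists; substitutions
    capture order/position differences between two itineraries."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def suspicious(trips, min_support=3, max_dist=1):
    """Trips close to, but not among, the frequent itineraries."""
    freq = [t for t, c in Counter(map(tuple, trips)).items() if c >= min_support]
    return [t for t in trips
            if tuple(t) not in freq
            and any(edit_distance(t, f) <= max_dist for f in freq)]

trips = [["ShanghaiLoad", "RotterdamTranship", "HamburgDischarge"]] * 4
trips.append(["ShanghaiLoad", "OdessaTranship", "HamburgDischarge"])  # quasi-identical
print(suspicious(trips))        # flags only the deviating itinerary
```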
10

Al-Bataineh, Hussien Suleiman. "Islanding Detection Using Data Mining Techniques." Thesis, North Dakota State University, 2015. https://hdl.handle.net/10365/27634.

Abstract:
Connection of distributed generators (DGs) poses new challenges for operation and management of the distribution system. An important issue is that of islanding, where a part of the system gets disconnected from the main grid while remaining energized by the DG. This thesis explores the use of several data mining and machine learning techniques to detect islanding. Several cases of islanding and non-islanding are simulated with a standard test case: the IEEE 13-bus test distribution system. Different types of DGs are connected to the system and disturbances are introduced. Several classifiers are tested for their effectiveness in identifying islanded conditions under different scenarios. The simulation results show that the random forest classifier consistently outperforms the other methods for a diverse set of operating conditions, within an acceptable time after the onset of islanding. These results strengthen the case for data-driven tools for quick and accurate detection of islanding in microgrids.
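A schematic of the classification setup, with synthetic data and hypothetical feature names (voltage magnitude, frequency deviation, ROCOF and harmonic distortion are our placeholders, not the thesis's exact feature set):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
# Hypothetical features per measurement window: voltage magnitude, frequency
# deviation, rate of change of frequency (ROCOF), total harmonic distortion.
X = rng.normal(size=(1000, 4))
# Synthetic labels: islanding loosely driven by frequency-related features.
y = (X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=1000) > 1.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```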
11

Huang, Yuzhou. "Duplicate detection in XML Web data /." View abstract or full-text, 2009. http://library.ust.hk/cgi/db/thesis.pl?CSED%202009%20HUANG.

12

Ghorbani, Soniya. "Anomaly Detection in Electricity Consumption Data." Thesis, Högskolan i Halmstad, Akademin för informationsteknologi, 2017. http://urn.kb.se/resolve?urn=urn:nbn:se:hh:diva-35011.

Abstract:
Distribution grids play an important role in delivering electricity to end users. Electricity customers would like to have a continuous electricity supply without any disturbance. For customers such as airports and hospitals, electricity interruption may have devastating consequences. Therefore, many electricity distribution companies are looking for ways to prevent power outages. Sometimes the power outages are caused from the grid side, such as failure in transformers or a breakdown in power cables because of wind. And sometimes the outages are caused by the customers, such as overload. In fact, a very high peak in electricity consumption and an irregular load profile may cause these kinds of failures. In this thesis, we used an approach consisting of two main steps for detecting customers with irregular load profiles. In the first step, we create a dictionary based on all common load profile shapes using daily electricity consumption for a one-month period. In the second step, the load profile shapes of customers for a specific week are compared with the load patterns in the dictionary. If the electricity consumption for any customer during that week is not similar to any of the load patterns in the dictionary, it will be grouped as an anomaly. In this case, load profile data are transformed to symbols using Symbolic Aggregate approXimation (SAX) and then clustered using hierarchical clustering; a compact sketch follows below. The approach is used to detect anomalies in the weekly load profiles of a data set provided by HEM Nät, a power distribution company located in the south of Sweden.
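As referenced above, a compact sketch of the dictionary idea, assuming 24-hour daily profiles and the standard SAX Gaussian breakpoints (our illustration, not the thesis code):

```python
import numpy as np
from scipy.stats import norm

def sax_word(profile, n_segments=6, alphabet_size=4):
    """Z-normalize, reduce with piecewise aggregate approximation, map to symbols."""
    z = (profile - profile.mean()) / (profile.std() + 1e-9)
    paa = z.reshape(n_segments, -1).mean(axis=1)
    breakpoints = norm.ppf(np.linspace(0, 1, alphabet_size + 1)[1:-1])
    return "".join("abcd"[i] for i in np.searchsorted(breakpoints, paa))

# Toy one-month dictionary of daily load-profile words.
month = [np.sin(np.linspace(0, 2 * np.pi, 24)) for _ in range(30)]
dictionary = {sax_word(np.asarray(day)) for day in month}

test_day = np.r_[np.sin(np.linspace(0, 2 * np.pi, 12)), np.ones(12) * 3]  # odd profile
print(sax_word(test_day) in dictionary)    # False -> grouped as an anomaly
```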
13

Alkharboush, Nawaf Abdullah H. "A data mining approach to improve the automated quality of data." Thesis, Queensland University of Technology, 2014. https://eprints.qut.edu.au/65641/1/Nawaf%20Abdullah%20H_Alkharboush_Thesis.pdf.

Abstract:
This thesis describes the development of a robust and novel prototype to address the data quality problems that relate to the dimension of outlier data. It thoroughly investigates the associated problems with regard to detecting, assessing and determining the severity of the problem of outlier data, and proposes granule-mining-based alternative techniques to significantly improve the effectiveness of mining and assessing outlier data.
14

Thomas, Kim. "Incident detection on arterials using neural network data fusion of simulated probe vehicle and loop detector data /." [St. Lucia, Qld.], 2005. http://www.library.uq.edu.au/pdfserve.php?image=thesisabs/absthe18433.pdf.

15

Frascarelli, Antonio Ezio. "Object Detection." Thesis, Mälardalens högskola, Inbyggda system, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:mdh:diva-28259.

Abstract:
During the last two decades, interest in computer vision has risen steadily, with multiple applications in fields like medical care, automotive, entertainment, retail, industrial, and security. Object detection is part of the recognition problem, which is the most important scope of the computer vision environment. The target of this thesis work is to analyse and propose a solution for object detection in a real-time dynamic environment. RoboCup@Home will be the benchmarking event for this system, which will be equipped on a robot competing in the 2018 event. The system has to be robust and fast enough to allow the robot to react to each environment change in a reasonable amount of time. The input hardware used to achieve such a system comprises a Microsoft Kinect, which provides a high-definition camera and a fast and reliable 3D scanner. Through the study and analysis of state-of-the-art algorithms regarding machine vision and object recognition, the more suitable ones have been tested to optimise the execution on the targeted hardware. Porting of the application to an embedded platform is discussed.
16

Zhang, Ji. "Towards outlier detection for high-dimensional data streams using projected outlier analysis strategy." University of Southern Queensland, Faculty of Sciences, 2008. http://eprints.usq.edu.au/archive/00005645/.

Abstract:
[Abstract]: Outlier detection is an important research problem in data mining that aims to discover useful abnormal and irregular patterns hidden in large data sets. Most existing outlier detection methods only deal with static data of relatively low dimensionality. Recently, outlier detection for high-dimensional stream data became a new emerging research problem. A key observation that motivates this research is that outliers in high-dimensional data are projected outliers, i.e., they are embedded in lower-dimensional subspaces. Detecting projected outliers from high-dimensional stream data is a very challenging task for several reasons. First, detecting projected outliers is difficult even for high-dimensional static data. The exhaustive search for the outlying subspaces where projected outliers are embedded is an NP problem. Second, the algorithms for handling data streams are constrained to take only one pass to process the streaming data, under the conditions of space limitation and time criticality. The currently existing methods for outlier detection are found to be ineffective for detecting projected outliers in high-dimensional data streams. In this thesis, we present a new technique, called the Stream Projected Outlier deTector (SPOT), which attempts to detect projected outliers in high-dimensional data streams. SPOT employs an innovative window-based time model for capturing dynamic statistics from stream data, and a novel data structure containing a set of top sparse subspaces to detect projected outliers effectively. SPOT also employs a multi-objective genetic algorithm as an effective search method for finding the outlying subspaces where most projected outliers are embedded. The experimental results demonstrate that SPOT is efficient and effective in detecting projected outliers for high-dimensional data streams. The main contribution of this thesis is that it provides a backbone in tackling the challenging problem of outlier detection for high-dimensional data streams. SPOT can facilitate the discovery of useful abnormal patterns and can potentially be applied to a variety of high-demand applications, such as sensor network data monitoring, online transaction protection, etc.
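A toy illustration of the projected-outlier notion (ours, far simpler than SPOT itself): project the data onto each 2-D subspace, grid the projection, and flag points that fall into unusually sparse cells of some subspace.

```python
import itertools
import numpy as np

def projected_outliers(X, bins=8, min_count=2):
    """Flag points that land in sparse cells of any 2-D subspace projection."""
    flagged = set()
    for i, j in itertools.combinations(range(X.shape[1]), 2):
        P = X[:, [i, j]]
        edges = [np.linspace(P[:, k].min(), P[:, k].max(), bins + 1) for k in (0, 1)]
        cx = np.clip(np.digitize(P[:, 0], edges[0]) - 1, 0, bins - 1)
        cy = np.clip(np.digitize(P[:, 1], edges[1]) - 1, 0, bins - 1)
        counts = np.zeros((bins, bins), dtype=int)
        np.add.at(counts, (cx, cy), 1)
        flagged |= set(np.flatnonzero(counts[cx, cy] < min_count))
    return sorted(flagged)

rng = np.random.default_rng(9)
X = rng.normal(size=(400, 4))
X[11, [0, 2]] = [4.0, -4.0]     # unremarkable per dimension, sparse in subspace (0, 2)
print(11 in projected_outliers(X))   # True
```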
17

Li, Lishuai. "Anomaly detection in airline routine operations using flight data recorder data." Thesis, Massachusetts Institute of Technology, 2013. http://hdl.handle.net/1721.1/82498.

Abstract:
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Aeronautics and Astronautics, 2013.
This thesis was scanned as part of an electronic thesis pilot project.
Cataloged from PDF version of thesis.
Includes bibliographical references (p. 141-145).
In order to improve safety in current air carrier operations, there is a growing emphasis on proactive safety management systems. These systems identify and mitigate risks before accidents occur. This thesis develops a new anomaly detection approach using routine operational data to support proactive safety management. The research applies cluster analysis to detect abnormal flights based on Flight Data Recorder (FDR) data. Results from cluster analysis are provided to domain experts to verify the operational significance of such anomalies and associated safety hazards. Compared with existing methods, the cluster-based approach is capable of identifying new types of anomalies that were previously unaccounted for. It can help airlines detect early signs of performance deviation, identify safety degradation, deploy predictive maintenance, and train staff accordingly. The first part of the detection approach employs data-mining algorithms to identify flights of interest from FDR data. These data are transformed into a high-dimensional space for cluster analysis, where normal patterns are identified in clusters while anomalies are detected as outliers (a schematic of the general cluster-based idea follows below). Two cluster-based anomaly detection algorithms were developed to explore different transformation techniques: ClusterAD-Flight and ClusterAD-Data Sample. The second part of the detection approach is domain expert review. The review process is to determine whether detected anomalies are operationally significant and whether they represent safety risks. Several data visualization tools were developed to support the review process, which can otherwise be labor-intensive: the Flight Parameter Plots present raw FDR data in informative graphics, and the Flight Abnormality Visualization helps domain experts quickly locate the source of such anomalies. A number of evaluation studies were conducted using airline FDR data. ClusterAD-Flight and ClusterAD-Data Sample were compared with Exceedance Detection, the current method in use by airlines, and MKAD, another anomaly detection algorithm developed at NASA, using a dataset of 25,519 A320 flights. An evaluation of the entire detection approach was conducted with domain experts using a dataset of 10,528 A320 flights. Results showed that both cluster-based detection algorithms were able to identify operationally significant anomalies beyond the capabilities of current methods. Also, domain experts confirmed that the data visualization tools were effective in supporting the review process.
by Lishuai Li.
Ph.D.
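The schematic referenced above of the general cluster-based idea (our sketch, not ClusterAD itself): flights become fixed-length vectors, dimensionality is reduced, and density-based cluster noise is treated as anomalous.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
flights = rng.normal(size=(300, 20 * 5))   # 300 flights: 20 time samples x 5 parameters
flights[7] += 4.0                          # one abnormal flight

reduced = PCA(n_components=10).fit_transform(flights)
labels = DBSCAN(eps=8.0, min_samples=5).fit_predict(reduced)
print("anomalous flights:", np.flatnonzero(labels == -1))   # expected to include 7
```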
18

Draisbach, Uwe, Felix Naumann, Sascha Szott, and Oliver Wonneberg. "Adaptive windows for duplicate detection." Universität Potsdam, 2012. http://opus.kobv.de/ubp/volltexte/2012/5300/.

Abstract:
Duplicate detection is the task of identifying all groups of records within a data set that represent the same real-world entity. This task is difficult, because (i) representations might differ slightly, so some similarity measure must be defined to compare pairs of records, and (ii) data sets might have a high volume, making a pair-wise comparison of all records infeasible. To tackle the second problem, many algorithms have been suggested that partition the data set and compare all record pairs only within each partition. One well-known such approach is the Sorted Neighborhood Method (SNM), which sorts the data according to some key and then advances a window over the data, comparing only records that appear within the same window. We propose several variations of SNM that have in common a varying window size and advancement; a minimal sketch of the general idea follows below. The general intuition of such adaptive windows is that there might be regions of high similarity suggesting a larger window size and regions of lower similarity suggesting a smaller window size. We propose and thoroughly evaluate several adaptation strategies, some of which are provably better than the original SNM in terms of efficiency (same results with fewer comparisons).
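A minimal sketch of sorted-neighborhood duplicate detection with an adaptively grown window, as referenced above; this is our illustration of the general intuition, not one of the paper's specific strategies.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    return SequenceMatcher(None, a, b).ratio() >= threshold

def adaptive_snm(records, key=lambda r: r, base_window=3):
    """Sort by key, slide a window, and grow it while its far edge stays similar."""
    records = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(records):
        w = base_window
        while i + w < len(records) and similar(rec, records[i + w - 1]):
            w += 1                                   # region of high similarity
        for j in range(i + 1, min(i + w, len(records))):
            if similar(rec, records[j]):
                pairs.append((rec, records[j]))      # candidate duplicate pair
    return pairs

print(adaptive_snm(["jon smith", "john smith", "john smyth", "mary jones"]))
```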
19

Hajimohammadi, Hamid Reza. "Classification of Data Series at Vehicle Detection." Thesis, Uppsala University, Department of Information Technology, 2009. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-111163.

Abstract:

This paper proposes a new, simple and lightweight approach, built on previously studied algorithms, for extracting feature vectors that in turn enable one to classify a vehicle based on the shape of its magnetic signature. The algorithm is called ASWA, which stands for Adaptive Spectral and Wavelet Analysis; it combines features of a signal extracted by both spectral and wavelet analysis algorithms. The performance of classifiers using these feature vectors is compared to that of feature vectors consisting of features extracted by the Fourier transform and pattern information of the signal extracted by the Hill-Pattern algorithm (CFTHP). By using ASWA-based feature vectors, there were improvements in the results of all classification algorithms, such as K-Nearest Neighbors (KNN), Support Vector Machine (SVM) and Probabilistic Neural Networks (PNN). However, the best improvement rate was achieved using ASWA-based feature vectors in the K-NN algorithm. The correct rate of the classifier using CFTHP-based feature vectors was 39.82 %, which improved to 69.93 % by using ASWA. This corresponds to an overall improvement of 76 % in correct classification rates.
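A rough reconstruction of the feature-extraction idea (ours; ASWA's exact recipe is in the thesis): concatenate spectral features (FFT magnitudes) with wavelet sub-band energies and feed the vectors to a k-NN classifier. The PyWavelets package is assumed available.

```python
import numpy as np
import pywt                                      # PyWavelets, assumed available
from sklearn.neighbors import KNeighborsClassifier

def aswa_like_features(signal, n_fft_coeffs=8, wavelet="db4", level=3):
    spectrum = np.abs(np.fft.rfft(signal))[:n_fft_coeffs]    # spectral part
    coeffs = pywt.wavedec(signal, wavelet, level=level)      # wavelet part
    energies = np.array([np.sum(c ** 2) for c in coeffs])    # sub-band energies
    return np.concatenate([spectrum, energies])

rng = np.random.default_rng(5)
signals = [rng.normal(size=64) + cls * np.sin(np.linspace(0, 8, 64))
           for cls in (0, 1) for _ in range(20)]             # two toy vehicle classes
y = [0] * 20 + [1] * 20
X = np.array([aswa_like_features(s) for s in signals])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print("training accuracy:", clf.score(X, y))
```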

20

Mackie, Shona. "Exploiting weather forecast data for cloud detection." Thesis, University of Edinburgh, 2009. http://hdl.handle.net/1842/4350.

Abstract:
Accurate, fast detection of clouds in satellite imagery has many applications, for example Numerical Weather Prediction (NWP) and climate studies of both the atmosphere and of the Earth's surface temperature. Most operational techniques for cloud detection rely on the differences between observations of cloud and of clear sky being more or less constant in space and in time. In reality, this is not the case - different clouds have different spectral properties, and different cloud types are more or less likely in different places and at different times, depending on atmospheric conditions and on the Earth's surface properties. Observations of clear sky also vary in space and time, depending on atmospheric and surface conditions, and on the presence or absence of aerosol particles. The Bayesian approach adopted in this project allows pixel-specific physical information (for example from NWP) to be used to predict pixel-specific observations of clear sky. A physically-based, spatially- and temporally-specific probability that each pixel contains a cloud observation is then calculated (the core calculation is sketched below). An advantage of this approach is that identification of ambiguously classed pixels from a probabilistic result is straightforward, in contrast to the binary result generally produced by operational techniques. This project has developed and validated the Bayesian approach to cloud detection, and has extended the range of applications for which it is suitable, achieving skill scores that match or exceed those achieved by operational methods in every case. High temperature gradients can make observations of clear sky around ocean fronts, particularly at thermal wavelengths, appear similar to cloud observations. To address this potential source of ambiguous cloud detection results, a region of imagery acquired by the AATSR sensor, which was noted to contain some ocean fronts, was selected. Pixels in the region were clustered according to their spectral properties with the aim of separating pixels that correspond to different thermal regimes of the ocean. The mean spectral properties of pixels in each cluster were then processed using the Bayesian cloud detection technique and the resulting posterior probability of clear sky then assigned to individual pixels. Several clustering methods were investigated, and the most appropriate, which allowed pixels to be associated with multiple clusters with a normalized vector of 'membership strengths', was used to conduct a case study. The distribution of final calculated probabilities of clear sky became markedly more bimodal when clustering was included, indicating fewer ambiguous classifications, but at the cost of some single-pixel clouds being missed. While further investigations could provide a solution to this, the computational expense of the clustering method made this impractical to include in the work of this project. This new Bayesian approach to cloud detection has been successfully developed by this project to a point where it has been released under public license. Initially designed as a tool to aid retrieval of sea surface temperature from night-time imagery, this project has extended the Bayesian technique to be suitable for imagery acquired over land as well as sea, and for day-time as well as night-time imagery. This was achieved using the land surface emissivity and surface reflectance parameter products available from the MODIS sensor.

This project added a visible Radiative Transfer Model (RTM), developed at the University of Edinburgh, and a kernel-based surface reflectance model, adapted here from that used by the MODIS sensor, to the cloud detection algorithm. In addition, the cloud detection algorithm was adapted to be more flexible, making its implementation for data from the SEVIRI sensor straightforward. A database of 'difficult' cloud and clear targets, in which a wide range of both spatial and temporal locations was represented, was provided by Météo-France and used in this work to validate the extensions made to the cloud detection scheme and to compare the skill of the Bayesian approach with that of operational approaches. For night-time land and sea imagery, the Bayesian technique, with the improvements and extensions developed by this project, achieved skill scores 10% and 13% higher than Météo-France respectively. For daytime sea imagery, the skill scores were within 1% of each other for both approaches, while for land imagery the Bayesian method achieved a 2% higher skill score. The main strength of the Bayesian technique is the physical basis of the differentiation between clear and cloud observations. Using NWP information to predict pixel-specific observations of clear sky is relatively straightforward, but making such predictions for cloud observations is more complicated. The technique therefore relies on an empirical distribution rather than a pixel-specific prediction for cloud observations. To try and address this, this project developed a means of predicting cloudy observations through the fast forward-modelling of pixel-specific NWP information. All cloud fields in the pixel-specific NWP data were set to 0, and clouds were added to the profile at discrete intervals through the atmosphere, with cloud water- and ice-path (cwp, cip) also set to values spaced exponentially at discrete intervals up to saturation, and with cloud pixel fraction set to 25%, 50%, 75% and 100%. Only single-level, single-phase clouds were modelled, with the justification that the resulting distribution of predicted observations, once smoothed through considerations of uncertainties, is likely to include observations that would correspond to multi-phase and multi-level clouds. A fast RTM was run on the profile information for each of these individual clouds, and cloud altitude-, cloud pixel fraction- and channel-specific relationships between cwp (and similarly cip) and predicted observations were calculated from the results of the RTM. These relationships were used to infer predicted observations for clouds with cwp/cip values other than those explicitly forward-modelled. The parameters used to define the relationships were interpolated to define relationships for predicted observations of cloud at 10 m vertical intervals through the atmosphere, with pixel coverage ranging from 25% to 100% in increments of 1%. A distribution of predicted cloud observations is then achieved without explicit forward-modelling of an impractical number of atmospheric states. Weights are applied to the representation of individual clouds within the final Probability Density Function (PDF) in order to make the distribution of predicted observations realistic, according to the pixel-specific NWP data, and to distributions seen in a global reference dataset of NWP profiles from the European Centre for Medium-Range Weather Forecasting (ECMWF).

The distribution is then convolved with uncertainties in forward-modelling, in the NWP data, and with sensor noise to create the final PDF in observation space, from which the conditional probability that the pixel observation corresponds to a cloud observation can be read. Although a relatively fast computational implementation of the technique was achieved, the results are disappointingly poor for the SEVIRI-acquired dataset, provided by Météo-France, against which validation was carried out. This is thought to be explained by both the uncertainties in the NWP data, and the forward-modelling dependence on those uncertainties, being poorly understood and treated too optimistically in the algorithm. Including more errors in the convolution introduces the problem of quantifying those errors (a non-trivial task), and would increase the processing time, making implementation impractical. In addition, if the uncertainties considered are too high, then a PDF flatter than the empirical distribution currently used would be produced, making the technique less useful.
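The core Bayesian calculation referenced above can be written down directly; the snippet below is our schematic, with an assumed Gaussian clear-sky likelihood around the NWP-predicted observation and a placeholder cloud PDF.

```python
import numpy as np

def p_clear_given_obs(y, y_clear_pred, sigma_clear, cloud_pdf, prior_clear=0.7):
    """Posterior P(clear | y) via Bayes' rule with a Gaussian clear-sky likelihood."""
    p_y_clear = np.exp(-0.5 * ((y - y_clear_pred) / sigma_clear) ** 2) / (
        sigma_clear * np.sqrt(2 * np.pi))
    p_y_cloud = cloud_pdf(y)                 # empirical (or modelled) cloud likelihood
    num = p_y_clear * prior_clear
    return num / (num + p_y_cloud * (1.0 - prior_clear))

# Toy usage: a flat cloud likelihood over a plausible brightness-temperature range.
print(p_clear_given_obs(y=287.0, y_clear_pred=288.0, sigma_clear=1.5,
                        cloud_pdf=lambda y: 1.0 / 40.0))
```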
21

Penny, Kay Isabella. "Multivariate outlier detection in laboratory safety data." Thesis, University of Aberdeen, 1995. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.282687.

Abstract:
Clinical laboratory safety data consist of a wide range of biochemical and haematological variables which are collected to monitor the safety of a new treatment during a clinical trial. Although the data are multivariate, testing for abnormal measurements is usually done for only one variable at a time. A Monte Carlo simulation study is described, which compares 16 methods, some of which are new, for detecting multivariate outliers with a view to finding patients with an unusual set of laboratory measurements at a follow-up assessment. Multivariate normal and bootstrap simulations are used to create data sets of various dimensions. Both symmetrical and asymmetrical contamination are considered in this study. The results indicate that in addition to the routine univariate methods, it is desirable to run a battery of multivariable methods on laboratory safety data in an attempt to highlight possible outliers. Mahalanobis distance is a well-known criterion which is included in the study (see the sketch below). Appropriate critical values when testing for a single multivariate outlier using Mahalanobis distance are derived in this thesis, and the jack-knifed Mahalanobis distance is also discussed. Finally, the presence of missing data in laboratory safety data sets is the motivation behind a study which compares eight multiple imputation methods. The multiple imputation study is described, and the performance of two outlier detection methods in the presence of three different proportions of missing data is discussed. Measures are introduced for assessing the accuracy of the missing data results, depending on which method of analysis is used.
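As referenced above, a standard Mahalanobis-distance outlier screen; the chi-squared cut-off below is the usual large-sample approximation, not the exact critical values derived in the thesis.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.001):
    """Flag rows whose squared Mahalanobis distance exceeds a chi-squared cut-off."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", X - mu, cov_inv, X - mu)   # squared distances
    return np.flatnonzero(d2 > chi2.ppf(1 - alpha, df=X.shape[1]))

rng = np.random.default_rng(2)
labs = rng.normal(size=(200, 5))      # toy laboratory variables per patient
labs[13] += 5.0                       # one patient with an unusual profile
print(mahalanobis_outliers(labs))     # expected to include index 13
```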
22

Grover, Vikas. "Crime prediction and detection with data mining." Thesis, University of Portsmouth, 2009. https://researchportal.port.ac.uk/portal/en/theses/crime-prediction-and-detection-with-data-mining(51a8e1ce-3841-4288-adb2-a4e9bc6748e3).html.

Abstract:
Data mining technologies have been used by marketers to provide personalisation: in other words, the exact placement of the right offer to the right person at the right time. The police can apply this technique to direct the right inquiry to the right perpetrators at the right time, before or after a person has committed a crime. The aim of this thesis is to use data mining in operational policing for crime prediction and detection. Crime data contain rich information. However, they are inconsistent, incomplete and noisy, thus making it difficult to get any useful information from them. The goal of this thesis is to use data mining techniques on police data, which could be used for analysis when making police strategies to reduce crime activities. Volume crimes (such as robbery) are difficult to analyse because of their high number and the similarity between their Modus Operandi (MO). The methodological approach developed in this thesis will help police analysts to attribute undetected crimes to known offenders who may be responsible for committing them, with 72.9% to 93.57% accuracy. The results obtained are encouraging, demonstrating that supervised (MLP and C5.0) and unsupervised techniques (SOM) in combination give greater accuracy compared to the existing police methods. The same data mining technologies can be used, with 53.47% to 58.77% accuracy, for predicting spatio-temporal features of crimes committed by a prolific offender's network. With the time series, we were able to predict next month's volume of crimes at the top ten spatial spots with 76.4% accuracy.
23

Wong, Kuo-Hsiung Hanson. "Artifact detection in physiological parameter trend data." Thesis, Massachusetts Institute of Technology, 2003. http://hdl.handle.net/1721.1/87874.

Abstract:
Thesis (M.Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003.
Includes bibliographical references (leaves 94-95).
by Kuo-Hsiung Hanson Wong.
M.Eng.
24

Jin, Jiakun. "A Multivariate Data Stream Anomaly Detection Framework." Thesis, KTH, Skolan för elektro- och systemteknik (EES), 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-194202.

Abstract:
High-speed stream anomaly detection is an important technology used in many industry applications such as monitoring system health, detecting financial fraud, monitoring customers' unusual behavior and so on. In those scenarios multivariate data arrives at high speed and needs to be processed in real time. Since solutions for high-speed multivariate stream anomaly detection are still under development, the objective of this thesis is to introduce a framework for testing different anomaly detection algorithms. Multivariate anomaly detection usually includes two major steps: point anomaly detection and stream anomaly detection. Point anomaly detection is used to transform multivariate feature data into anomaly scores according to the recent stream of data. The stream anomaly detectors are used to detect stream anomalies based on the recent anomaly scores generated by the preceding point anomaly detector (a bare-bones example of such a stream detector is sketched below). This thesis presents a flexible framework that allows the easy integration and evaluation of different data sources, point anomaly detection algorithms and stream anomaly detection algorithms. To demonstrate the capabilities of the framework, we consider different scenarios with generators of artificial data, real industry data sets and time series data; point anomaly detectors PYISC, SVM and LOF; and stream anomaly detectors DDM, CUSUM and FCWM. The evaluation results show that among the point anomaly detectors, PYISC and LOF perform well when the distributions of features are known, while SVM performs well even when the distributions of features are not known. Among the stream anomaly detectors, DDM can produce false anomaly detections, CUSUM can fail when the stream anomalies increase slowly, while FCWM performs best, with a very low possibility of failure.
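As referenced above, a bare-bones CUSUM stream detector of the kind such a framework plugs in (our sketch, with illustrative slack and threshold values): it accumulates positive drift in the anomaly scores and raises an alarm when the cumulative sum crosses a threshold.

```python
def cusum(scores, target_mean=0.0, slack=0.5, threshold=5.0):
    """One-sided CUSUM over a stream of anomaly scores; returns alarm indices."""
    g, alarms = 0.0, []
    for t, x in enumerate(scores):
        g = max(0.0, g + (x - target_mean - slack))   # accumulate positive drift
        if g > threshold:
            alarms.append(t)
            g = 0.0                                   # restart after an alarm
    return alarms

scores = [0.1, 0.0, 0.2, 0.1] * 5 + [1.5, 1.8, 2.0, 1.7, 1.9]  # drift at the end
print(cusum(scores))    # alarm shortly after the anomaly scores start rising
```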
25

Cheng, Long. "Program Anomaly Detection Against Data-Oriented Attacks." Diss., Virginia Tech, 2018. http://hdl.handle.net/10919/84937.

Abstract:
Memory-corruption vulnerabilities are one of the most common attack vectors used to compromise computer systems. Such vulnerabilities can lead to serious security problems and will likely remain an unsolved problem for a long time. Existing memory corruption attacks can be broadly classified into two categories: i) control-flow attacks and ii) data-oriented attacks. Though data-oriented attacks have been known for a long time, the threats have not been adequately addressed, because most previous defense mechanisms focus on preventing control-flow exploits. As launching a control-flow attack becomes increasingly difficult due to many deployed defenses against control-flow hijacking, data-oriented attacks are considered an appealing attack technique for system compromise, including in the emerging embedded control systems. To counter data-oriented attacks, mitigation techniques such as memory safety enforcement and data randomization can be applied at different stages over the course of an attack. However, attacks are still possible because currently deployed defenses can be bypassed. This dissertation explores the possibility of defeating data-oriented attacks through external monitoring using program anomaly detection techniques. I start with a systematization of current knowledge about exploitation techniques of data-oriented attacks and the applicable defense mechanisms. Then, I address three research problems in program anomaly detection against data-oriented attacks. First, I address the problem of securing control programs in Cyber-Physical Systems (CPS) against data-oriented attacks. I describe a new security methodology that leverages the event-driven nature of CPS control programs in characterizing their behaviors. By enforcing runtime cyber-physical execution semantics, our method detects data-oriented exploits when physical events are inconsistent with the runtime program behaviors. Second, I present a statistical program behavior modeling framework for frequency anomaly detection, where frequency anomalies are the direct consequence of many non-control-data attacks. Specifically, I describe two statistical program behavior models, sFSA and sCFT, at different granularities. Our method combines the local and long-range models to improve the robustness against data-oriented attacks and significantly increase the difficulty of bypassing the anomaly detection system (a highly simplified behavior-model example is sketched below). Third, I focus on defending against data-oriented programming (DOP) attacks using Intel Processor Trace (PT). DOP is a recently proposed advanced technique to construct expressive non-control-data exploits. I first demystify the DOP exploitation technique and show its complexity and rich expressiveness. Then, I design and implement the DeDOP anomaly detection system, and demonstrate its detection capability against the real-world ProFTPd DOP attack.
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
26

Yan, Yizhou. "Contextual Outlier Detection from Heterogeneous Data Sources." Digital WPI, 2020. https://digitalcommons.wpi.edu/etd-dissertations/598.

Full text
Abstract:
The dissertation focuses on detecting contextual outliers from heterogeneous data sources. Modern sensor-based applications such as Internet of Things (IoT) applications and autonomous vehicles are generating a huge amount of heterogeneous data, including not only structured multi-variate data points but also other complex types of data such as time-stamped sequence data and image data. Detecting outliers from such data sources is critical to diagnose and fix malfunctioning systems, prevent cyber attacks, and save human lives. The outlier detection techniques in the literature typically are unsupervised algorithms with a pre-defined logic, such as leveraging the probability density at each point to detect outliers. Our analysis of modern applications reveals that this rigid probability density-based methodology has severe drawbacks. That is, low probability density objects are not necessarily outliers, while objects with relatively high probability densities might in fact be abnormal. In many cases, the determination of the outlierness of an object has to take the context in which this object occurs into consideration. Within this scope, my dissertation focuses on four research innovations, namely techniques and systems for scalable contextual outlier detection from multi-dimensional data points, contextual outlier pattern detection from sequence data, contextual outlier image detection from image data sets, and lastly an integrative end-to-end outlier detection system capable of automatic outlier detection, outlier summarization and outlier explanation. 1. Scalable Contextual Outlier Detection from Multi-dimensional Data. Mining contextual outliers from big datasets is a computationally expensive process because of the complex recursive kNN search used to define the context of each point. In this research, leveraging the power of distributed compute clusters, we design distributed contextual outlier detection strategies that optimize the key factors determining the efficiency of local outlier detection, namely localizing the kNN search while still ensuring load balancing. 2. Contextual Outlier Detection from Sequence Data. For big sequence data, such as messages exchanged between devices and servers and log files measuring complex system behaviors over time, outliers typically occur as a subsequence of symbolic values (or sequential pattern), in which each individual value itself may be completely normal. However, existing sequential pattern mining semantics tend to mis-classify outlier patterns as typical patterns due to ignoring the context in which the pattern occurs. In this dissertation, we present new context-aware pattern mining semantics and then design efficient mining strategies to support these new semantics. In addition, methodologies that continuously extract these outlier patterns from sequence streams are also developed. 3. Contextual Outlier Detection from Image Data. An image classification system not only needs to accurately classify objects from target classes, but also should safely reject unknown objects that belong to classes not present in the training data. Here, the training data defines the context of the classifier, and unknown objects then correspond to contextual image outliers. Although the existing Convolutional Neural Network (CNN) achieves high accuracy when classifying known objects, the sum operation on multiple features produced by the convolutional layers causes an unknown object to be classified into a target class with high confidence even if it matches some key features of a target class only by chance. In this research, we design an Unknown-aware Deep Neural Network (UDN for short) to detect contextual image outliers. The key idea of UDN is to enhance the existing Convolutional Neural Network (CNN) to support a product operation that models the product relationship among the features produced by convolutional layers. This way, missing a single key feature of a target class will greatly reduce the probability of assigning an object to this class. To further improve the performance of our UDN at detecting contextual outliers, we propose an information-theoretic regularization strategy that incorporates the objective of rejecting unknowns into the learning process of UDN. 4. An End-to-end Integrated Outlier Detection System. Although numerous detection algorithms have been proposed in the literature, no single approach brings the wealth of these alternate algorithms to bear in an integrated infrastructure to support versatile outlier discovery. In this work, we design the first end-to-end outlier detection service that integrates outlier-related services, including automatic outlier detection, outlier summarization and explanation, and human-guided outlier detector refinement, within one integrated outlier discovery paradigm. Experimental studies including performance evaluation and user studies conducted on benchmark outlier detection datasets and real-world datasets including Geolocation, Lighting, MNIST, CIFAR and the Log file datasets confirm both the effectiveness and efficiency of the proposed approaches and systems.
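As a toy illustration of the kNN-based local outlier scoring that underlies the first innovation, the sketch below uses scikit-learn's LocalOutlierFactor, whose neighbourhood-based scores capture the local context of each point. The two-cluster data and the n_neighbors value are invented for illustration; the dissertation's distributed detector is not reproduced here.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)),      # broad cluster at origin
               rng.normal(6, 0.2, (50, 2)),     # tight cluster at (6, 6)
               [[3.0, 3.0]]])                   # lone point between them

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                     # -1 marks outliers
scores = -lof.negative_outlier_factor_          # higher = more anomalous
print("most anomalous point:", X[np.argmax(scores)])
```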
APA, Harvard, Vancouver, ISO, and other styles
27

Martignano, Anna. "Real-time Anomaly Detection on Financial Data." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2020. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-281832.

Full text
Abstract:
This work presents an investigation of tailoring Network Representation Learning (NRL) for an application in the Financial Industry. NRL approaches are data-driven models that learn how to encode graph structures into low-dimensional vector spaces, which can be further exploited by downstream Machine Learning applications. They can potentially bring a lot of benefits to the Financial Industry since they automatically extract features, called embeddings, that provide useful input regarding graph structures. Financial transactions can be represented as a network, and through NRL it is possible to extract embeddings that reflect the intrinsic inter-connected nature of economic relationships. Such embeddings can be used for several purposes, among which Anomaly Detection to fight financial crime. This work provides a qualitative analysis of state-of-the-art NRL models, which identifies the Graph Convolutional Network (ConvGNN) as the most suitable category of approaches for the Financial Industry, though with a certain need for further improvement. The Financial Industry poses additional challenges when modelling an NRL solution. In addition to the need for a scalable solution able to handle real-world graphs of considerable size, several characteristics must be taken into consideration: transaction graphs are inherently dynamic, since new transactions are executed every day, and nodes can be heterogeneous. Besides, everything is further complicated by the need to have updated information in (near) real time due to the sensitivity of the application domain. For these reasons, GraphSAGE, an inductive ConvGNN model, has been taken as a base for the experiments. Two variants of GraphSAGE are presented: a dynamic variant whose weights evolve according to the input sequence of graph snapshots, and a variant specifically meant to handle bipartite graphs. These variants have been evaluated by applying them to real-world data and leveraging the generated embeddings to perform Anomaly Detection. The experiments demonstrate that leveraging these variants leads to results comparable with other state-of-the-art approaches, while having the advantage of being suitable for real-world financial data sets.
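The downstream step described above can be sketched in a hedged way: given node embeddings produced by any NRL model (GraphSAGE is simulated here by random vectors), score transaction nodes for anomalies. IsolationForest is a stand-in scorer chosen for brevity, not the thesis's detector.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)
# pretend embeddings: 500 normal accounts plus 5 far-off suspicious ones
embeddings = np.vstack([rng.normal(0, 1, (500, 32)),
                        rng.normal(5, 1, (5, 32))])

scorer = IsolationForest(random_state=0).fit(embeddings)
labels = scorer.predict(embeddings)              # -1 marks anomalies
print("flagged node indices:", np.flatnonzero(labels == -1))
```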
APA, Harvard, Vancouver, ISO, and other styles
28

Patcha, Animesh. "Network Anomaly Detection with Incomplete Audit Data." Diss., Virginia Tech, 2006. http://hdl.handle.net/10919/28334.

Full text
Abstract:
With the ever-increasing deployment and usage of gigabit networks, traditional network anomaly detection-based intrusion detection systems have not scaled accordingly. Most, if not all, systems deployed assume the availability of complete and clean data for the purpose of intrusion detection. We contend that this assumption is not valid. Factors like noise in the audit data, mobility of the nodes, and the large amount of data generated by the network make it difficult to build a normal traffic profile of the network for the purpose of anomaly detection. From this perspective, the leitmotif of the research effort described in this dissertation is the design of a novel intrusion detection system that has the capability to detect intrusions with high accuracy even when complete audit data is not available. In this dissertation, we take a holistic approach to anomaly detection to address the threats posed by network-based denial-of-service attacks by proposing improvements in every step of the intrusion detection process. At the data collection phase, we have implemented an adaptive sampling scheme that intelligently samples incoming network data to reduce the volume of traffic sampled, while maintaining the intrinsic characteristics of the network traffic. A Bloom-filter-based fast flow aggregation scheme is employed at the data pre-processing stage to further reduce the response time of the anomaly detection scheme. Lastly, this dissertation also proposes an expectation-maximization-based anomaly detection scheme that uses the sampled audit data to detect intrusions in the incoming network traffic.
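The Bloom-filter idea behind fast flow aggregation can be sketched in a few lines: a compact bit array answers "have we seen this flow before?" with no false negatives and a tunable false positive rate. The hash construction and sizes below are illustrative assumptions, not the dissertation's implementation.

```python
import hashlib

class BloomFilter:
    def __init__(self, size=8192, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def _indexes(self, item):
        # derive several indexes from salted SHA-256 digests
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def __contains__(self, item):
        return all(self.bits[idx] for idx in self._indexes(item))

seen_flows = BloomFilter()
flow = ("10.0.0.1", "10.0.0.2", 443, "tcp")
if flow not in seen_flows:        # new flow: start a new aggregate record
    seen_flows.add(flow)
```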
Ph. D.
APA, Harvard, Vancouver, ISO, and other styles
29

Salzwedel, Jason Paul. "Anomaly detection in a mobile data network." Master's thesis, Faculty of Science, 2019. http://hdl.handle.net/11427/31202.

Full text
Abstract:
The dissertation investigated the creation of an anomaly detection approach to identify anomalies in the SGW elements of an LTE network. Unsupervised techniques were compared and used to identify and remove anomalies in the training data set. This "cleaned" data set was then used to train an autoencoder in a semi-supervised approach. The resultant autoencoder was able to identify normal observations. A subsequent data set was then analysed by the autoencoder. The resultant reconstruction errors were then compared to the ground truth events to investigate the effectiveness of the autoencoder's anomaly detection capability.
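A minimal sketch of this semi-supervised recipe, assuming Keras is available: train an autoencoder on the cleaned "normal" data and flag observations whose reconstruction error exceeds a percentile threshold. The layer sizes, the threshold rule, and the synthetic data are assumptions for illustration, not the thesis's configuration.

```python
import numpy as np
from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Dense

rng = np.random.default_rng(2)
X_train = rng.normal(0, 1, (1000, 16)).astype("float32")   # "cleaned" data
X_test = np.vstack([rng.normal(0, 1, (95, 16)),            # normal
                    rng.normal(4, 1, (5, 16))]).astype("float32")  # anomalous

ae = Sequential([Input(shape=(16,)),
                 Dense(8, activation="relu"),
                 Dense(4, activation="relu"),    # bottleneck
                 Dense(8, activation="relu"),
                 Dense(16)])
ae.compile(optimizer="adam", loss="mse")
ae.fit(X_train, X_train, epochs=20, batch_size=32, verbose=0)

train_err = np.mean((ae.predict(X_train, verbose=0) - X_train) ** 2, axis=1)
threshold = np.percentile(train_err, 99)        # tolerate ~1% false alarms
test_err = np.mean((ae.predict(X_test, verbose=0) - X_test) ** 2, axis=1)
print("flagged anomalies:", np.flatnonzero(test_err > threshold))
```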
APA, Harvard, Vancouver, ISO, and other styles
30

Pyon, Yoon Soo. "Variant Detection Using Next Generation Sequencing Data." Case Western Reserve University School of Graduate Studies / OhioLINK, 2013. http://rave.ohiolink.edu/etdc/view?acc_num=case1347053645.

Full text
APA, Harvard, Vancouver, ISO, and other styles
31

Sperl, Ryan E. "Hierarchical Anomaly Detection for Time Series Data." Wright State University / OhioLINK, 2020. http://rave.ohiolink.edu/etdc/view?acc_num=wright1590709752916657.

Full text
APA, Harvard, Vancouver, ISO, and other styles
32

ZANONI, MARCO. "Data mining techniques for design pattern detection." Doctoral thesis, Università degli Studi di Milano-Bicocca, 2012. http://hdl.handle.net/10281/31515.

Full text
Abstract:
The main objective of design pattern detection is to gain better comprehension of a software system, and of the kind of problems addressed during the development of the system itself. Design patterns have informal specifications, leading to many implementation variants caused by the subjective interpretation of the pattern by developers. This thesis applies a supervised classification approach to make the detection more subjective, bringing to developers the patterns they want to find, ranked by a confidence value.
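In the spirit of this supervised, confidence-ranked detection, a sketch might train a classifier on labelled pattern instances and return candidates ordered by predicted probability. The random forest, the synthetic features, and the labels below are illustrative stand-ins for the thesis's actual classifier and code metrics.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))           # metrics of candidate role pairs
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # 1 = labelled instance of the pattern

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
candidates = rng.normal(size=(20, 10))
confidence = clf.predict_proba(candidates)[:, 1]
ranking = np.argsort(confidence)[::-1]   # most confident matches first
print(list(zip(ranking[:5], confidence[ranking[:5]].round(2))))
```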
APA, Harvard, Vancouver, ISO, and other styles
33

曾偉明 and Wai-ming Peter Tsang. "Computer aided ultrasonic flaw detection and characterization." Thesis, The University of Hong Kong (Pokfulam, Hong Kong), 1987. http://hub.hku.hk/bib/B31231007.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Peng, Qinmu. "Visual attention: saliency detection and gaze estimation." HKBU Institutional Repository, 2015. https://repository.hkbu.edu.hk/etd_oa/207.

Full text
Abstract:
Visual attention is an important characteristic of the human vision system, which is capable of allocating cognitive resources to selected information. Many researchers are attracted to the study of this mechanism and have achieved a wide range of successful applications. Generally, two tasks are encountered in visual attention research: visual saliency detection and gaze estimation. The former is normally described as distinctiveness or prominence resulting from a visual stimulus. Given images or videos as input, saliency detection methods try to simulate the mechanism of the human vision system, predicting and locating the salient parts in them. The latter involves a physical device to track the eye movement and estimate the gaze points. Saliency detection is an effective technique for studying and mimicking the mechanism of the human vision system. Most saliency models can predict the visual saliency with the boundary or the rough location of the true salient object, but miss the appearance or shape information. Besides, they pay little attention to image quality problems such as low resolution or noise. To handle these problems, in this thesis we propose to model the visual saliency from local and global perspectives for better detection of the visual saliency. The combination of the local and global saliency schemes, employing different visual cues, can make full use of their respective advantages to compute the saliency. Compared with existing models, the proposed method can provide better saliency with more appearance and shape information, and can work well even on low-resolution or noisy images. The experimental results demonstrate the superiority of the proposed algorithm. Next, video saliency detection is another issue for visual saliency computation. Numerous works have been proposed to extract the video saliency for object detection tasks. However, one might not be able to obtain desirable saliency for inferring the region of foreground objects when the video presents low contrast or a complicated background. Thus, this thesis develops a salient object detection approach with less demanding assumptions, which gives higher detection performance. The method computes the visual saliency in each frame using a weighted multiple manifold ranking algorithm. It then computes motion cues to estimate the motion saliency and localization prior. By adopting a new energy function, the data term depends on the visual saliency and localization prior, and the smoothness term depends on the constraints in time and space. Compared to existing methods, our approach automatically segments the persistent foreground object while preserving the potential shape. We apply our method to challenging benchmark videos, and show competitive or better results than the existing counterparts. Additionally, to address the problem of gaze estimation, we present a low-cost and efficient approach to obtain the gaze point. As opposed to gaze estimation techniques that require specific hardware (e.g., a high-resolution infrared camera and infrared light sources) as well as a cumbersome calibration process, we concentrate on visible imaging and present an approach for gaze estimation using a web camera in a desktop environment. We combine intensity energy and edge strength to locate the iris center and utilize a piecewise eye corner detector to detect the eye corner.
To compensate for head movement causing gaze error, we adopt a sinusoidal head model (SHM) to simulate the 3D head shape, and propose adaptive weighted facial features embedded in the pose from the orthography and scaling with iterations algorithm (AWPOSIT), whereby the head pose can be estimated. Consequently, the gaze estimation is obtained by integrating the eye vector and head movement information. The proposed method is not sensitive to lighting conditions, and the experimental results show the efficacy of the proposed approach.
APA, Harvard, Vancouver, ISO, and other styles
35

Shtarkalev, Bogomil Iliev. "Single data set detection for multistatic Doppler radar." Thesis, University of Edinburgh, 2015. http://hdl.handle.net/1842/10556.

Full text
Abstract:
The aim of this thesis is to develop and analyse single data set (SDS) detection algorithms that can utilise the advantages of widely-spaced (statistical) multiple-input multiple-output (MIMO) radar to increase their accuracy and performance. The algorithms make use of the observations obtained from multiple space-time adaptive processing (STAP) receivers and focus on covariance estimation and inversion to perform target detection. One of the main interferers for a Doppler radar has always been the radar's own signal being reflected off the surroundings. The reflections of the transmitted waveforms from the ground and other stationary or slowly-moving objects in the background generate observations that can potentially raise false alarms. This creates the problem of searching for a target in both additive white Gaussian noise (AWGN) and highly-correlated (coloured) interference. Traditional STAP deals with the problem by using target-free training data to study this environment and build its characteristic covariance matrix. The data usually comes from range gates neighbouring the cell under test (CUT). In non-homogeneous or non-stationary environments, however, this training data may not reflect the statistics of the CUT accurately, which justifies the need to develop SDS methods for radar detection. The maximum likelihood estimation detector (MLED) and the generalised maximum likelihood estimation detector (GMLED) are two reduced-rank STAP algorithms that eliminate the need for training data when mapping the statistics of the background interference. The work in this thesis is largely based on these two algorithms. The first work derives the optimal maximum likelihood (ML) solution to the target detection problem when the MLED and GMLED are used in a multistatic radar scenario. This application assumes that the spatio-temporal Doppler frequencies produced in the individual bistatic STAP pairs of the MIMO system are ideally synchronised. Therefore the focus is on providing the multistatic outcome to the target detection problem. It is shown that the derived MIMO detectors possess the desirable constant false alarm rate (CFAR) property. Gaussian approximations to the statistics of the multistatic MLED and GMLED are derived in order to provide a more in-depth analysis of the algorithms. The viability of the theoretical models and their approximations is tested against a numerical simulation of the systems. The second work focuses on the synchronisation of the spatio-temporal Doppler frequency data from the individual bistatic STAP pairs in the multistatic MLED scenario. It expands the idea to a form that could be implemented in a practical radar scenario. To reduce the information shared between the bistatic STAP channels, a data compression method is proposed that extracts the significant contributions of the MLED likelihood function before transmission. To perform the inter-channel synchronisation, the Doppler frequency data is projected into the space of potential target velocities where the multistatic likelihood is formed. Based on the expected structure of the velocity likelihood in the presence of a target, a modification to the multistatic MLED is proposed. It is demonstrated through numerical simulations that the proposed modified algorithm performs better than the basic multistatic MLED while having the benefit of reducing the data exchange in the MIMO radar system.
APA, Harvard, Vancouver, ISO, and other styles
36

Kutzner, Kendy. "Processing MODIS Data for Fire Detection in Australia." Thesis, Universitätsbibliothek Chemnitz, 2002. http://nbn-resolving.de/urn:nbn:de:bsz:ch1-200200831.

Full text
Abstract:
The aim of this work was to use remote sensing data from the MODIS instrument of the Terra satellite to detect bush fires in Australia. This included preprocessing the demodulator output, bit synchronization and reassembly of data packets. IMAPP was used to do the geolocation and data calibration. The fire detection used a combination of fixed threshold techniques with difference tests and background comparisons. The results were projected onto a rectangular latitude/longitude map to remedy the bow-tie effect. Algorithms were implemented in C and Matlab. It proved to be possible to detect fires in the available data. The results were compared with fire detection done by NASA and fire detections based on other sensors, and were found to be very similar.
APA, Harvard, Vancouver, ISO, and other styles
37

Svedberg, Oskar. "Automatic detection of ULF waves in Cluster data." Thesis, KTH, Rymd- och plasmafysik, 2007. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-91550.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Almutairi, Abdulrazaq Z. "Improving intrusion detection systems using data mining techniques." Thesis, Loughborough University, 2016. https://dspace.lboro.ac.uk/2134/21313.

Full text
Abstract:
Recent surveys and studies have shown that cyber-attacks have caused a lot of damage to organisations, governments, and individuals around the world. Although developments are constantly occurring in the computer security field, cyber-attacks still cause damage as they are developed and evolved by hackers. This research looked at some industrial challenges in the intrusion detection area. The research identified two main challenges: the first is that signature-based intrusion detection systems such as SNORT lack the capability of detecting attacks with new signatures without human intervention; the second is related to multi-stage attack detection, where signature-based detection has been found to be inefficient. The novelty of this research lies in the methodologies developed to tackle these challenges. The first challenge was handled by developing a multi-layer classification methodology. The first layer is based on a decision tree, while the second layer is a hybrid module that uses two data mining techniques: a neural network and fuzzy logic. The second layer tries to detect new attacks in case the first one fails to detect them. This system detects attacks with new signatures, and then updates the SNORT signature holder automatically, without any human intervention. The results show that a high detection rate was obtained for attacks with new signatures; however, the false positive rate needs to be lowered. The second challenge was approached by evaluating IP information using fuzzy logic. This approach looks at the identity of participants in the traffic, rather than the sequence and contents of the traffic. The results show that this approach can help in predicting attacks at very early stages in some scenarios. However, it was found that combining this approach with one that looks at the sequence and contents of the traffic, such as event correlation, would achieve better performance than each approach individually.
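A simplified sketch of the multi-layer idea follows: a decision tree classifies traffic it is confident about, and uncertain cases fall through to a second model. The fuzzy-logic component is reduced to a probability threshold here, so this is a rough analogue of the methodology rather than the thesis's architecture; all data and parameters are invented.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 12))
y = (X[:, 0] * X[:, 1] > 0).astype(int)      # 1 = attack (synthetic labels)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
nn = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                   random_state=0).fit(X, y)

def classify(x, confident=0.9):
    """Layer 1 answers when sure; otherwise defer to layer 2."""
    p = tree.predict_proba(x.reshape(1, -1))[0]
    if p.max() >= confident:
        return int(np.argmax(p))
    return int(nn.predict(x.reshape(1, -1))[0])

print(classify(X[0]))
```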
APA, Harvard, Vancouver, ISO, and other styles
39

Ying, Yeqiu. "Synchronization and data detection in wireless sensor networks." Thesis, University of Leeds, 2007. http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.485187.

Full text
Abstract:
Wireless sensor networks (WSNs) have been envisioned as one of the most important emerging technologies that can greatly impact the world. With the recent advancement in both electronics and wireless communication networks, implementing WSNs in practical applications has become feasible and can be expected in the near future. However, current communications protocols are not suitable for use in WSNs due to the unique characteristics and the system constraints such as low power consumption, and low computational and hardware complexity. In this thesis, we focus on the physical (PHY) layer design issues, including transmission medium selection and transceiver design. Specifically, we first study a WSN architecture with a centralized topology. Motivated by the fact that, if not properly treated, carrier frequency offset (CFO) and multipath channels can cause great degradation of data detection performance in conventional carrier-based radio systems (narrow-band and wide-band systems), we address CFO and channel estimation for multiple slave sensor nodes. Relying on a unique TDMA-like training head pattern, the joint multi-user CFO and channel estimation problem can be easily decoupled. Furthermore, the joint CFO and channel estimation for each slave sensor can also be treated separately without significant performance degradation. Different CFO and channel estimators are derived and compared. Optimal training design, specifically the pilot symbol placement, for burst transmission systems is also investigated, and an equal-preamble-postamble (EPP) placement scheme is shown to be optimal. In the second half of the thesis, the emerging ultra-wideband (UWB) radio technology is investigated in the context of WSNs. We believe that this new radio technology is a strong candidate for WSN applications due to its unique advantages. The modulation and receiver schemes are studied, and block-coded modulation and a novel noncoherent receiver are proposed for impulse radio (IR) UWB systems. The critical challenge of timing synchronization for IR-UWB signals is also studied, and a new code-assisted synchronization scheme is proposed. This semi-analog synchronization scheme enables the usage of both coherent and noncoherent receivers, and can be executed in either blind or data-aided mode. In conclusion, this research work is expected to favorably impact the theory, design and implementation of communication transceivers for practical WSNs.
APA, Harvard, Vancouver, ISO, and other styles
40

Prestberg, Lars. "Automatisk sammanställning av mätbara data : Intrusion detection system." Thesis, Mittuniversitetet, Avdelningen för informations- och kommunikationssystem, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-28254.

Full text
Abstract:
The project is carried out at IT-säkerhetsbolaget i Skandinavien AB; part of their product range is a cyber alarm, parts of which are to be automated in order to present information to customers in a smoother way. The purpose is to offer customers more value for money, which also provides an additional selling point for the product. The cyber alarm is, simply put, an Intrusion Detection System that reads traffic on a network and alerts the operator if something suspicious happens on the network. From the database in which all information is stored, graphs and tables are created as an overview of the network; this information is to be sent to customers on a weekly basis, which is done through a Python script and a number of open-source programs. The results show that the automated way of performing the task takes 5.5% of the time it took to create a delivered graph page with the original method. Compared with the proposed manual method, for three sensors, the automated method took 11% of the time. When only the creation of the PDF was performed, the automated method took 82.1% and 69.7% of the manual time for one and three sensors, respectively.
APA, Harvard, Vancouver, ISO, and other styles
41

SZEKÉR, MÁTÉ. "Spatio-temporal outlier detection in streaming trajectory data." Thesis, KTH, Skolan för datavetenskap och kommunikation (CSC), 2014. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-155739.

Full text
Abstract:
This thesis investigates the problem of detecting spatio-temporal anomalies in streamed trajectory data using both supervised and unsupervised algorithms. Anomaly detection can be understood as an unsupervised classification problem which requires knowledge of the normal course of events or of how the anomalies manifest themselves. To this end, an algorithm is proposed to identify the normative pattern in a streamed dataset. A non-parametric algorithm based on SVM is proposed for classifying trajectories based on their explicit geometric properties alone. A parametric algorithm based on dynamic Markov chains is presented for analysing trajectories based on their semantics. Two methods are proposed to fade the Markov chains so that new behaviours can be modelled and obsolete behaviours can be forgotten. Both the non-parametric and parametric approaches are evaluated using both a synthetic and a real-life dataset. Fading the Markov chains turns out to be essential in order to accurately detect anomalies in a dynamic dataset.
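One plausible reading of the fading Markov chain idea is sketched below: transition counts decay exponentially so that obsolete behaviours are forgotten, and a transition's anomaly score is its negative log-probability under the current counts. The decay rule and parameters are assumptions, not the thesis's exact fading methods.

```python
from collections import defaultdict
import math

class FadingMarkovChain:
    def __init__(self, decay=0.99):
        self.decay = decay
        self.counts = defaultdict(lambda: defaultdict(float))

    def update(self, prev_state, state):
        for row in self.counts.values():      # fade all existing counts
            for s in row:
                row[s] *= self.decay
        self.counts[prev_state][state] += 1.0

    def anomaly_score(self, prev_state, state):
        row = self.counts[prev_state]
        total = sum(row.values())
        prob = row[state] / total if total else 0.0
        return -math.log(prob + 1e-9)         # rare transition: large score

mc = FadingMarkovChain()
for a, b in [("A", "B"), ("B", "C")] * 50:    # A->B->C is the norm
    mc.update(a, b)
print(mc.anomaly_score("A", "B"))             # small (expected transition)
print(mc.anomaly_score("A", "C"))             # large (never observed)
```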
APA, Harvard, Vancouver, ISO, and other styles
42

Purwar, Yashasvi. "Data based abnormality detection." Master's thesis, 2011. http://hdl.handle.net/10048/1858.

Full text
Abstract:
Data-based abnormality detection is a growing research field focussed on extracting information from feature-rich data. Such methods are considered non-intrusive and non-destructive in nature, which gives them a clear advantage over conventional methods. In this study, we explore different streams of data-based anomaly detection. We propose extensions and revisions to an existing valve stiction detection algorithm, supported by an industrial case study. We also explore the area of image analysis and propose a complete solution for malaria diagnosis. The proposed method is tested on images provided by a pathology laboratory at Alberta Health Services. We also address the robustness and practicality of the proposed solution.
Process Control
APA, Harvard, Vancouver, ISO, and other styles
43

"Detection statistics for multichannel data." Research Laboratory of Electronics, Massachusetts Institute of Technology, 1989. http://hdl.handle.net/1721.1/4202.

Full text
Abstract:
Tae Hong Joo.
Also issued as Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1989.
Includes bibliographical references (p. 109-113).
Research supported by the Advanced Research Projects Agency monitored by the Office of Naval Research, the National Science Foundation, Sanders Associates, Inc., and the Amoco Foundation.
APA, Harvard, Vancouver, ISO, and other styles
44

Kurt, Mehmet Necip. "Data-Driven Quickest Change Detection." Thesis, 2020. https://doi.org/10.7916/d8-yz99-3e67.

Full text
Abstract:
The quickest change detection (QCD) problem is to detect abrupt changes in a sensing environment as quickly as possible in real time while limiting the risk of false alarm. Statistical inference about the monitored stochastic process is performed through observations acquired sequentially over time. After each observation, the QCD algorithm either stops and declares a change or continues to take a further observation in the next time interval. There is an inherent tradeoff between speed and accuracy in the decision-making process. The design goal is to optimally balance the average detection delay and the false alarm rate to have a timely and accurate response to abrupt changes. The objective of this thesis is to investigate effective and scalable QCD approaches for real-world data streams. The classical QCD framework is model-based, that is, the statistical data model is assumed to be known for both the pre- and post-change cases. However, real-world data often exhibit significant challenges for data modeling such as high dimensionality, complex multivariate nature, lack of parametric models, unknown post-change (e.g., attack or anomaly) patterns, and complex temporal correlation. Further, in some cases, data is privacy-sensitive and distributed over a system, and it is not fully available to the QCD algorithm. This thesis addresses these challenges and proposes novel data-driven QCD approaches that are robust to data model mismatch and hence widely applicable to a variety of practical settings. In Chapter 2, online cyber-attack detection in the smart power grid is formulated as a partially observable Markov decision process (POMDP) problem based on the QCD framework. A universal robust online cyber-attack detection algorithm is proposed using model-free reinforcement learning (RL) for POMDPs. In Chapter 3, online anomaly detection for big data streams is studied where the nominal (i.e., pre-change) and anomalous (i.e., post-change) high-dimensional statistical data models are unknown. A data-driven solution approach is proposed, where firstly a set of useful univariate summary statistics is computed from a nominal dataset in an offline phase and next, online summary statistics are evaluated for a persistent deviation from the nominal statistics. In Chapter 4, a generic data-driven QCD procedure is proposed, called DeepQCD, that learns the change detection rule directly from the observed raw data via deep recurrent neural networks. With a sufficient amount of training data including both pre- and post-change samples, DeepQCD can effectively learn the change detection rule for all complex, high-dimensional, and temporally correlated data streams. Finally, in Chapter 5, online privacy-preserving anomaly detection is studied in a setting where the data is distributed over a network and locally sensitive to each node, and its statistical model is unknown. A data-driven differentially private distributed detection scheme is proposed, which infers network-wide anomalies based on the perturbed and encrypted statistics received from nodes. Furthermore, the analytical privacy-security tradeoff in the network-wide anomaly detection problem is investigated.
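A minimal sketch of the Chapter 3 style of data-driven detection, under invented parameters: summary statistics (here per-dimension mean and standard deviation) are learned from a nominal dataset offline, and an alarm is raised online when the windowed deviation from the nominal statistics persists. The window size and threshold are assumptions, not the thesis's values.

```python
import numpy as np

rng = np.random.default_rng(5)
nominal = rng.normal(0, 1, (5000, 3))                 # offline phase
mu, sigma = nominal.mean(axis=0), nominal.std(axis=0)

def online_monitor(stream, window=30, threshold=1.0):
    buf = []
    for t, x in enumerate(stream):
        buf.append((x - mu) / sigma)                  # standardized sample
        if len(buf) > window:
            buf.pop(0)
        # persistent deviation of the windowed mean from its nominal value 0
        if len(buf) == window and np.abs(np.mean(buf, axis=0)).max() > threshold:
            return t                                  # declare a change
    return None

stream = np.vstack([rng.normal(0, 1, (200, 3)),       # pre-change segment
                    rng.normal(2, 1, (100, 3))])      # post-change segment
print("change declared at t =", online_monitor(stream))
```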
APA, Harvard, Vancouver, ISO, and other styles
45

Pienaar, Abel Jacobus. "Fraud detection using data mining." Thesis, 2014. http://hdl.handle.net/10210/9112.

Full text
Abstract:
M.Com. (Computer Auditing)
Fraud is a major problem in South Africa and the world, and organisations lose millions each year to fraud that goes undetected. Organisations can deal with the fraud that is known to them, but undetected fraud is a problem. Management and external and internal auditors need to detect fraud within an organisation, and there is a further need for an integrated fraud detection model to assist them. A literature study was done of authoritative textbooks and other literature on fraud detection and data mining, including the knowledge discovery in databases process, and a model was developed to assist the manager and auditor in detecting fraud in an organisation using data mining, a technology that makes the process of fraud detection more efficient and effective.
APA, Harvard, Vancouver, ISO, and other styles
46

Sharma, Khushboo. "Outlier Detection for Categorical Data." Thesis, 2017. http://ethesis.nitrkl.ac.in/8836/1/2017_MT_KSharma.pdf.

Full text
Abstract:
Outlier detection, or anomaly detection, is a very important process for detecting instances with unexpected behavior in a given system. For many years, outlier detection has received significant attention due to its applications in various areas such as credit card fraud in the banking sector, illegal access in the networking field, data analysis in the medical field, weather prediction, etc. Many techniques have been developed to detect outliers. However, most existing techniques focus on numerical data and cannot be applied directly to categorical data because of the difficulty of defining a meaningful similarity measure for categorical data. Also, high-dimensional categorical data impose significant challenges due to their unique discreteness. To handle this type of data we can use entropy-related measures. The concept of entropy builds on the probabilistic interpretation of the data distribution and quantifies the variation or diversity of a discrete variable. For outlier detection, we applied a simple and effective ranking-based algorithm based on entropy and mutual information, and analyzed its time complexity. Experimental results on the car evaluation data set and two other data sets demonstrate the effectiveness and efficiency of our algorithm.
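One simple entropy-based ranking consistent with this description is sketched below: each record is scored by how much the total attribute entropy drops when the record is removed, so records carrying rare values rank first. The thesis's exact measure also involves mutual information, which this toy version omits; the data and function names are invented.

```python
from collections import Counter
import math

def entropy(counter, n):
    return -sum((c / n) * math.log(c / n) for c in counter.values())

def outlier_ranking(records):
    n = len(records)
    cols = [Counter(col) for col in zip(*records)]   # one counter per attribute
    scores = []
    for rec in records:
        score = 0.0
        for counter, value in zip(cols, rec):
            without = Counter(counter)               # counts without this record
            without[value] -= 1
            if without[value] == 0:
                del without[value]
            # entropy drop caused by removing the record's value
            score += entropy(counter, n) - entropy(without, n - 1)
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

data = [("red", "suv"), ("red", "suv"), ("red", "van"),
        ("red", "suv"), ("blue", "jet")]             # last record is odd
print(outlier_ranking(data))                         # index 4 ranks first
```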
APA, Harvard, Vancouver, ISO, and other styles
47

Sousa, Maria Inês Neves de. "Data mining for anomaly detection in maritime traffic data." Master's thesis, 2018. http://hdl.handle.net/10400.26/25059.

Full text
Abstract:
For the past few years, oceans have once again become an important means of communication and transport. In fact, traffic density throughout the globe has grown substantially, which has raised some concerns. With this expansion, the need to achieve a high Maritime Situational Awareness (MSA) is imperative. At the present time, this need may be more easily fulfilled thanks to the vast amount of data available regarding maritime traffic. However, this brings in another issue: data overload. Currently, there are so many data sources, and so much data to obtain information from, that operators cannot handle it. There is a pressing need for systems that help to sift through all the data, analysing and correlating it, and thereby supporting the decision-making process. In this dissertation, the main goal is to use different sources of data in order to detect anomalies and contribute to a clear Recognised Maritime Picture (RMP). In order to do so, it is necessary to know what types of data exist and which ones are available for further analysis. The data chosen for this dissertation were Automatic Identification System (AIS) and Monitorização Contínua das Atividades da Pesca (MONICAP) data, the latter also known as Vessel Monitoring System (VMS) data. In order to store one year's worth of AIS and MONICAP data, a PostgreSQL database was created. To analyse and draw conclusions from the data, a data mining tool was used, namely Orange. Tests were conducted in order to assess the correlation between data sources and find anomalies. Data correlation has never been so important, and the aim of this dissertation is to show that there is a simple and effective way to get answers from great amounts of data.
APA, Harvard, Vancouver, ISO, and other styles
48

Chen, Kai-Wei, and 陳凱威. "Data Visualization Applied for Anomaly Detection in Intrusion Detection Systems." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/bf3996.

Full text
Abstract:
Master's thesis
National Taiwan University
Graduate Institute of Communication Engineering
Academic year 106 (2017)
An intrusion detection system (IDS) is a device or software application that detects attacks by means of features extracted from network traffic, packets, security logs, etc., in order to monitor malicious activities or policy violations. IDSs fall into two categories: signature-based and anomaly-based. A signature-based IDS extracts features from past anomalous behaviors to build a database for further analysis and detection. An anomaly-based IDS builds a model of malicious behavior from the relationship between the features and labels of a dataset using machine learning algorithms, to identify whether the content is anomalous or not. An anomaly-based IDS can detect unknown behavior, but its accuracy and false positive rate are worse than those of a signature-based IDS. In this paper, we combine the concepts of data visualization and Convolutional Neural Networks to build a model for an anomaly-based IDS, transforming the dataset into images with a data visualization algorithm to train the convolutional neural network model. The detection accuracy for the NSL-KDD TEST+ dataset, which contains unknown attacks, can reach 81.84%. The minimum false positive rate of the models can be reduced to 17.83%, and the hardware computation requirements of the training and testing procedures are compared with those of the well-known EM clustering method. Finally, beyond the information security field, other research fields could apply this method as long as the contents of the dataset are complete enough, which demonstrates its versatility and potential for future development.
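The visualization step can be illustrated with a small sketch: pad each record's feature vector to a square length and reshape it into a grayscale image for the CNN. An 8x8 grid is one plausible layout for NSL-KDD's 41 features; the thesis's actual visualization algorithm may differ, and the scaling rule below is an assumption.

```python
import numpy as np

def features_to_image(x, side=8):
    x = np.asarray(x, dtype=float)
    x = (x - x.min()) / (x.max() - x.min() + 1e-9)   # scale to [0, 1]
    padded = np.zeros(side * side)                   # zero-pad to a square
    padded[:x.size] = x
    return (padded.reshape(side, side) * 255).astype(np.uint8)

flow = np.random.default_rng(6).uniform(0, 100, 41)  # one NSL-KDD-like record
img = features_to_image(flow)
print(img.shape)   # (8, 8) grayscale image, ready as a CNN input
```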
APA, Harvard, Vancouver, ISO, and other styles
49

Jayannavar, Prashant A. "Community Detection in Networks." Thesis, 2013. http://ethesis.nitrkl.ac.in/4755/1/109CS0148.pdf.

Full text
Abstract:
An important property of networks/graphs modeling complex systems is the property of community structure, in which nodes are joined together in tightly knit groups (communities or clusters), between which there are only looser connections. The problem of detecting and extracting communities from such graphs has been the subject of intense investigation in recent years. This problem is very hard and not yet satisfactorily solved. In this project we explore and work on this community detection problem. We frame it as an optimization problem and hence explore the use of Genetic Algorithms (GAs) in solving it. We have studied, analyzed and implemented several existing algorithms, including standard ones and GA-based ones. The standard algorithms include the Girvan-Newman algorithm and the label propagation algorithm by Raghavan et al., while the GA-based one is Tasgin et al.'s algorithm. We have also designed a new GA-based algorithm for the problem. We present a comparative performance (accuracy and efficiency) analysis of these algorithms (new and existing) to gain insights into the problem and reveal the advantages of our proposed algorithm over existing ones. We have also created some artificial datasets (based on standard existing algorithms like the one for LFR graphs) for the purpose of the analysis, and have acquired some real-world datasets (like Zachary's karate club network, Lusseau's network of bottlenose dolphins, etc.) too.
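Two of the standard algorithms studied here ship with networkx, so a minimal run on Zachary's karate club network (one of the real-world datasets mentioned) might look like the following sketch; the GA-based algorithms are not reproduced.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, label_propagation_communities

G = nx.karate_club_graph()

# Girvan-Newman: repeatedly remove the highest-betweenness edge;
# take the first split into two communities.
first_split = next(girvan_newman(G))
print([sorted(c) for c in first_split])

# Label propagation: each node iteratively adopts its neighbours' majority label.
communities = list(label_propagation_communities(G))
print(len(communities), "communities found")
```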
APA, Harvard, Vancouver, ISO, and other styles
50

Ma, Li-Yu, and 馬莉芋. "Data Mining For Network Intrusion Detection." Thesis, 2005. http://ndltd.ncl.edu.tw/handle/a98a4f.

Full text
Abstract:
Master's thesis
Ming Chuan University
Master's Program, Department of Information Management
Academic year 93 (2004)
According to a CERT survey, the rate of cyber attacks has been more than doubling every year in recent times, so it has become increasingly important to keep our information systems safe. The huge and variable volume of network traffic cannot be assessed by humans alone, and hackers can always find new ways to attack. How can we determine whether a connection is an attack or not, and stop it before damage is done? We use the DARPA data set and data mining technology to train models of various kinds of attacks. The testing data set includes new attack behaviors that never appear in the training data set, to ensure that our network intrusion model remains effective in keeping corporate information systems safe.
APA, Harvard, Vancouver, ISO, and other styles