Dissertations / Theses: 'Hidden Data Mining'

1

Liu, Tantan. "Data Mining over Hidden Data Sources." The Ohio State University, 2012. http://rave.ohiolink.edu/etdc/view?acc_num=osu1343313341.

2

Dharmavaram, Sirisha. "Mining Biomedical Data for Hidden Relationship Discovery." Thesis, University of North Texas, 2019. https://digital.library.unt.edu/ark:/67531/metadc1538709/.

Abstract:

With an ever-growing number of publications in the biomedical domain, it becomes likely that important implicit connections between individual concepts of biomedical knowledge are overlooked. Literature-based discovery (LBD) has been used for many years to identify plausible associations between previously unrelated concepts. In this paper, we present a new, completely automatic and interactive system that creates a graph-based knowledge base to capture multifaceted complex associations among biomedical concepts. For a given pair of input concepts, our system auto-generates a list of ranked subgraphs uncovering possible previously unnoticed associations based on context information. To rank these subgraphs, we implement a novel ranking method using the context information obtained by performing random walks on the graph. In addition, we enhance the system by training a neural network classifier to output the likelihood that the two concepts are related, which provides better insights to the end user.

3

Liu, Zhenjiao. "Incomplete multi-view data clustering with hidden data mining and fusion techniques." Electronic Thesis or Diss., Institut polytechnique de Paris, 2023. http://www.theses.fr/2023IPPAS011.

Abstract:

Le regroupement de données multivues incomplètes est un axe de recherche majeur dans le domaines de l'exploration de données et de l'apprentissage automatique. Dans les applications pratiques, nous sommes souvent confrontés à des situations où seule une partie des données modales peut être obtenue ou lorsqu'il y a des valeurs manquantes. La fusion de données est une méthode clef pour l'exploration d'informations multivues incomplètes. Résoudre le problème de l'extraction d'informations multivues incomplètes de manière ciblée, parvenir à une collaboration flexible entre les vues visibles et les vues cachées partagées, et améliorer la robustesse sont des défis. Cette thèse se concentre sur trois aspects : l'exploration de données cachées, la fusion collaborative et l'amélioration de la robustesse du regroupement. Les principales contributions sont les suivantes:1) Exploration de données cachées pour les données multi-vues incomplètes : les algorithmes existants ne peuvent pas utiliser pleinement l'observation des informations dans et entre les vues, ce qui entraîne la perte d'une grande quantité d'informations. Nous proposons donc un nouveau modèle de regroupement multi-vues incomplet IMC-NLT (Incomplete Multi-view Clustering Based on NMF and Low-Rank Tensor Fusion) basé sur la factorisation de matrices non négatives et la fusion de tenseurs de faible rang. IMC-NLT utilise d'abord un tenseur de faible rang pour conserver les caractéristiques des vues avec une dimension unifiée. En utilisant une mesure de cohérence, IMC-NLT capture une représentation cohérente à travers plusieurs vues. Enfin, IMC-NLT intègre plusieurs apprentissages dans un modèle unifié afin que les informations cachées puissent être extraites efficacement à partir de vues incomplètes. 
Des expériences sur cinq jeux de données ont validé les performances d'IMC-NLT.2) Fusion collaborative pour les données multivues incomplètes : notre approche pour résoudre ce problème est le regroupement multivues incomplet par représentation à faible rang. L'algorithme est basé sur une représentation éparse de faible rang et une représentation de sous-espace, dans laquelle les données manquantes sont complétées en utilisant les données d'une modalité et les données connexes d'autres modalités. Pour améliorer la stabilité des résultats de clustering pour des données multi-vues avec différents degrés de manquants, CCIM-SLR utilise le modèle Γ-norm, qui est une méthode de représentation à faible rang ajustable. CCIM-SLR peut alterner entre l'apprentissage de la vue cachée partagée, la vue visible et les partitions de clusters au sein d'un cadre d'apprentissage collaboratif. Un algorithme itératif avec convergence garantie est utilisé pour optimiser la fonction objective proposée.3) Amélioration de la robustesse du regroupement pour les données multivues incomplètes : nous proposons une fusion de la convolution graphique et des goulots d'étranglement de l'information (apprentissage de la représentation multivues incomplète via le goulot d'étranglement de l'information). Nous introduisons la théorie du goulot d'étranglement de l'information afin de filtrer les données parasites contenant des détails non pertinents et de ne conserver que les éléments les plus pertinents. Nous intégrons les informations sur la structure du graphe basées sur les points d'ancrage dans les informations sur le graphe local. Le modèle intègre des représentations multiples à l'aide de goulets d'étranglement de l'information, réduisant ainsi l'impact des informations redondantes dans les données. Des expériences approfondies sont menées sur plusieurs ensembles de données du monde réel, et les résultats démontrent la supériorité de IMRL-AGI. 
Plus précisément, IMRL-AGI montre des améliorations significatives dans la précision du clustering et de la classification, même en présence de taux élevés de données manquantes par vue (par exemple, 10,23 % et 24,1% respectivement sur l'ensemble de données ORL)
Incomplete multi-view data clustering is an active research direction in data mining and machine learning. In practical applications, we often face situations where only part of the modal data can be obtained or where values are missing. Data fusion is an important method for incomplete multi-view information mining. Solving incomplete multi-view information mining in a targeted manner, achieving flexible collaboration between visible views and shared hidden views, and improving robustness remain challenging. This thesis focuses on three aspects: hidden data mining, collaborative fusion, and enhancing the robustness of clustering. The main contributions are as follows:
1. Hidden data mining for incomplete multi-view data: existing algorithms cannot make full use of the information observed within and between views, which results in the loss of a large amount of valuable information. We therefore propose a new incomplete multi-view clustering model, IMC-NLT (Incomplete Multi-view Clustering Based on NMF and Low-Rank Tensor Fusion), based on non-negative matrix factorization and low-rank tensor fusion. IMC-NLT first uses a low-rank tensor to retain view features with a unified dimension. Using a consistency measure, IMC-NLT captures a consistent representation across multiple views. Finally, IMC-NLT incorporates multiple learning processes into a unified model so that hidden information can be extracted effectively from incomplete views. We conducted comprehensive experiments on five real-world datasets to validate the performance of IMC-NLT. The overall experimental results demonstrate that IMC-NLT performs better than several baseline methods, yielding stable and promising results.
2. Collaborative fusion for incomplete multi-view data: our approach to this issue is Incomplete Multi-view Co-Clustering by Sparse Low-Rank Representation (CCIM-SLR). The algorithm is based on sparse low-rank representation and subspace representation, in which jointly missing data is filled using data within a modality and related data from other modalities. To improve the stability of clustering results for multi-view data with different degrees of missingness, CCIM-SLR uses the Γ-norm model, which is an adjustable low-rank representation method. CCIM-SLR can alternate between learning the shared hidden view, the visible view, and the cluster partitions within a co-learning framework. An iterative algorithm with guaranteed convergence is used to optimize the proposed objective function. Compared with other baseline models, CCIM-SLR achieved the best performance in comprehensive experiments on the five benchmark datasets, particularly on those with varying degrees of incompleteness.
3. Enhancing the clustering robustness for incomplete multi-view data: we propose a fusion of graph convolution and information bottlenecks (Incomplete Multi-view Representation Learning Through Anchor Graph-based GCN and Information Bottleneck, IMRL-AGI). First, we introduce information bottleneck theory to filter out noisy data with irrelevant details and retain only the most relevant features. Next, we integrate the anchor-based graph structure information into the local graph information, and fuse it into the shared information representation and into the representation learning process of each specific view, which balances and improves the robustness of the learned features. Finally, the model integrates multiple representations with the help of information bottlenecks, reducing the impact of redundant information in the data. Extensive experiments are conducted on several real-world datasets, and the results demonstrate the superiority of IMRL-AGI. Specifically, IMRL-AGI shows significant improvements in clustering and classification accuracy, even in the presence of high view missing rates (e.g. 10.23% and 24.1% respectively on the ORL dataset).

4

Peng, Yingli. "Improvement of Data Mining Methods on Falling Detection and Daily Activities Recognition." Thesis, Mittuniversitetet, Avdelningen för informations- och kommunikationssystem, 2015. http://urn.kb.se/resolve?urn=urn:nbn:se:miun:diva-25521.

Abstract:

With the growing phenomenon of an aging population, an increasing number of older people are living alone for domestic and social reasons. As a result, falling accidents have become one of the most important threats to the lives of the elderly. Therefore, it is necessary to set up an application to detect the daily activities of the elderly. However, falls are difficult to recognize because the "falling" motion is instantaneous and easy to confuse with other motions. In this thesis, three data mining methods were applied to wearable sensor values, which form a continuous data set covering eleven activities of daily living, and the different results were then analyzed. Not only could falls be detected, but other activities could also be classified. In detail, three methods, namely Back Propagation Neural Network, Support Vector Machine and Hidden Markov Model, are applied separately to train on the data set. The highlight of the project is a new idea whose aim is to design a methodology for accurate classification in time-series data sets. The proposed approach, which consists of a classifier-construction part and an application part, allows the generalization of classification. The preliminary results indicate that the new method achieves high classification accuracy and performs significantly better than the other data mining methods in this experiment.

5

Yang, Yimin. "Exploring Hidden Coherent Feature Groups and Temporal Semantics for Multimedia Big Data Analysis." FIU Digital Commons, 2015. http://digitalcommons.fiu.edu/etd/2254.

Abstract:

Thanks to advanced technologies and social networks that allow data to be widely shared across the Internet, there is an explosion of pervasive multimedia data, generating high demand for multimedia services and applications in various areas so that people can easily access and manage multimedia data. To meet such demands, multimedia big data analysis has become an emerging hot topic in both industry and academia, ranging from basic infrastructure, management, search, and mining to security, privacy, and applications. Within the scope of this dissertation, a multimedia big data analysis framework is proposed for semantic information management and retrieval with a focus on rare event detection in videos. The proposed framework is able to explore hidden semantic feature groups in multimedia data and incorporate temporal semantics, especially for video event detection. First, a hierarchical semantic data representation is presented to alleviate the semantic gap issue, and the Hidden Coherent Feature Group (HCFG) analysis method is proposed to capture the correlation between features and separate the original feature set into semantic groups, seamlessly integrating multimedia data in multiple modalities. Next, an Importance Factor based Temporal Multiple Correspondence Analysis (IF-TMCA) approach is presented for effective event detection. Specifically, the HCFG algorithm is integrated with the Hierarchical Information Gain Analysis (HIGA) method to generate the Importance Factor (IF) for producing the initial detection results. Then, the TMCA algorithm is proposed to efficiently incorporate temporal semantics for re-ranking and improving the final performance. Finally, a sampling-based ensemble learning mechanism is applied to further accommodate the imbalanced datasets. In addition to the multimedia semantic representation and class imbalance problems, lack of organization is another critical issue for multimedia big data analysis. 
In this framework, an affinity propagation-based summarization method is also proposed to transform the unorganized data into a better structure with clean and well-organized information. The whole framework has been thoroughly evaluated across multiple domains, such as soccer goal event detection and disaster information management.

6

Sajeva, Lisa. "Predizione del tempo rimanente di vita di un impianto mediante Hidden Markow Model." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2017. http://amslaurea.unibo.it/13846/.

Abstract:

In this thesis we investigate the main methods used in the literature for the automation of condition-based maintenance and then present a practical application concerning a bearing system. Specifically, we first analyze the raw vibration signal by decomposing it with a wavelet packet transform; then we select the best level and index in terms of characteristics. To create a failure model we use the Hidden Markov Model method. Finally, we compare the generated model with those for other decomposition levels and indices to demonstrate that our choice was the best.

7

Vitali, Federico. "Map-Matching su Piattaforma BigData." Master's thesis, Alma Mater Studiorum - Università di Bologna, 2019. http://amslaurea.unibo.it/18089/.

Abstract:

In the analysis of movement data aimed at extracting useful information, map matching projects the GPS points generated by moving objects onto road segments so as to represent the current position of the objects. So far, map matching has been used in areas such as traffic analysis, frequent route extraction, and object position prediction, and it is also an important pre-processing step in the overall trajectory mining process. Unfortunately, state-of-the-art implementations of map-matching algorithms are all either sequential or inefficient. This thesis therefore proposes an algorithm based on a sequential algorithm known for its accuracy and efficiency, which is completely reformulated in a distributed manner so as to also achieve high scalability when used with big data. In addition, the robustness of the algorithm, which is based on a first-order Hidden Markov Model, is improved by introducing a strategy to handle the information gaps that can arise between the assigned road segments. This problem can occur when GPS points are sampled at a variable rate in urban areas with highly fragmented road segments. The implementation is based on Apache Spark and was tested on a dataset of more than 7.8 million GPS points in the city of Milan.
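To illustrate how a first-order Hidden Markov Model can assign GPS points to road segments, the following is a minimal sequential sketch in Python. It is not the distributed Apache Spark algorithm of the thesis. The segment names (`r1`, `r2`), the discretized observations (`near_r1`, `near_r2`), and all probabilities are hypothetical values chosen for illustration; a real map matcher would derive emission probabilities from GPS-to-segment distances and transition probabilities from the road network.

```python
import math

# Hypothetical toy model: two road segments and two discretized GPS readings.
STATES = ["r1", "r2"]
START_P = {"r1": 0.5, "r2": 0.5}
TRANS_P = {"r1": {"r1": 0.8, "r2": 0.2}, "r2": {"r1": 0.2, "r2": 0.8}}
EMIT_P = {
    "r1": {"near_r1": 0.9, "near_r2": 0.1},
    "r2": {"near_r1": 0.1, "near_r2": 0.9},
}


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence (Viterbi decoding, log space).

    All probabilities are assumed to be strictly positive.
    """
    # Log-score of the best path ending in each state after the first observation.
    scores = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }
    back_pointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor for state s at this step.
            best_prev = max(
                states, key=lambda p: scores[p] + math.log(trans_p[p][s])
            )
            new_scores[s] = (
                scores[best_prev]
                + math.log(trans_p[best_prev][s])
                + math.log(emit_p[s][obs])
            )
            pointers[s] = best_prev
        scores = new_scores
        back_pointers.append(pointers)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(back_pointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


matched = viterbi(
    ["near_r1", "near_r1", "near_r2", "near_r2"], STATES, START_P, TRANS_P, EMIT_P
)
print(matched)
```

With these toy numbers, a single outlying reading in the middle of a run is absorbed by the transition probabilities, which is the smoothing behavior that makes HMMs attractive for noisy GPS data.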

8

Eng, Catherine. "Développement de méthodes de fouille de données basées sur les modèles de Markov cachés du second ordre pour l'identification d'hétérogénéités dans les génomes bactériens." Thesis, Nancy 1, 2010. http://www.theses.fr/2010NAN10041/document.

Abstract:

Les modèles de Markov d’ordre 2 (HMM2) sont des modèles stochastiques qui ont démontré leur efficacité dans l’exploration de séquences génomiques. Cette thèse explore l’intérêt de modèles de différents types (M1M2, M2M2, M2M0) ainsi que leur couplage à des méthodes combinatoires pour segmenter les génomes bactériens sans connaissances a priori du contenu génétique. Ces approches ont été appliquées à deux modèles bactériens afin d’en valider la robustesse : Streptomyces coelicolor et Streptococcus thermophilus. Ces espèces bactériennes présentent des caractéristiques génomiques très distinctes (composition, taille du génome) en lien avec leur écosystème spécifique : le sol pour les S. coelicolor et le milieu lait pour S. thermophilus
Second-order Hidden Markov Models (HMM2) are stochastic processes that are highly efficient at exploring bacterial genome sequences. Different types of HMM2 (M1M2, M2M2, M2M0), combined with combinatorial methods, were developed in a new approach to discriminate genomic regions without a priori knowledge of their genetic content. This approach was applied to two bacterial models in order to validate its robustness: Streptomyces coelicolor and Streptococcus thermophilus. These bacterial species exhibit distinct genomic traits (base composition, global genome size) in relation to their ecological niche: soil for S. coelicolor and dairy products for S. thermophilus. In S. coelicolor, a first HMM2 architecture allowed the detection of short discrete DNA heterogeneities (5-16 nucleotides in size), mostly localized in intergenic regions. Applying the method to a biologically known gene set, the SigR regulon (involved in oxidative stress response), proved its efficiency in identifying bacterial promoters. S. coelicolor shows a complex regulatory network (up to 12% of the genes may be involved in gene regulation) with more than 60 sigma factors involved in initiation of transcription. A classification method coupled to a search algorithm (i.e. R’MES) was developed to automatically extract the box1-spacer-box2 composite DNA motifs, a structure corresponding to the typical bacterial promoter -35/-10 boxes. Among the 814 DNA motifs described for the whole S. coelicolor genome, those of sigma factors (B, WhiG) could be retrieved from the crude data. We showed that this method could be generalized by applying it successfully, in a preliminary attempt, to the genome of Bacillus subtilis.

9

陳迪祥. "A Data Mining Approach to Eliciting Hidden Relationships from Disease Data." Thesis, 2003. http://ndltd.ncl.edu.tw/handle/33856707588342488454.

Abstract:

Master's degree
National Chi Nan University
Department of Information Management
91
Data mining can uncover non-obvious or hidden information in data, which is what hospital managers need for their rich data. Hospital databases hold many kinds of data, such as records of emergency treatment, outpatient services, patient examinations, and medication. These data are helpful for exploring medical knowledge with data mining technology. This paper describes a data mining system that processes the standard health insurance files defined by the Bureau of National Health Insurance. The system uses an FP-Tree for good mining performance. A distributed, caching architecture has been implemented in the system to balance the mining load, so users can obtain mining results from the system quickly. The system elicits hidden relationships among diseases from those health insurance files. Our frequent patterns also include the conditional probabilities that certain diseases may occur if the patient has some other disease. Doctors and researchers operate the system through a browser. The mining results discovered by the system will help doctors and researchers with medical research. Keywords: Data mining, Health Insurance, Medicine, Distributed Architecture
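The conditional probabilities mentioned above can be illustrated with a short Python sketch. It uses plain co-occurrence counting rather than an FP-Tree, so it shows only what is being estimated, not how the described system computes it. The function name `disease_rules`, the `min_support` parameter, and the sample records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def disease_rules(records, min_support=2):
    """Estimate P(b | a) for disease pairs that co-occur often enough.

    records:     iterable of per-patient disease lists
    min_support: minimum number of patients having both diseases
    Returns a dict mapping (a, b) to the estimated probability of b given a.
    """
    single = Counter()
    pair = Counter()
    for diseases in records:
        unique = sorted(set(diseases))  # ignore repeats within one patient
        single.update(unique)
        pair.update(combinations(unique, 2))

    rules = {}
    for (a, b), n_ab in pair.items():
        if n_ab < min_support:
            continue  # infrequent pair: skip
        rules[(a, b)] = n_ab / single[a]  # P(b | a)
        rules[(b, a)] = n_ab / single[b]  # P(a | b)
    return rules


# Hypothetical patient records, for illustration only.
sample_records = [
    ["diabetes", "hypertension"],
    ["diabetes", "hypertension", "gout"],
    ["diabetes"],
    ["diabetes"],
    ["hypertension"],
]
for (a, b), p in sorted(disease_rules(sample_records).items()):
    print(f"P({b} | {a}) = {p:.2f}")
```

An FP-Tree reaches the same frequent itemsets without enumerating every pair per record, which matters once patterns longer than two items are mined over large insurance files.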

10

Yu, Zhun. "Mining Hidden Knowledge from Measured Data for Improving Building Energy Performance." Thesis, 2012. http://spectrum.library.concordia.ca/973713/1/Yu_PhD_S2012.pdf.

Abstract:

Nowadays, building automation and energy management systems provide an opportunity to collect vast amounts of building-related data (e.g., climatic data, building operational data, etc.). The data can provide abundant useful knowledge about the interactions between building energy consumption and its influencing factors. Such interactions play a crucial role in developing and implementing control strategies to improve building energy performance. However, the data is rarely analyzed and this useful knowledge is seldom extracted due to a lack of effective data analysis techniques. In this research, data mining (classification analysis, cluster analysis, and association rule mining) is proposed to extract hidden useful knowledge from building-related data. Moreover, a data analysis process and a data mining framework are proposed, enabling building-related data to be analyzed more efficiently. The applications of the process and framework to two sets of collected data demonstrate their applicability. Based on the process and framework, four data analysis methodologies were developed and applied to the collected data. Classification analysis was applied to develop a methodology for establishing building energy demand predictive models. To demonstrate its applicability, the methodology was applied to estimate residential building energy performance indexes by modeling building energy use intensity (EUI) levels (either high or low). The results demonstrate that the methodology can classify and predict the building energy demand levels with an accuracy of 93% for training data and 92% for test data, and identify and rank significant factors of building EUI automatically. Cluster analysis was used to develop a methodology for examining the influences of occupant behavior on building energy consumption. 
The results show that the methodology facilitates the evaluation of building energy-saving potential by improving the behavior of building occupants, and provides multifaceted insights into building energy end-use patterns associated with the occupant behavior. Association rule mining was employed to develop a methodology for examining all associations and correlations between building operational data, thereby discovering useful knowledge about energy conservation. The results show there are possibilities for saving energy by modifying the operation of mechanical ventilation systems and by repairing equipment. Cluster analysis, classification analysis, and association rule mining were combined to formulate a methodology for identifying and improving occupant behavior in buildings. The results show that the methodology was able to identify the behavior which needs to be modified, and provide occupants with feasible recommendations so that they can make required decisions to modify their behavior.

11

"Spatio-Temporal Data Mining to Detect Changes and Clusters in Trajectories." Doctoral diss., 2012. http://hdl.handle.net/2286/R.I.15907.

Abstract:

With the rapid development of mobile sensing technologies such as GPS, RFID, and smartphone sensors, capturing position data in the form of trajectories has become easy. Moving object trajectory analysis is a growing area of interest owing to its applications in domains such as marketing, security, and traffic monitoring and management. To better understand movement behaviors from the raw mobility data, this doctoral work provides analytic models for analyzing trajectory data. As a first contribution, a model is developed to detect changes in trajectories over time. If the taxis moving in a city are viewed as sensors that provide real-time information about the traffic in the city, a change in these trajectories over time can reveal that the road network has changed. To detect changes, trajectories are modeled with a Hidden Markov Model (HMM). A modified training algorithm for parameter estimation in HMMs, called m-BaumWelch, is used to develop likelihood estimates under assumed changes, which are then used to detect changes in trajectory data over time. Data from vehicles are used to test the method for change detection. Secondly, sequential pattern mining is used to develop a model to detect changes in frequent patterns occurring in trajectory data. The aim is to answer two questions: Are the frequent patterns still frequent in the new data? If they are frequent, has the time interval distribution in the pattern changed? Two different approaches are considered for change detection: a frequency-based approach and a distribution-based approach. The methods are illustrated with vehicle trajectory data. Finally, a model is developed for clustering and outlier detection in semantic trajectories. A challenge with clustering semantic trajectories is that both numeric and categorical attributes are present. Another problem to be addressed while clustering is that trajectories can be of different lengths and can also have missing values. 
A tree-based ensemble is used to address these problems. The approach is extended to outlier detection in semantic trajectories.
Dissertation/Thesis
Ph.D. Industrial Engineering 2012

12

Shao, Qun, Raymond C. Rowe, and Peter York. "Data mining of fractured experimental data using neurofuzzy logic-discovering and integrating knowledge hidden in multiple formulation databases for a fluid-bed granulation process." 2008. http://hdl.handle.net/10454/3439.

Abstract:

In the pharmaceutical field, current practice in gaining process understanding by data analysis or knowledge discovery has generally focused on dealing with single experimental databases. This limits the level of knowledge extracted in the situation where data from a number of sources, so called fractured data, contain interrelated information. This situation is particularly relevant for complex processes involving a number of operating variables, such as a fluid-bed granulation. This study investigated three data mining strategies to discover and integrate knowledge "hidden" in a number of small experimental databases for a fluid-bed granulation process using neurofuzzy logic technology. Results showed that more comprehensive domain knowledge was discovered from multiple databases via an appropriate data mining strategy. This study also demonstrated that the textual information excluded in individual databases was a critical parameter and often acted as the precondition for integrating knowledge extracted from different databases. Consequently generic knowledge of the domain was discovered, leading to an improved understanding of the granulation process.

13

Laxman, Srivatsan. "Discovering Frequent Episodes : Fast Algorithms, Connections With HMMs And Generalizations." Thesis, 2006. https://etd.iisc.ac.in/handle/2005/375.

Abstract:

Temporal data mining is concerned with the exploration of large sequential (or temporally ordered) data sets to discover some nontrivial information that was previously unknown to the data owner. Sequential data sets come up naturally in a wide range of application domains, ranging from bioinformatics to manufacturing processes. Pattern discovery refers to a broad class of data mining techniques in which the objective is to unearth hidden patterns or unexpected trends in the data. In general, pattern discovery is about finding all patterns of 'interest' in the data and one popular measure of interestingness for a pattern is its frequency in the data. The problem of frequent pattern discovery is to find all patterns in the data whose frequency exceeds some user-defined threshold. Discovery of temporal patterns that occur frequently in sequential data has received a lot of attention in recent times. Different approaches consider different classes of temporal patterns and propose different algorithms for their efficient discovery from the data. This thesis is concerned with a specific class of temporal patterns called episodes and their discovery in large sequential data sets. In the framework of frequent episode discovery, data (referred to as an event sequence or an event stream) is available as a single long sequence of events. The ith event in the sequence is an ordered pair, (E_i, t_i), where E_i takes values from a finite alphabet (of event types), and t_i is the time of occurrence of the event. The events in the sequence are ordered according to these times of occurrence. An episode (which is the temporal pattern considered in this framework) is a (typically) short partially ordered sequence of event types. Formally, an episode is a triple, (V, <, g), where V is a collection of nodes, < is a partial order on V and g is a map that assigns an event type to each node of the episode. 
When < is total, the episode is referred to as a serial episode, and when < is trivial (or empty), the episode is referred to as a parallel episode. An episode is said to occur in an event sequence if there are events in the sequence, with event types same as those constituting the episode, and with times of occurrence respecting the partial order in the episode. The frequency of an episode is some measure of how often it occurs in the event sequence. Given a frequency definition for episodes, the task is to discover all episodes whose frequencies exceed some threshold. This is done using a level-wise procedure. In each level, a candidate generation step is used to combine frequent episodes from the previous level to build candidates of the next larger size, and then a frequency counting step makes one pass over the event stream to determine frequencies of all the candidates and thus identify the frequent episodes. Frequency counting is the main computationally intensive step in frequent episode discovery. Choice of frequency definition for episodes has a direct bearing on the efficiency of the counting procedure. In the original framework of frequent episode discovery, episode frequency is defined as the number of fixed-width sliding windows over the data in which the episode occurs at least once. Under this frequency definition, frequency counting of a set of |C| candidate serial episodes of size N has space complexity O(N|C|) and time complexity O(ΔTN|C|) (where ΔT is the difference between the times of occurrence of the last and the first event in the data stream). The other main frequency definition available in the literature, defines episode frequency as the number of minimal occurrences of the episode (where, a minimal occurrence is a window on the time axis containing an occurrence of the episode, such that, no proper sub-window of it contains another occurrence of the episode). 
The algorithm for obtaining frequencies for a set of |C| episodes needs O(n|C|) time (where n denotes the number of events in the data stream). While this is time-wise better than the windows-based algorithm, the space needed to locate minimal occurrences of an episode can be very high (and is in fact of the order of length, n, of the event stream). This thesis proposes a new definition for episode frequency, based on the notion of, what is called, non-overlapped occurrences of episodes in the event stream. Two occurrences are said to be non-overlapped if no event corresponding to one occurrence appears in between events corresponding to the other. Frequency of an episode is defined as the maximum possible number of non-overlapped occurrences of the episode in the data. The thesis also presents algorithms for efficient frequent episode discovery under this frequency definition. The space and time complexities for frequency counting of serial episodes are O(|C|) and O(n|C|) respectively (where n denotes the total number of events in the given event sequence and |C| denotes the number of candidate episodes). These are arguably the best possible space and time complexities for the frequency counting step that can be achieved. Also, the fact that the time needed by the non-overlapped occurrences-based algorithm is linear in the number of events, n, in the event sequence (rather than the difference, ΔT, between occurrence times of the first and last events in the data stream, as is the case with the windows-based algorithm), can result in considerable time advantage when the number of time ticks far exceeds the number of events in the event stream. The thesis also presents efficient algorithms for frequent episode discovery under expiry time constraints (according to which, an occurrence of an episode can be counted for its frequency only if the total time span of the occurrence is less than a user-defined threshold).
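The two counting schemes contrasted above can be sketched for serial episodes as follows. This is a minimal illustration, not the thesis's algorithms: the function names are hypothetical, events are assumed to be `(event_type, time)` pairs with strictly increasing integer time stamps, the windows-based counter is deliberately naive, and expiry time constraints are not handled.

```python
def occurs_in_order(types, episode):
    """True if `episode` appears in `types` as an ordered subsequence."""
    remaining = iter(types)
    return all(any(t == node for t in remaining) for node in episode)


def window_frequency(events, episode, width):
    """Naive windows-based count: windows [s, s + width) holding an occurrence.

    Assumes integer, strictly increasing time stamps; one window per start tick,
    so the loop length grows with the time span rather than the event count.
    """
    if not events:
        return 0
    first, last = events[0][1], events[-1][1]
    count = 0
    for start in range(first - width + 1, last + 1):
        inside = [e for e, t in events if start <= t < start + width]
        count += occurs_in_order(inside, episode)
    return count


def non_overlapped_frequency(events, episodes):
    """Greedy single-pass count of non-overlapped occurrences per serial episode.

    Keeps one integer of state per candidate (the index of the node awaited),
    and makes one pass over the events.
    """
    awaited = {ep: 0 for ep in episodes}
    freq = {ep: 0 for ep in episodes}
    for event_type, _time in events:
        for ep in episodes:
            if event_type == ep[awaited[ep]]:
                awaited[ep] += 1
                if awaited[ep] == len(ep):  # occurrence complete
                    freq[ep] += 1
                    awaited[ep] = 0  # the next occurrence must start afterwards
    return freq


# Illustrative data only (not from the thesis).
events = [("A", 1), ("B", 2), ("A", 4), ("C", 5), ("B", 6), ("C", 8)]
candidates = [("A", "B", "C"), ("A", "B")]
print(non_overlapped_frequency(events, candidates))
print(window_frequency(events, ("A", "B", "C"), width=5))
```

The non-overlapped counter touches each event once per candidate and stores a single index per candidate, which is the intuition behind the O(n|C|) time and O(|C|) space figures quoted in the abstract.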
It is shown through simulation experiments that, in terms of actual run-times, frequent episode discovery under the non-overlapped occurrences-based frequency (using the algorithms developed here) is much faster than existing methods. There is also a second frequency measure that is proposed in this thesis, which is based on, what is termed as, non-interleaved occurrences of episodes in the data. This definition counts certain kinds of overlapping occurrences of the episode. The time needed is linear in the number of events, n, in the data sequence, the size, N, of episodes and the number of candidates, |C|. Simulation experiments show that run-time performance under this frequency definition is slightly inferior compared to the non-overlapped occurrences-based frequency, but is still better than the run-times under the windows-based frequency. This thesis also establishes the following interesting property that connects the non-overlapped, the non-interleaved and the minimal occurrences-based frequencies of an episode in the data: the number of minimal occurrences of an episode is bounded below by the maximum number of non-overlapped occurrences of the episode, and is bounded above by the maximum number of non-interleaved occurrences of the episode in the data. Hence, non-interleaved occurrences-based frequency is an efficient alternative to that based on minimal occurrences. In addition to being superior in terms of both time and space complexities compared to all other existing algorithms for frequent episode discovery, the non-overlapped occurrences-based frequency has another very important property. It facilitates a formal connection between discovering frequent serial episodes in data streams and learning or estimating a model for the data generation process in terms of certain kinds of Hidden Markov Models (HMMs). In order to establish this connection, a special class of HMMs, called Episode Generating HMMs (EGHs) are defined. 
The symbol set for the HMM is chosen to be the alphabet of event types, so that, the output of EGHs can be regarded as event streams in the frequent episode discovery framework. Given a serial episode, α, that occurs in the event stream, a method is proposed to uniquely associate it with an EGH, Λα. Consider two N-node serial episodes, α and β, whose (non-overlapped occurrences-based) frequencies in the given event stream, o, are fα and fβ respectively. Let Λα and Λβ be the EGHs associated with α and β. The main result connecting episodes and EGHs states that, the joint probability of o and the most likely state sequence for Λα is more than the corresponding probability for Λβ, if and only if, fα is greater than fβ. This theoretical connection has some interesting consequences. First of all, since the most frequent serial episode is associated with the EGH having the highest data likelihood, frequent episode discovery can now be interpreted as a generative model learning exercise. More importantly, it is now possible to derive a formal test of significance for serial episodes in the data, that prescribes, for a given size of the test, a minimum frequency for the episode needed in order to declare it as statistically significant. Note that this significance test for serial episodes does not require any separate model estimation (or training). The only quantity required to assess significance of an episode is its non-overlapped occurrences-based frequency (and this is obtained through the usual counting procedure). The significance test also helps to automatically fix the frequency threshold for the frequent episode discovery process, so that it can lead to what may be termed parameterless data mining. In the framework considered so far, the input to frequent episode discovery process is a sequence of instantaneous events. 
However, in many applications events tend to persist for different periods of time and the durations may carry important information from a data mining perspective. This thesis extends the framework of frequent episodes to incorporate such duration information directly into the definition of episodes, so that, the patterns discovered will now carry this duration information as well. Each event in this generalized framework looks like a triple, (Ei, ti, τi), where Ei, as earlier, is the event type (from some finite alphabet) corresponding to the ith event, and ti and τi denote the start and end times of this event. The new temporal pattern, called the generalized episode, is a quadruple, (V, <, g, d), where V, < and g, as earlier, respectively denote a collection of nodes, a partial order over this collection and a map assigning event types to nodes. The new feature in the generalized episode is d, which is a map from V to 2^I, where, I denotes a collection of time interval possibilities for event durations, which is defined by the user. An occurrence of a generalized episode in the event sequence consists of events with both 'correct' event types and 'correct' time durations, appearing in the event sequence in 'correct' time order. All frequency definitions for episodes over instantaneous event streams are applicable for generalized episodes as well. The algorithms for frequent episode discovery also easily extend to the case of generalized episodes. The extra design choice that the user has in this generalized framework, is the set, I, of time interval possibilities. This can be used to orient and focus the frequent episode discovery process to come up with temporal correlations involving only time durations that are of interest. Through extensive simulations the utility and effectiveness of the generalized framework are demonstrated.
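A generalized serial episode, as described above, can be sketched as a list of nodes that each carry an event type and a set of allowed duration intervals. The sketch below rests on assumptions not stated in the abstract: the names `Event`, `Node`, `node_matches` and `serial_occurrence` are hypothetical, duration intervals are treated as half-open `(low, high)` pairs, and events are assumed to be ordered by start time.

```python
from typing import NamedTuple


class Event(NamedTuple):
    event_type: str
    start: float  # t_i
    end: float    # tau_i


class Node(NamedTuple):
    event_type: str       # g(v)
    durations: frozenset  # d(v): a subset of I, as (low, high) pairs


def node_matches(node, event):
    """An event fits a node if its type and its duration are both 'correct'."""
    dwell = event.end - event.start
    return (event.event_type == node.event_type
            and any(low <= dwell < high for low, high in node.durations))


def serial_occurrence(events, episode):
    """Return event indices of the leftmost occurrence, or None if there is none."""
    picked = []
    pos = 0
    for node in episode:
        while pos < len(events) and not node_matches(node, events[pos]):
            pos += 1
        if pos == len(events):
            return None
        picked.append(pos)
        pos += 1
    return picked


# Illustrative data only: I holds a "short" and a "long" duration interval.
SHORT, LONG = (0, 10), (10, 60)
events = [Event("A", 0, 3), Event("A", 5, 30), Event("B", 31, 35), Event("B", 40, 55)]
episode = [Node("A", frozenset({LONG})), Node("B", frozenset({SHORT}))]
print(serial_occurrence(events, episode))
```

Restricting a node's `durations` to part of I is what lets the user focus discovery on the time durations of interest, in the sense the abstract describes.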
The new algorithms for frequent episode discovery presented in this thesis are used to develop an application for temporal data mining of some data from car engine manufacturing plants. Engine manufacturing is a heavily automated and complex distributed controlled process with large amounts of faults data logged each day. The goal of temporal data mining here is to unearth some strong time-ordered correlations in the data which can facilitate quick diagnosis of root causes for persistent problems and predict major breakdowns well in advance. This thesis presents an application of the algorithms developed here for such analysis of the faults data. The data consists of time-stamped faults logged in car engine manufacturing plants of General Motors. Each fault is logged using an extensive list of codes (which constitutes the alphabet of event types for frequent episode discovery). Frequent episodes in fault logs represent temporal correlations among faults and these can be used for fault diagnosis in the plant. This thesis describes how the outputs from the frequent episode discovery framework, can be used to help plant engineers interpret the large volumes of faults logged, in an efficient and convenient manner. Such a system, based on the algorithms developed in this thesis, is currently being used in one of the engine manufacturing plants of General Motors. Some examples of the results obtained that were regarded as useful by the plant engineers are also presented.
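
Two formal results stated in this abstract can be written compactly. The notation is an assumption introduced here for illustration: f_no, f_mo and f_ni stand for the non-overlapped, minimal and non-interleaved occurrences-based frequencies of an episode α, and q*_α stands for the most likely state sequence of the EGH Λα given the event stream o.

```latex
% Ordering of the three frequencies of an episode \alpha
f_{\mathrm{no}}(\alpha) \;\le\; f_{\mathrm{mo}}(\alpha) \;\le\; f_{\mathrm{ni}}(\alpha)

% Frequency ordering of two N-node serial episodes and their EGHs
P\bigl(o,\, q^{*}_{\alpha} \mid \Lambda_{\alpha}\bigr) \;>\; P\bigl(o,\, q^{*}_{\beta} \mid \Lambda_{\beta}\bigr)
\iff f_{\alpha} > f_{\beta}
```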

14

Laxman, Srivatsan. "Discovering Frequent Episodes : Fast Algorithms, Connections With HMMs And Generalizations." Thesis, 2006. http://hdl.handle.net/2005/375.

15

Akhlaghi, Arash. "A Framework for Discovery and Diagnosis of Behavioral Transitions in Event-streams." 2013. http://scholarworks.gsu.edu/cs_diss/81.

Abstract:

Data stream mining techniques can be used to track user behaviors as users attempt to achieve their goals. Quality metrics over stream-mined models identify potential changes in user goal attainment. When the quality of some data-mined models varies significantly from that of nearby models (as defined by quality metrics), the user's behavior is automatically flagged as a potentially significant behavioral change. Decision tree, sequence pattern, and Hidden Markov modeling are used in this study. These three types of modeling can expose different aspects of the user's behavior. In the case of decision tree modeling, the specific changes in user behavior can be automatically characterized by differencing the data-mined decision-tree models. Sequence pattern modeling can shed light on how the user changes his sequence of actions, and Hidden Markov modeling can identify the learning transition points. This research describes how model-quality monitoring and these three types of modeling, as a generic framework, can aid recognition and diagnosis of behavioral changes in a case study of cognitive rehabilitation via emailing. The data stream mining techniques mentioned are used to monitor patient goals as part of a clinical plan to aid cognitive rehabilitation. In this context, real-time data mining aids clinicians in tracking user behaviors as they attempt to achieve their goals. This generic framework can be widely applicable to other real-time data-intensive analysis problems. To illustrate this, similar Hidden Markov modeling is used to analyze the transactional behavior of a telecommunication company for fraud detection. Fraud can similarly be considered a potentially significant change in transaction behavior.
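
The flagging rule described above (a model whose quality differs markedly from that of nearby models) might be sketched as below. Everything specific here is an assumption for illustration, not the author's method: the function name `flag_changes`, the neighbourhood `radius`, and the use of a mean and standard deviation test with a `threshold` multiplier.

```python
from statistics import mean, stdev


def flag_changes(quality, radius=3, threshold=2.0):
    """Return indices whose quality deviates strongly from neighbouring models.

    `quality` is a list of per-model quality scores in stream order. A model is
    flagged when its score lies more than `threshold` standard deviations from
    the mean of up to `radius` models on either side (itself excluded).
    """
    flagged = []
    for i, q in enumerate(quality):
        neighbours = quality[max(0, i - radius):i] + quality[i + 1:i + 1 + radius]
        if len(neighbours) < 2:
            continue  # too few neighbours to estimate a spread
        spread = stdev(neighbours)
        if spread > 0 and abs(q - mean(neighbours)) > threshold * spread:
            flagged.append(i)
    return flagged


# Illustrative scores only: the fifth model drops sharply in quality.
scores = [0.90, 0.91, 0.89, 0.90, 0.60, 0.91, 0.90, 0.89]
print(flag_changes(scores))
```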

16

Saradha, R. "Malware Analysis using Profile Hidden Markov Models and Intrusion Detection in a Stream Learning Setting." Thesis, 2014. http://etd.iisc.ac.in/handle/2005/3129.

Abstract:

In the last decade, many machine learning and data mining based approaches have been used in the areas of intrusion detection, malware detection and classification, and traffic analysis. In the area of malware analysis, static binary analysis techniques have become increasingly difficult because of the code obfuscation methods and code packing employed when writing the malware. Behavior-based analysis techniques are used in large malware analysis systems for this reason. In prior art, a number of clustering and classification techniques have been used to classify malware into families, and to identify new malware families, from the behavior reports. In this thesis, we have analysed in detail the use of Profile Hidden Markov models for the problem of malware classification and clustering. The advantage of building accurate models with limited examples is very helpful in early detection and modeling of malware families. The thesis also revisits the learning setting of an Intrusion Detection System (IDS) that employs machine learning for identifying attacks and normal traffic. It substantiates the suitability of the incremental learning setting (or stream-based learning setting) for the problem of learning attack patterns in IDS, when large volumes of data arrive in a stream. Related to the above problem, an elaborate survey of the IDS that use data mining and machine learning was done. Experimental evaluation and comparison show that, in terms of speed and accuracy, the stream-based algorithms perform very well as large volumes of data are presented for classification as attack or non-attack patterns. The possibilities for using stream algorithms in different problems in security are elucidated in the conclusion.

17

Saradha, R. "Malware Analysis using Profile Hidden Markov Models and Intrusion Detection in a Stream Learning Setting." Thesis, 2014. http://hdl.handle.net/2005/3129.
