Theses: "Concept Drift Detection"

1

Ostovar, Alireza. "Business process drift: Detection and characterization". Thesis, Queensland University of Technology, 2019. https://eprints.qut.edu.au/127157/1/Alireza_Ostovar_Thesis.pdf.

Abstract:

This research contributes a set of techniques for the early detection and characterization of process drifts, i.e. statistically significant changes in the behavior of business operations, as recorded in transactional data. Early detection and subsequent characterization of process drifts allow organizations to take prompt remedial actions and avoid potential repercussions resulting from unplanned changes in the behavior of their operations.

2

ESCOVEDO, TATIANA. "NEUROEVOLUTIVE LEARNING AND CONCEPT DRIFT DETECTION IN NON-STATIONARY ENVIRONMENTS". PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO, 2015. http://www.maxwell.vrac.puc-rio.br/Busca_etds.php?strSecao=resultado&nrSeq=26748@1.

Abstract:

PONTIFÍCIA UNIVERSIDADE CATÓLICA DO RIO DE JANEIRO
COORDENAÇÃO DE APERFEIÇOAMENTO DO PESSOAL DE ENSINO SUPERIOR
PROGRAMA DE EXCELENCIA ACADEMICA
Real-world concepts are often not stable: they change over time. Like the concepts themselves, the data distribution may also change. This problem of change in concepts or in the distribution of data is known as concept drift, and it is a challenge for a model in the task of learning from data. This work presents a new quantum-inspired neuroevolutionary model called NEVE (Neuro-EVolutionary Ensemble), based on an ensemble of Multi-Layer Perceptron (MLP) neural networks, for learning in non-stationary environments. It also presents a new concept drift detection mechanism, called DetectA (DETECT Abrupt), which can detect changes both proactively and reactively. The binary-real quantum-inspired evolutionary algorithm AEIQ-BR is used in NEVE to automatically generate new classifiers for the ensemble: it determines the most appropriate topology for the new network, selects the most appropriate input variables, and determines all the weights of the MLP neural network. The AEIQ-R algorithm determines the voting weight of each neural network in the ensemble; voting can be by linear combination, by weighted majority, or by simple majority. Four different approaches of NEVE are implemented, which differ from one another in how they detect and treat the drifts that occur. The work also presents results of experiments conducted with the DetectA method and with the NEVE model on real and artificial datasets. The results show that the detector proved robust and efficient for high-dimensional datasets, intermediate-sized blocks, and datasets with any proportion of drift and any class balance, and that, in general, the best results were obtained using some type of detection. Comparing the accuracy of NEVE with that of other established models in the literature, NEVE had higher accuracy in most cases. This reinforces that the neuroevolutionary ensemble approach is a robust choice for situations in which the datasets are subject to sudden changes in behavior.

3

Roded, Keren. "The concept of drift and operationalization of its detection in simulated data". Thesis, University of British Columbia, 2017. http://hdl.handle.net/2429/63135.

Abstract:

In this paper, the phenomenon of changes in item characteristics over time (often referred to as drift) is discussed from several theoretical perspectives, and a new procedure for the detection of Item Parameter Drift (IPD) is proposed. An initial evaluation of the utility of the proposed procedure is conducted using simulated data modeled by the 2-Parameter Logistic (2PL) Item Response Theory (IRT) model. In addition to the proposed procedure, an IPD analysis of the simulated data is conducted using two known methods: Kim, Cohen, and Park's (1995) extension of Lord's (1980) Chi-square test of Differential Item Functioning (DIF) to multiple groups, and logistic regression. The results indicate high agreement and accuracy in the detection of true IPD using the two known methods, but poor performance of the proposed procedure. Possible explanations of the findings and future directions are discussed.
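For orientation, the 2PL model named in the abstract is usually written as below. The notation is the conventional one and is an assumption here, not taken from the thesis: \theta_i is the ability of person i, a_j the discrimination of item j, and b_j its difficulty.

```latex
P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp\bigl(-a_j(\theta_i - b_j)\bigr)}
```

In this notation, item parameter drift would mean that a_j or b_j takes different values at different testing occasions while the model form stays the same.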
Education, Faculty of
Educational and Counselling Psychology, and Special Education (ECPS), Department of
Graduate

4

D'Ettorre, Sarah. "Fine-Grained, Unsupervised, Context-based Change Detection and Adaptation for Evolving Categorical Data". Thesis, Université d'Ottawa / University of Ottawa, 2016. http://hdl.handle.net/10393/35518.

Abstract:

Concept drift detection, the identification of changes in data distributions in streams, is critical to understanding the mechanics of data generating processes and ensuring that data models remain representative through time [2]. Many change detection methods utilize statistical techniques that take numerical data as input. However, many applications produce data streams containing categorical attributes. In this context, numerical statistical methods are unavailable, and different approaches are required. Common solutions use error monitoring, assuming that fluctuations in the error measures of a learning system correspond to concept drift [4]. There has been very little research, though, on context-based concept drift detection in categorical streams. This approach observes changes in the actual data distribution and is less popular due to the challenges associated with categorical data analysis. However, context-based change detection is arguably more informative as it is data-driven, and more widely applicable in that it can function in an unsupervised setting [4]. This study addresses this gap in the research by proposing a novel context-based change detection and adaptation algorithm for categorical data, namely Fine-Grained Change Detection in Categorical Data Streams (FG-CDCStream). This unsupervised method exploits elements of ensemble learning, a technique whereby decisions are made according to the majority vote of a set of models representing different random subspaces of the data [5]. These ideas are applied to a set of concept drift detector objects and merged with concepts from a recent, state-of-the-art, context-based change detection algorithm, the so-called Change Detection in Categorical Data Streams (CDCStream) [4].
FG-CDCStream is proposed as an extension of the batch-based CDCStream, providing instance-by-instance analysis and improving its change detection capabilities, especially in data streams containing abrupt changes or a combination of abrupt and gradual changes. FG-CDCStream also enhances the adaptation strategy of CDCStream, producing more representative post-change models.

5

Pesaranghader, Ali. "A Reservoir of Adaptive Algorithms for Online Learning from Evolving Data Streams". Thesis, Université d'Ottawa / University of Ottawa, 2018. http://hdl.handle.net/10393/38190.

Abstract:

Continuous change and development are essential aspects of evolving environments and applications, including, but not limited to, smart cities, military, medicine, nuclear reactors, self-driving cars, aviation, and aerospace. That is, the fundamental characteristics of such environments may evolve, and so cause dangerous consequences, e.g., putting people's lives at stake, if no reaction is adopted. Therefore, learning systems need to apply intelligent algorithms to monitor changes in their environments and update themselves effectively. Further, we may experience fluctuations in the performance of learning algorithms due to the nature of incoming data as it continuously evolves. That is, a learning approach that is efficient now may become obsolete after a change in data or environment. Hence, the question 'how to have an efficient learning algorithm over time against evolving data?' has to be addressed. In this thesis, we have made two contributions to address the challenges described above. In the machine learning literature, the phenomenon of (distributional) change in data is known as concept drift. Concept drift may shift decision boundaries, and cause a decline in accuracy. Learning algorithms, indeed, have to detect concept drift in evolving data streams and replace their predictive models accordingly. To address this challenge, adaptive learners have been devised which may utilize drift detection methods to locate the drift points in dynamic and changing data streams. A drift detection method able to discover the drift points quickly, with the lowest false positive and false negative rates, is preferred. A false positive is an alarm raised when no concept drift has occurred, and a false negative is a concept drift for which no alarm is raised.
In this thesis, we introduce three algorithms, called the Fast Hoeffding Drift Detection Method (FHDDM), the Stacking Fast Hoeffding Drift Detection Method (FHDDMS), and the McDiarmid Drift Detection Methods (MDDMs), for detecting drift points with the minimum delay, false positive, and false negative rates. FHDDM is a sliding window-based algorithm and applies Hoeffding’s inequality (Hoeffding, 1963) to detect concept drift. FHDDM slides its window over the prediction results, which are either 1 (for a correct prediction) or 0 (for a wrong prediction). Meanwhile, it compares the mean of elements inside the window with the maximum mean observed so far; subsequently, a significant difference between the two means, upper-bounded by the Hoeffding inequality, indicates the occurrence of concept drift. The FHDDMS extends the FHDDM algorithm by sliding multiple windows over its entries for better drift detection in terms of detection delay and false negative rate. In contrast to FHDDM/S, the MDDM variants assign weights to their entries, i.e., higher weights are associated with the most recent entries in the sliding window, for faster detection of concept drift. The rationale is that recent examples reflect the ongoing situation adequately. Then, by putting higher weights on the latest entries, we may detect concept drift quickly. An MDDM algorithm bounds the difference between the weighted mean of elements in the sliding window and the maximum weighted mean seen so far, using McDiarmid’s inequality (McDiarmid, 1989). Eventually, it alarms for concept drift once a significant difference is experienced. We experimentally show that FHDDM/S and MDDMs outperform the state-of-the-art, showing promising results in terms of the adaptation and classification measures. Due to the evolving nature of data streams, the performance of an adaptive learner, which is defined by the classification, adaptation, and resource consumption measures, may fluctuate over time.
In fact, a learning algorithm, in the form of a (classifier, detector) pair, may perform well before a concept drift point, but not after. We define this problem by the question 'how can we ensure that an efficient classifier-detector pair is present at any time in an evolving environment?' To answer this, we have developed the Tornado framework which runs various kinds of learning algorithms simultaneously against evolving data streams. Each algorithm incrementally and independently trains a predictive model and updates the statistics of its drift detector. Meanwhile, our framework monitors the (classifier, detector) pairs, and recommends the efficient one, concerning the classification, adaptation, and resource consumption performance, to the user. We further define the holistic CAR measure that integrates the classification, adaptation, and resource consumption measures for evaluating the performance of adaptive learning algorithms. Our experiments confirm that the most efficient algorithm may differ over time because of the developing and evolving nature of data streams.
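The FHDDM mechanism described in this abstract can be sketched roughly as follows. This is a minimal illustration written from the abstract's description plus the standard Hoeffding bound for the mean of n values in [0, 1]; the class name `FHDDMSketch`, the helper `first_drift`, and the default `window_size` and `delta` values are assumptions for the example, not the thesis's implementation or the Tornado API.

```python
import math
from collections import deque


class FHDDMSketch:
    """Sliding-window drift detector in the spirit of FHDDM (illustrative only)."""

    def __init__(self, window_size=25, delta=1e-6):
        # window_size and delta are arbitrary illustrative defaults.
        self.window_size = window_size
        # Hoeffding bound for the mean of window_size values in [0, 1].
        self.epsilon = math.sqrt(math.log(1.0 / delta) / (2.0 * window_size))
        self.reset()

    def reset(self):
        """Forget the window and the best mean seen so far."""
        self.window = deque(maxlen=self.window_size)
        self.max_mean = 0.0

    def update(self, correct):
        """Feed 1 for a correct prediction, 0 for a wrong one; return True on drift."""
        self.window.append(1 if correct else 0)
        if len(self.window) < self.window_size:
            return False  # wait until the window is full
        mean = sum(self.window) / self.window_size
        self.max_mean = max(self.max_mean, mean)
        if self.max_mean - mean > self.epsilon:
            self.reset()
            return True
        return False


def first_drift(stream, **kwargs):
    """Return the index of the first drift alarm in stream, or None."""
    detector = FHDDMSketch(**kwargs)
    for index, correct in enumerate(stream):
        if detector.update(correct):
            return index
    return None
```

Feeding the detector a run of correct predictions followed by a run of wrong ones should raise an alarm a few instances after the change, while a stream with stable accuracy should raise none. Replacing the plain mean with a weighted mean and the Hoeffding bound with a McDiarmid bound would move the sketch toward the MDDM idea, but that variant is not shown here.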

6

Henke, Márcia. "Deteção de Spam baseada na evolução das características com presença de Concept Drift". Universidade Federal do Amazonas, 2015. http://tede.ufam.edu.br/handle/tede/4708.

Abstract:

Previous issue date: 2015-03-30
CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
Electronic messages (e-mails) are still considered among the most important tools in business and personal use because of their low cost and ease of access. However, e-mail has become a major problem owing to the large volume of unwanted messages, known as spam, that fill users' mailboxes. Among the many problems caused by spam, one stands out: it is currently the main vector for spreading malicious activity such as viruses, worms, trojans, phishing, and botnets. Such activities give the attacker improper access to confidential data and trade secrets, or let the attacker invade the victims' privacy to gain some advantage. Several approaches, both commercial and academic, have been proposed to prevent the sending of unsolicited e-mail, such as filters implemented in e-mail servers, spam classification mechanisms that let users define when a particular subject or author is a source of spam, and even filters implemented in network hardware components. In general, e-mail filtering approaches analyze message content to determine whether or not a message is spam. A major problem with this approach is spam detection in the presence of concept drift. The literature defines concept drift as changes that occur in the concept of the data over time, such as a change in the features that describe an attack or the appearance of new features. Many Intrusion Detection Systems (IDS) use machine learning techniques to monitor the classification error rate in order to detect change. However, by the time detection occurs, some damage has already been done to the system, which requires updating the classification process and intervention by the system operator. To minimize these problems, this thesis proposes a change detection method named Method oriented to the Analysis of the Development of Attacks Characteristics (MECA). The proposed method consists of three steps: 1) classification model training; 2) concept drift detection; and 3) transfer learning.
The first step generates classification models as is commonly done in machine learning. The second step introduces two new strategies to work around concept drift: HFS (Historical-based Features Selection), which analyzes the evolution of the features based on their history over time; and SFS (Similarity-based Features Selection), which analyzes the evolution of the features from the level of similarity between the feature vectors of the source and target domains. Finally, the third step addresses the following questions: what, how, and when to transfer acquired knowledge. The answer to the first question is provided by the concept drift detection strategies, which identify the new features and store them for transfer. To answer the second question, the feature-representation transfer approach is employed. Finally, the new knowledge is transferred as soon as changes that compromise the performance of the classification task are identified. The proposed method was developed and validated using two public datasets, one of which was built during this thesis. The experimental results showed that it is possible to infer a threshold for detecting changes, so that the classification model is kept up to date through knowledge transfer. In addition, the MECA architecture can perform the classification task and concept drift detection as two parallel and independent tasks. Finally, MECA uses the SVM (Support Vector Machines) machine learning algorithm, which adheres less closely to the training samples. The results obtained with MECA showed that changes can be detected by monitoring feature evolution before the classification model degrades significantly.

7

SANTOS, Silas Garrido Teixeira de Carvalho. "Avaliação criteriosa dos algoritmos de detecção de concept drifts". Universidade Federal de Pernambuco, 2015. https://repositorio.ufpe.br/handle/123456789/17310.

Abstract:

Previous issue date: 2015-02-27
FACEPE
Knowledge extraction from data streams is an activity in steadily growing demand. Examples of such applications include monitoring the purchase history of customers, movement data from sensors, or water temperatures. The algorithms used for this purpose must therefore be updated constantly, adapting to new instances while taking computational constraints into account. When working in environments with a continuous flow of data, there is no guarantee that the distribution of the data will remain stationary. On the contrary, several changes may occur over time, triggering situations commonly known as concept drift. In this work we present a comparative study of some of the main drift detection methods: ADWIN, DDM, DOF, ECDD, EDDM, PL and STEPD. The experiments used artificial datasets, simulating abrupt, fast gradual, and slow gradual changes, as well as datasets from real problems. The results were analyzed based on accuracy, runtime, memory usage, average time to change detection, and number of false positives and false negatives. The parameters of the methods were set using an adapted version of a genetic algorithm. According to the results of the Friedman test with the Nemenyi post-hoc test, in terms of accuracy, DDM was the most efficient method on the datasets used, and was statistically superior to DOF and ECDD. EDDM was the fastest method and also the most economical in memory usage, being statistically superior to DOF, ECDD, PL and STEPD in both cases. It was concluded that methods that are more sensitive to change, and therefore more prone to false alarms, achieve better results than methods that are less sensitive and less susceptible to false alarms.

8

Dal Pozzolo, Andrea. "Adaptive Machine Learning for Credit Card Fraud Detection". Doctoral thesis, Université Libre de Bruxelles, 2015. http://hdl.handle.net/2013/ULB-DIPOT:oai:dipot.ulb.ac.be:2013/221654.

Abstract:

Billions of dollars of loss are caused every year by fraudulent credit card transactions. The design of efficient fraud detection algorithms is key to reducing these losses, and more and more algorithms rely on advanced machine learning techniques to assist fraud investigators. The design of fraud detection algorithms is, however, particularly challenging due to the non-stationary distribution of the data, the highly unbalanced class distributions, and the availability of few transactions labeled by fraud investigators. At the same time, public data are scarce owing to confidentiality issues, leaving many questions about the best strategy unanswered. In this thesis we aim to provide some answers by focusing on crucial issues such as: i) why and how undersampling is useful in the presence of class imbalance (i.e. frauds are a small percentage of the transactions), ii) how to deal with unbalanced and evolving data streams (non-stationarity due to fraud evolution and change of spending behavior), iii) how to assess performance in a way which is relevant for detection, and iv) how to use the feedback provided by investigators on the fraud alerts generated. Finally, we design and assess a prototype of a Fraud Detection System that can meet real-world working conditions and integrate investigators’ feedback to generate accurate alerts.
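As a rough illustration of the undersampling mentioned in point i), the sketch below keeps every minority-class record and a random subset of the majority class. It is generic random undersampling, not the specific procedure analyzed in the thesis; the function name `undersample`, the `ratio` and `seed` parameters, and the convention that label 1 marks fraud are all assumptions made for the example.

```python
import random


def undersample(records, labels, ratio=1.0, seed=0):
    """Keep all minority (label 1) records and a random subset of the majority.

    ratio is the desired number of majority records per minority record.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    # Never request more majority records than exist.
    keep = min(len(majority), int(round(ratio * len(minority))))
    chosen = sorted(minority + rng.sample(majority, keep))
    return [records[i] for i in chosen], [labels[i] for i in chosen]
```

With 5 frauds among 1000 transactions and `ratio=1.0`, the result holds the 5 frauds plus 5 randomly chosen legitimate transactions, so a classifier trained on it sees balanced classes at the cost of discarding most of the majority data.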
Doctorat en Sciences

9

Dong, Yue. "Higher Order Neural Networks and Neural Networks for Stream Learning". Thesis, Université d'Ottawa / University of Ottawa, 2017. http://hdl.handle.net/10393/35731.

Abstract:

The goal of this thesis is to explore some variations of neural networks. The thesis is mainly split into two parts: a variation of the shaping functions in neural networks and a variation of learning rules in neural networks. In the first part, we mainly investigate polynomial perceptrons, i.e. perceptrons with a polynomial shaping function instead of a linear one. We prove the polynomial perceptron convergence theorem and illustrate the notion by showing empirically, with an implementation, that a higher order perceptron can learn the XOR function. In the second part, we propose three models (SMLP, SA, SA2) for stream learning and anomaly detection in streams. The main technique allowing these models to perform at a level comparable to the state-of-the-art algorithms in stream learning is the learning rule used. We employ the mini-batch gradient descent algorithm and the stochastic gradient descent algorithm to speed up the models. In addition, the use of multi-threaded parallel processing makes the proposed methods highly efficient in dealing with streaming data. Our analysis shows that all models have linear runtime and constant memory requirement. We also demonstrate empirically that the proposed methods feature high detection rate, low false alarm rate, and fast response. The paper on the first two models (SMLP, SA) is published in the 29th Canadian AI Conference and won the best paper award. The invited journal paper on the third model (SA2) for Computational Intelligence is under peer review.
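The claim that a higher order perceptron can learn XOR can be illustrated with a small sketch. It uses the classic perceptron update rule on a second-order feature expansion (bias, x1, x2, and the product x1*x2); the function names and the choice of expansion are assumptions for illustration and are not taken from the thesis.

```python
from itertools import product


def features(x1, x2):
    """Second-order expansion: bias, linear terms, and the product term."""
    return (1.0, x1, x2, x1 * x2)


def predict(weights, x1, x2):
    """Threshold unit applied to the polynomial features."""
    s = sum(w * f for w, f in zip(weights, features(x1, x2)))
    return 1 if s > 0 else 0


def train_xor(max_epochs=100, lr=1.0):
    """Train with the perceptron rule until an epoch has no mistakes."""
    data = [((x1, x2), x1 ^ x2) for x1, x2 in product((0, 1), repeat=2)]
    weights = [0.0] * 4
    for _ in range(max_epochs):
        errors = 0
        for (x1, x2), target in data:
            delta = target - predict(weights, x1, x2)
            if delta:
                errors += 1
                weights = [w + lr * delta * f
                           for w, f in zip(weights, features(x1, x2))]
        if errors == 0:
            return weights
    raise RuntimeError("did not converge within max_epochs")
```

XOR is not linearly separable in (x1, x2) alone, but it becomes separable once the product term is added, which is why the ordinary perceptron rule converges here.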

10

Togbe, Maurras Ulbricht. "Détection distribuée d'anomalies dans les flux de données". Electronic Thesis or Diss., Sorbonne université, 2022. http://www.theses.fr/2022SORUS400.

Abstract:

Anomaly detection is an important issue in many application areas such as healthcare, transportation, and industry. It is a topical subject that tries to meet the ever-increasing demand in areas such as intrusion detection and fraud detection. In this thesis, after a comprehensive general state of the art, the unsupervised method Isolation Forest (IForest) is studied in depth, and limitations that have not been addressed in the literature are presented. Our new version of IForest, called Majority Voting IForest, improves its execution time. Our ADWIN-based IForest ASD and NDKSWIN-based IForest ASD methods allow the detection of anomalies in data streams with better handling of concept drift. Finally, distributed anomaly detection using IForest has been studied and evaluated. All our proposals have been validated with experiments on different datasets.

11

Costa, Fausto Guzzo da. "Employing nonlinear time series analysis tools with stable clustering algorithms for detecting concept drift on data streams". Universidade de São Paulo, 2017. http://www.teses.usp.br/teses/disponiveis/55/55134/tde-13112017-105506/.

Abstract (summary):

Several industrial, scientific and commercial processes produce open-ended sequences of observations, which are referred to as data streams. We can understand the phenomena responsible for such streams by analyzing the data in terms of their inherent recurrences and behavior changes. Recurrences support the inference of more stable models, which are, however, undermined by behavior changes. External influences are regarded as the main agents acting on the underlying phenomena to produce such modifications over time, such as new investments and market policies affecting stocks, human intervention in the climate, etc. In the context of Machine Learning, there is a vast research branch that investigates the detection of such behavior changes, which are also referred to as concept drifts. By detecting drifts, one can indicate the best moments to update the model, thereby improving prediction results, the understanding and, eventually, the control of the influences governing the data stream. There are two main concept drift detection paradigms: the first based on supervised and the second on unsupervised learning algorithms. The former faces serious difficulties because labeling is infeasible when streams are produced at high frequencies and in large volumes. The latter lacks theoretical foundations to provide detection guarantees. In addition, neither paradigm adequately represents temporal dependencies among data observations. In this context, we introduce a novel approach to detect concept drifts by tackling two deficiencies of both paradigms: i) the instability involved in data modeling, and ii) the lack of time dependency representation. Our unsupervised approach is motivated by Carlsson and Memoli's theoretical framework, which ensures a stability property for hierarchical clustering algorithms with respect to data permutation.
To take full advantage of this framework, we employed Takens' embedding theorem to make data statistically independent after being mapped to phase spaces. Independent data were then grouped using the Permutation-Invariant Single-Linkage Clustering Algorithm (PISL), an adapted version of the agglomerative Single-Linkage algorithm that respects the stability property proposed by Carlsson and Memoli. Our algorithm outputs dendrograms (seen as data models), which are proven to be equivalent to ultrametric spaces; concept drifts can therefore be detected by comparing consecutive ultrametric spaces using the Gromov-Hausdorff (GH) distance. As a result, model divergences are indeed associated with data changes. We performed two main experiments to compare our approach with others from the literature, one considering abrupt and the other gradual changes. The results confirm that our approach is capable of detecting both abrupt and gradual concept drifts, although it is better suited to more complicated scenarios. The main contributions of this thesis are: i) the use of Takens' embedding theorem as a tool to provide statistical independence to data streams; ii) the implementation of PISL in conjunction with GH (called PISLGH); iii) a comparison of detection algorithms in different scenarios; and, finally, iv) an R package (called streamChaos) that provides tools for processing nonlinear data streams as well as other algorithms to detect concept drifts.
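The phase-space mapping mentioned in the abstract above is commonly realized as a time-delay embedding. The following is a minimal sketch under that assumption; the function name and the parameters `dim` (embedding dimension) and `tau` (time delay) are illustrative and are not taken from the thesis or its streamChaos package.

```python
def delay_embed(series, dim, tau):
    """Map a scalar series to delay vectors (x[i], x[i+tau], ..., x[i+(dim-1)*tau])."""
    if dim < 1 or tau < 1:
        raise ValueError("dim and tau must be positive integers")
    n_vectors = len(series) - (dim - 1) * tau
    if n_vectors <= 0:
        raise ValueError("series too short for the requested embedding")
    return [
        tuple(series[i + j * tau] for j in range(dim))
        for i in range(n_vectors)
    ]


# Each window of the stream becomes a set of points in a dim-dimensional space.
window = [0, 1, 2, 3, 4, 5]
points = delay_embed(window, dim=3, tau=2)
```

In a pipeline like the one the abstract describes, the resulting points would then be clustered, and the models built from consecutive windows compared; those later steps are not sketched here.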

12

Albakour, Subhy. "Stream-automl: automated machine learning over imbalanced data streams for bipartite ranking problems". Electronic Thesis or Diss., Institut polytechnique de Paris, 2024. http://www.theses.fr/2024IPPAT015.

Abstract (summary):

Despite its popularity in the scientific literature, stream learning has yet to demonstrate its practical utility in industrial applications. Characterized by an incessant influx of high-velocity, voluminous, and constantly changing data, online marketing seems to be the favorite candidate for stream learning to make its entry into industry. In this context, state-of-the-art stream learning is of limited use, as it mainly focuses on classification, while bipartite ranking is a better model of the online marketing problem. Recently, the combination of stream learning and AutoML, i.e., Stream-AutoML, has been drawing more attention from the scientific community. This work investigates the applicability of Stream-AutoML to bipartite ranking problems when data is imbalanced. We begin by developing a framework to execute and evaluate Stream-AutoML pipelines of stream learning models. Then we propose a framework for computing AUC-ROC incrementally and for introducing exponential decay to serve as a forgetting mechanism. We also propose a framework for concept drift detection using AUC-ROC, for which we develop six statistical tests for differences in AUC-ROC with theoretical bounds on type I and type II errors. Finally, we propose four data generators that enrich the toolkit for evaluating concept drift detectors in controlled environments. The results show that the proposed methods considerably reduce the resources allocated to evaluation and detect concept drifts with very few false positives. These contributions prepare the ground for Stream-AutoML to solve bipartite ranking problems, which can then be exploited in online marketing applications. Optimized implementations of the proposed methods were developed and have already been adopted in the online marketing product of IDAaaS.

13

Wan, Jones Sai-Wang, and 尹世泓. "Concept Drift Detection Based on Pre-Clustering and Statistical Testing". Thesis, 2017. http://ndltd.ncl.edu.tw/handle/4298j5.

Abstract (summary):

Master's thesis
National Taiwan University (國立臺灣大學)
Graduate Institute of Electrical Engineering
105
Stream data mining is a common data mining task in real-world applications. However, it is challenging because of the nature of real-world data streams, especially concept drift. When data labels are unavailable, handling concept drift requires a drift detection method. In this paper, we propose a drift detection method based on a statistical test with clustering as preprocessing, and we reduce the execution time by using principal component analysis (PCA) for feature extraction. Experimental results on synthetic and real-world streaming data show that the clustering preprocessing improves drift detection performance, and that feature extraction trades an insignificant loss in detection performance for a large speed-up in execution time.
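As a rough illustration of label-free drift detection with a statistical test, the sketch below compares a reference window with a current window of a single feature. The abstract does not name the test it uses, so the two-sample Kolmogorov-Smirnov test here is a stand-in, and the clustering and PCA steps are omitted; all names and the `alpha` default are assumptions.

```python
import math
from bisect import bisect_right


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def drift_detected(reference, current, alpha=0.05):
    """Flag drift when the KS statistic exceeds its asymptotic critical value."""
    n, m = len(reference), len(current)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    threshold = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > threshold
```

A multivariate stream would need one test per feature (or per extracted component) plus a correction for multiple testing, which this sketch leaves out.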

14

Chang, Chuan-nan, and 張全男. "Classification of time-changing data streams based on concept drift detection". Thesis, 2008. http://ndltd.ncl.edu.tw/handle/14813010931947505640.

Abstract (summary):

Master's thesis
Nanhua University (南華大學)
Graduate Institute of Information Management
96
This paper addresses the problem of classifying data streams in which the underlying concept changes over time (concept drift). Because the data grow continuously, the one-pass constraint prevents us from revisiting historical data. Several algorithms have already been developed for this setting. However, they focus on how long retained data remain valid and neglect the trial-and-error cost paid to establish that validity period, as well as the maintenance cost wasted while the concept is stable. Classification based on concept drift detection can avoid these problems. In practice, however, the limitations of the detection method may cause efficiency problems on multi-class data. We therefore propose, on a statistical foundation, a detection algorithm that uses the chi-square test, called CDDC (Concept Drift Detection of Chi-Square). In addition, with respect to how drift is modeled, we revise the notion of an "attribute value-class-concept unit" to an "attribute value-concept unit".
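To make the chi-square idea concrete, the sketch below computes the Pearson chi-square statistic between the category counts of a reference window and a current window. It is a generic homogeneity test, not the CDDC algorithm itself, whose details the abstract does not give; the function name, the example counts, and the use of a fixed critical value are assumptions.

```python
def chi_square_statistic(reference_counts, current_counts):
    """Pearson chi-square statistic for a 2 x k table of category counts."""
    categories = sorted(set(reference_counts) | set(current_counts))
    rows = [
        [reference_counts.get(c, 0) for c in categories],
        [current_counts.get(c, 0) for c in categories],
    ]
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    total = sum(row_totals)
    if total == 0:
        return 0.0
    stat = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


# Critical value of the chi-square distribution at the 0.05 level, 1 degree of freedom.
CRITICAL_DF1 = 3.841

stable = chi_square_statistic({"a": 50, "b": 50}, {"a": 50, "b": 50})
drifted = chi_square_statistic({"a": 50, "b": 50}, {"a": 90, "b": 10})
drift_flag = drifted > CRITICAL_DF1
```

With k categories the test has k - 1 degrees of freedom, so the critical value must be chosen to match the number of attribute values being compared.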

15

Chiu, Yao-Ching, and 邱耀慶. "A Parallel Detection and Prediction Method for Concept Drift in Dynamic Data Driven Application System". Thesis, 2015. http://ndltd.ncl.edu.tw/handle/e864zc.

Abstract (summary):

Master's thesis
National Chiao Tung University (國立交通大學)
Institute of Information Management
103
Traditional data analysis and prediction methods assume that the data distribution is stable, so unlabeled data can be predicted precisely by analyzing historical data. However, in today’s frequently changing big-data environment, the traditional approach is no longer effective; it cannot handle concept drift in a Dynamic Data Driven Application System (DDDAS). This thesis proposes a parallel detection and prediction method for concept drift in DDDAS. The proposed method detects changes in the data and feeds them back to the prediction model to improve subsequent predictions. Furthermore, the method computes a global prediction by aggregating local predictions, which increases prediction accuracy and decreases computation time. In the simulation, Map-Reduce is used for parallel processing, and two cases are tested. The results show that prediction accuracy rises by 14% and 35% for the two cases, respectively, and that execution time improves by almost 45% and 29%, respectively.

16

Farid, D. M., L. Zhang, A. Hossain, C. M. Rahman, R. Strachan, G. Sexton and Keshav P. Dahal. "An adaptive ensemble classifier for mining concept drifting data streams". 2013. http://hdl.handle.net/10454/9573.

Abstract (summary):

It is challenging to use traditional data mining techniques to deal with real-time data stream classifications. Existing mining classifiers need to be updated frequently to adapt to the changes in data streams. To address this issue, in this paper we propose an adaptive ensemble approach for classification and novel class detection in concept drifting data streams. The proposed approach uses traditional mining classifiers and updates the ensemble model automatically so that it represents the most recent concepts in data streams. For novel class detection we consider the idea that data points belonging to the same class should be closer to each other and should be far apart from the data points belonging to other classes. If a data point is well separated from the existing data clusters, it is identified as a novel class instance. We tested the performance of this proposed stream classification model against that of existing mining algorithms using real benchmark datasets from UCI (University of California, Irvine) machine learning repository. The experimental results prove that our approach shows great flexibility and robustness in novel class detection in concept drifting and outperforms traditional classification models in challenging real-life data stream applications. (C) 2013 Elsevier Ltd. All rights reserved.

17

Renda, Alessandro. "Algorithms and techniques for data stream mining". Doctoral thesis, 2021. http://hdl.handle.net/2158/1235915.

Abstract (summary):

The abstraction of data streams encompasses a vast range of diverse applications that continuously generate data and therefore require dedicated algorithms and approaches for exploitation and mining. In this framework both unsupervised and supervised approaches are generally employed, depending on the task and on the availability of annotated data. This thesis proposes novel algorithms and techniques specifically tailored for the streaming setting and for knowledge discovery from Social Networks. In the first part of this work we propose a novel clustering algorithm for data streams. Our investigation stems from the discussion of general challenges posed by cluster analysis and of those purely related to the streaming setting. First, we propose SF-DBSCAN (streaming fuzzy DBSCAN) a preliminary solution conceived as an extension of the popular DBSCAN algorithm. SF-DBSCAN handles the arrival of new objects and continuously updates the clustering result by taking advantage of concepts from fuzzy set theory. However, it gives equal importance to every collected object and therefore is not suitable to manage unbounded data streams and to adapt to evolving settings. Then, we introduce TSF-DBSCAN, a novel "temporal" adaptation of streaming fuzzy DBSCAN: it overcomes the limits of the previous proposal and proves to be effective in handling evolving and potentially unbounded data streams, discovering clusters with fuzzy overlapping borders. In the second part of the thesis we explore a supervised learning application: the goal of our analysis is to discover the public opinion towards the vaccination topic in Italy, by exploiting the popular Twitter platform as data source. First, we discuss the design and development of a system for stance detection from text. The deployment of the classification model for the online monitoring of the public opinion, however, cannot ignore that tweets can be seen as a particular form of a temporal data stream. 
Then, we discuss the importance of leveraging user-related information, which enables the design of a set of techniques aimed at deepening and enhancing the analysis. Finally, we compare different learning schemes for addressing concept-drift, i.e. a change in the underlying data distribution, in a dynamic environment affected by the occurrence of real world context-related events. In this case study and throughout the thesis, the proposal of algorithms and techniques is supported by in-depth experimental analysis.

18

Obenauff, Alexander. "A progressive learning method for classification of manufacturing errors based on machine data". Master's thesis, 2019. http://hdl.handle.net/10362/76579.

Abstract (summary):

Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
Manufacturing companies face significant market pressure in today’s globalised world. Fierce global competition and product individualisation mean that production systems require continuous optimisation. This means that automation, flexibility and efficiency have all become vital elements for manufacturers. In this paper, a method based on incremental classification used for manufacturing errors is presented. The analysis and classification focus on data of binary form collected from a machine control unit during manufacturing operation in real time. Various methods that can learn from data incrementally and autonomously are to be applied. The training starts with the least amount of data possible and other important steps like data preprocessing are reviewed under the aspect of incremental learning.

19

Wu, Tsun-Yuan, and 吳存媛. "Faults and Concept Drifts Detection and Adaptation of Wind Turbines". Thesis, 2019. http://ndltd.ncl.edu.tw/handle/e5vkmq.
