Theses: "Data Subgroup"

1

Atzmüller, Martin. "Knowledge-intensive subgroup mining : techniques for automatic and interactive discovery /". Berlin : Aka, 2007. http://deposit.d-nb.de/cgi-bin/dokserv?id=2928288&prov=M&dok_var=1&dok_ext=htm.

2

Atzmüller, Martin. "Knowledge-intensive subgroup mining techniques for automatic and interactive discovery". Berlin Aka, 2006. http://deposit.d-nb.de/cgi-bin/dokserv?id=2928288&prov=M&dok_var=1&dok_ext=htm.

3

Belfodil, Aimene. "An order theoretic point-of-view on subgroup discovery". Thesis, Lyon, 2019. http://www.theses.fr/2019LYSEI078.

Abstract

As the title suggests, the aim of this thesis is to provide an order-theoretic point of view on the task of subgroup discovery. Subgroup discovery (SD) is the automatic task of discovering interesting hypotheses in databases. That is, given a database, the hypothesis space the analyst wants to explore, and a formal way of gauging the quality of the hypotheses (e.g., a quality measure), the automated task of subgroup discovery aims to extract the best hypotheses with respect to these three parameters. To design fast and efficient algorithms for subgroup discovery, one should understand the properties of the hypothesis space on the one hand and the properties of the quality measure on the other. In this thesis, we extend the state of the art by: (i) providing a unified view of the hypothesis spaces behind subgroup discovery using the well-founded mathematical tool of order theory; (ii) proposing the new hypothesis space of conjunctions of linear inequalities in numerical databases, together with algorithms for enumerating its elements; and (iii) proposing an anytime algorithm, that is, one that delivers results progressively, for discriminative subgroup discovery on numerical datasets, with guarantees on the quality of the extracted subgroups even if the algorithm is interrupted.
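The three ingredients named in the abstract above (a database, a hypothesis space, and a quality measure) can be illustrated with a minimal sketch. It is not the thesis's method: the toy rows, the attribute names, the choice of weighted relative accuracy (WRAcc) as the quality measure, and the helper names `covers`, `wracc` and `best_subgroups` are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical toy database: each row is (attributes, binary target label).
ROWS = [
    ({"age": "young", "smoker": "yes"}, 1),
    ({"age": "young", "smoker": "no"}, 0),
    ({"age": "old", "smoker": "yes"}, 1),
    ({"age": "old", "smoker": "yes"}, 1),
    ({"age": "old", "smoker": "no"}, 0),
    ({"age": "young", "smoker": "no"}, 0),
]


def covers(description, attrs):
    """A description is a conjunction of (attribute, value) selectors."""
    return all(attrs.get(key) == value for key, value in description)


def wracc(description, rows):
    """Weighted relative accuracy: coverage * (subgroup rate - overall rate)."""
    n_total = len(rows)
    base_rate = sum(label for _, label in rows) / n_total
    covered = [label for attrs, label in rows if covers(description, attrs)]
    if not covered:
        return 0.0
    coverage = len(covered) / n_total
    return coverage * (sum(covered) / len(covered) - base_rate)


def best_subgroups(rows, max_depth=2, top_k=3):
    """Exhaustively score every conjunction of up to max_depth selectors."""
    selectors = sorted({(k, v) for attrs, _ in rows for k, v in attrs.items()})
    candidates = []
    for depth in range(1, max_depth + 1):
        for description in combinations(selectors, depth):
            # Skip conjunctions that constrain the same attribute twice.
            if len({key for key, _ in description}) < depth:
                continue
            candidates.append((wracc(description, rows), description))
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return candidates[:top_k]
```

Calling `best_subgroups(ROWS)` returns the highest-scoring descriptions first. Exhaustive enumeration like this only works for tiny hypothesis spaces, which is why the structure of the search space matters for real algorithms.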

4

Mistry, Dipesh. "Recursive partitioning based approaches for low back pain subgroup identification in individual patient data meta-analyses". Thesis, University of Warwick, 2014. http://wrap.warwick.ac.uk/64032/.

Abstract

This thesis presents two novel approaches for performing subgroup analyses, or identifying subgroups, in an individual patient data (IPD) meta-analysis setting. The work originated from an important research priority in the area of low back pain (LBP): identifying subgroups that most (or least) benefit from treatment. Typically, a subgroup is evaluated by applying a statistical test for interaction between a baseline characteristic and treatment. A systematic review found that subgroup analyses in the area of LBP are severely underpowered and of rather poor quality (Chapter 4). IPD meta-analyses provide an ideal framework, with improved statistical power, to investigate and identify subgroups. However, conventional approaches to subgroup analysis, applied in both a single-trial setting and an IPD setting, have a number of issues, one being that subgroups are typically investigated one at a time. As individuals have multiple characteristics that may be related to response to treatment, alternative statistical methods are required. Tree-based methods are a promising alternative that systematically search the entire covariate space to identify subgroups defined by multiple characteristics. In this work, a number of relevant tree methods, namely the Interaction Tree (IT), the Simultaneous Threshold Interaction Modelling Algorithm (STIMA), and Subpopulation Identification based on a Differential Effect Search (SIDES), were identified and evaluated in a single-trial setting in a simulation study. The most promising methods (IT and SIDES) were extended for application in an IPD meta-analysis setting by incorporating fixed-effect and mixed-effect models to account for the within-trial clustering in the hierarchical data structure, and were again assessed in a simulation study. Thus, this work proposes two statistical approaches to subgroup analysis or subgroup identification in an IPD meta-analysis framework. Though the application is based in an LBP setting, the extensions are applicable in any research discipline where subgroup analysis in an IPD meta-analysis setting is of interest.

5

Doubleday, Kevin. "Generation of Individualized Treatment Decision Tree Algorithm with Application to Randomized Control Trials and Electronic Medical Record Data". Thesis, The University of Arizona, 2016. http://hdl.handle.net/10150/613559.

Abstract

With new treatments and novel technology available, personalized medicine has become a key topic in the new era of healthcare. Traditional statistical methods for personalized medicine and subgroup identification primarily focus on single-treatment or two-arm randomized control trials (RCTs). With restricted inclusion and exclusion criteria, data from RCTs may not reflect real-world treatment effectiveness. However, electronic medical records (EMR) offer an alternative venue. In this paper, we propose a general framework to identify individualized treatment rules (ITR), which connects subgroup identification methods and ITR. It is applicable to both RCT and EMR data. Given the large scale of EMR datasets, we develop a recursive partitioning algorithm to solve the problem (ITR-Tree). A variable importance measure is also developed for personalized medicine using random forest. We demonstrate our method through simulations, and apply ITR-Tree to datasets from diabetes studies using both RCT and EMR data. A software package is available at https://github.com/jinjinzhou/ITR.Tree.

6

Mueller, Marianne Larissa [Verfasser], Stefan [Akademischer Betreuer] Kramer and Frank [Akademischer Betreuer] Puppe. "Data Mining Methods for Medical Diagnosis : Test Selection, Subgroup Discovery, and Constrained Clustering / Marianne Larissa Mueller. Gutachter: Stefan Kramer ; Frank Puppe. Betreuer: Stefan Kramer". München : Universitätsbibliothek der TU München, 2012. http://d-nb.info/1024964264/34.

7

Li, Rui [Verfasser], Burkhard [Akademischer Betreuer] [Gutachter] Rost and Stefan [Gutachter] Kramer. "Data Mining and Machine Learning Methods for High-dimensional Patient Data in Dementia Research: Voxel Features Mining, Subgroup Discovery and Multi-view Learning / Rui Li ; Gutachter: Burkhard Rost, Stefan Kramer ; Betreuer: Burkhard Rost". München : Universitätsbibliothek der TU München, 2017. http://d-nb.info/1125018224/34.

8

Domingue, Jean-Laurent. "Nurses’ Knowledge, Attitudes and Documentation Practices in a Context of HIV Criminalization: A Secondary Subgroup Analysis of Data from California, Florida, New York, and Texas Nurses". Thesis, Université d'Ottawa / University of Ottawa, 2016. http://hdl.handle.net/10393/35570.

Abstract

Under international legal norms, HIV criminalization is considered to be an overly broad use of criminal law. In the United States, at least 33 states have HIV-specific criminal laws. Data from California, Florida, New York, and Texas nurses provided exemplars from different HIV-related criminal law approaches and the impact of those laws on nurses’ practices. Nurses who cared for patients who expressed fears or concerns about HIV criminalization or patients who had been arrested for HIV-related crimes were more likely to correctly identify the presence or absence of HIV-specific laws in the states where they practised, when compared to nurses who did not care for such patients. Lack of knowledge about HIV-related criminal laws may erode the nurse-patient relationship. Jurisdiction specific education should be created and offered to nurses in order to address this knowledge gap and protect the dignity of people living with HIV.

9

Belfodil, Adnene. "Exceptional model mining for behavioral data analysis". Thesis, Lyon, 2019. http://www.theses.fr/2019LYSEI086.

Abstract

With the rapid proliferation of data platforms collecting and curating data related to various domains, such as government data, education data, environment data or product ratings, more and more data are available online. This offers an unparalleled opportunity to study the behavior of individuals and the interactions between them. In the political sphere, being able to query datasets of voting records provides interesting insights for data journalists and political analysts. In particular, such data can be leveraged to investigate exceptionally consensual or controversial topics. Consider data describing voting behavior in the European Parliament (EP). Such a dataset records the votes of each member (MEP) in voting sessions held in the parliament, as well as information on the parliamentarians (e.g., gender, national party, European party alliance) and the sessions (e.g., topic, date). This dataset offers opportunities to study the agreement or disagreement of coherent subgroups, especially to highlight unexpected behavior. It is to be expected that in the majority of voting sessions, MEPs will vote along the lines of their European party alliance. However, when matters are of interest to a specific nation within Europe, alignments may change and agreements can be formed or dissolved. For instance, when a legislative procedure on fishing rights is put before the MEPs, MEPs from the island nation of the UK can be expected to agree on a specific course of action regardless of their party alliance, fostering an exceptional agreement where strong polarization exists otherwise. In this thesis, we aim to discover such exceptional (dis)agreement patterns not only in voting data but also in more generic data, called behavioral data, which involves individuals performing observable actions on entities. We devise two novel, complementary methods called Debunk and Deviant. Debunk discovers exceptional (dis)agreement between groups, while Deviant highlights exceptional behavior within a single group. Together, these two approaches ideally enable the implementation of a sufficiently comprehensive tool to highlight, summarize and analyze exceptional behaviors in behavioral data. We thoroughly investigate the qualitative and quantitative performance of the devised methods on several real-world datasets. Furthermore, we motivate their usage in the context of computational journalism.

10

Wesley, S. Scott. "Background data subgroups and career outcomes : some developmental influences on person job-matching". Diss., Georgia Institute of Technology, 1989. http://hdl.handle.net/1853/31065.

11

Lütz, Elin. "Unsupervised machine learning to detect patient subgroups in electronic health records". Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-251669.

Abstract

The use of Electronic Health Records (EHR) for reporting patient data has been widely adopted by healthcare providers. This data can encompass many forms of medical information, such as disease symptoms, results from laboratory tests, ICD-10 classes and other patient information. Structured EHR data is often high-dimensional and contains many missing values, which complicates many computing problems. Detecting meaningful structures in EHR data could provide useful insights for diagnosis detection and for the development of medical decision support systems. In this work, a subset of EHR data from patient questionnaires is explored through two well-known clustering algorithms: k-means and agglomerative hierarchical clustering. The algorithms were tested on different types of data, primarily raw data and data where missing values had been imputed using different imputation techniques. The primary evaluation index for the clustering algorithms was the silhouette value, computed with Euclidean and cosine distance measures. The results showed that natural groupings most likely exist in the data set. Hierarchical clustering created higher-quality clusters than k-means, and the cosine measure yielded a good interpretation of distance. Data imputation had large effects on the data and, likewise, on the clustering results; other or more sophisticated techniques are needed for handling missing values in the data set.
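The evaluation index mentioned in the abstract above, the silhouette value under Euclidean and cosine distance, can be sketched in a few lines. This is a standard-library illustration only, not the thesis's code: the function names are assumptions, and the sketch assumes at least two clusters and no zero-length vectors.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.dist(a, b)


def cosine_distance(a, b):
    """One minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def mean_silhouette(points, labels, distance):
    """Average silhouette over all points; assumes at least two clusters."""
    scores = []
    clusters = set(labels)
    for i, (point, label) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the point's own cluster.
        same = [distance(point, q)
                for j, (q, lab) in enumerate(zip(points, labels))
                if lab == label and j != i]
        if not same:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(same) / len(same)
        # b: smallest mean distance to the members of any other cluster.
        b = math.inf
        for other in clusters - {label}:
            dists = [distance(point, q)
                     for q, lab in zip(points, labels) if lab == other]
            b = min(b, sum(dists) / len(dists))
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values close to 1 indicate compact, well-separated clusters, and values near 0 indicate overlapping ones. Comparing `mean_silhouette(points, labels, euclidean)` with `mean_silhouette(points, labels, cosine_distance)` for the same labels mirrors the kind of comparison the abstract describes.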

12

Hawken, Steven. "Methodological Approaches to Studying Risk Factors for Adverse Events Following Routine Vaccinations in the General Population and Vulnerable Subgroups of Individuals Using Health Administrative Data". Thesis, Université d'Ottawa / University of Ottawa, 2014. http://hdl.handle.net/10393/31774.

Abstract

Objectives: This thesis included 6 manuscripts which focused on the analysis of adverse events following immunization (AEFIs), including general health services utilization (emergency room (ER) visits and hospital admissions) and specific diagnoses (e.g. febrile convulsions). The main objectives of this research were: 1) to demonstrate the utility of the self-controlled case series (SCCS) design coupled with health administrative data for studying the safety of vaccines; 2) to introduce an innovative approach using relative incidence ratios (RIRs) within an SCCS analysis to identify risk factors for AEFIs and to overcome the healthy vaccinee bias; and 3) to demonstrate how SCCS and RIR analyses of health services outcomes in health administrative data can provide important insights into underlying physiological and behavioural mechanisms.
Data Sources: This work utilized Ontario health administrative data housed at the Institute for Clinical Evaluative Sciences (ICES). The study included all children born in Ontario, Canada between 2002 and 2011 (over 1 million children). Vaccinations were identified using OHIP fee-for-service billing codes for general vaccination. Admissions and ER visits for any reason were identified in the Discharge Abstract Database (DAD) and National Ambulatory Care Reporting System (NACRS). Primary reasons for admissions and ER visits were investigated using ICD-10-CA codes reported in the DAD and NACRS databases.
Statistical Methods: The self-controlled case series (SCCS) design was used to calculate the relative incidence of admissions, ER visits and other AEFIs. To investigate relative incidence for AEFIs across risk groups of interest, and to address the healthy vaccinee bias, RIRs were calculated. RIRs are the ratio of incidence ratios in a subgroup of interest relative to a designated reference group.
Results and Conclusions: The combined approach of using the SCCS design and RIRs to identify risk factors and overcome the healthy vaccinee bias proved to be a powerful approach to studying vaccine safety. Future work will be important to characterize the performance and validity of the SCCS + RIR approach in the presence of increasing levels of confounding and differing manifestations of the healthy vaccinee bias, as well as to elucidate the biological and behavioural mechanisms underlying our findings.
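The definition of the RIR given in the abstract above can be written as a formula. The notation is an assumption introduced here for illustration: $\mathrm{RI}_g$ denotes the relative incidence estimated by the SCCS analysis within group $g$.

```latex
\mathrm{RIR} \;=\; \frac{\mathrm{RI}_{\text{subgroup}}}{\mathrm{RI}_{\text{reference}}}
```

Under this notation, an RIR above 1 means that the relative incidence of the event is higher in the subgroup of interest than in the reference group, and an RIR below 1 means it is lower.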

13

Hammal, Mohamed Ali. "Contribution à la découverte de sous-groupes corrélés : Application à l’analyse des systèmes territoriaux et des réseaux alimentaires". Thesis, Lyon, 2020. http://www.theses.fr/2020LYSEI024.

Abstract

Feeding cities better in both quantity and quality, especially large cities, is a major challenge whose resolution requires a better understanding of the relationships between urban populations and their food. At the scale of urban food systems, we need to understand the availability of food resources in relation to the socio-economic profiles of the territories. But we lack tools and methods to systematically understand the relationships between consumption basins, supply and eating habits. The objective of this thesis is to contribute to the development of new IT tools to process temporal, heterogeneous and multi-source data in order to identify and characterize behaviors specific to a geographic area. For this, we rely on the joint exploration of gradual patterns, which capture rank correlations, and subgroups, in order to find contexts for which the correlations described by the gradual patterns are exceptionally strong compared to the rest of the data. We propose an enumeration algorithm based on pruning properties with upper bounds, as well as another algorithm which samples the patterns according to the quality measure. These approaches are validated not only on benchmark datasets, but also through an empirical study of the formation of food deserts in the Lyon urban area.

14

Tillberg, Anders. "A multidisciplinary risk assessment of dental restorative materials". Doctoral thesis, Umeå : Univ, 2008. http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-1860.

15

Bosc, Guillaume. "Anytime discovery of a diverse set of patterns with Monte Carlo tree search". Thesis, Lyon, 2017. http://www.theses.fr/2017LYSEI074/document.

Abstract

The discovery of patterns that strongly distinguish one class label from another is still a challenging data-mining task. Subgroup Discovery (SD) is a formal pattern mining framework that enables the construction of intelligible classifiers and, most importantly, the elicitation of interesting hypotheses from the data. However, SD still faces two major issues: (i) how to define appropriate quality measures to characterize the interestingness of a pattern; (ii) how to select a suitable heuristic search technique when exhaustive enumeration of the pattern space is infeasible. The first issue has been tackled by Exceptional Model Mining (EMM) for discovering patterns that cover tuples that locally induce a model substantially different from the model of the whole dataset. The second issue has been studied in SD and EMM mainly with the use of beam-search strategies and genetic algorithms for discovering a pattern set that is non-redundant, diverse and of high quality. In this thesis, we argue that the greedy nature of most such previous approaches produces pattern sets that lack diversity. Consequently, we formally define pattern mining as a game and solve it with Monte Carlo Tree Search (MCTS), a recent technique mainly used for games and planning problems in artificial intelligence. Contrary to traditional sampling methods, MCTS leads to an anytime pattern mining approach without assumptions on either the quality measure or the data. It converges to an exhaustive search if given enough time and memory. The exploration/exploitation trade-off allows the diversity of the result set to be improved considerably compared to existing heuristics. We show, through experiments on benchmark datasets and on a real-world olfaction dataset from our neuroscience application, that MCTS quickly finds a diverse pattern set of high quality. We also propose and validate a new quality measure specially tuned for imbalanced multi-label data.
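The exploration/exploitation trade-off mentioned in the abstract above is commonly implemented in MCTS with the UCB1 selection rule. The sketch below shows that standard rule only; whether the thesis uses UCB1 or a variant is not stated here, and the names `ucb1` and `select_child` and the dictionary layout are assumptions for illustration.

```python
import math


def ucb1(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: mean reward plus an exploration bonus.

    Unvisited children get an infinite score so that each is tried once.
    """
    if visits == 0:
        return math.inf
    exploitation = total_reward / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration


def select_child(children, parent_visits):
    """Pick the child with the highest UCB1 score.

    `children` maps a child name to a (total_reward, visits) pair.
    """
    return max(children, key=lambda name: ucb1(*children[name], parent_visits))
```

A rarely visited child can outrank a child with a higher mean reward because its exploration bonus is larger. In pattern mining terms, this is what pushes the search toward less explored regions of the pattern space instead of greedily refining the current best pattern.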

16

Underwood, Marilyn. "The Relationship of 10th-Grade District Progress Monitoring Assessment Scores to Florida Comprehensive Assessment Test Scores in Reading and Mathematics for 2008-2009". Doctoral diss., University of Central Florida, 2010. http://digital.library.ucf.edu/cdm/ref/collection/ETD/id/3845.

Abstract

The focus of this research was to investigate the use of a district created formative benchmark assessment in reading to predict student achievement for 10th-grade students on the Florida Comprehensive Assessment Test (FCAT) in one county in north central Florida. The purpose of the study was to provide information to high school principals and teachers to better understand how students were performing and learning and to maximize use of the formative district benchmark assessment in order to modify instruction and positively impact student achievement. This study expanded a prior limited study which correlated district benchmark assessment scores to FCAT scores for students in grades three through five in five elementary schools in the targeted county. The high correlations suggested further study. This research focused on secondary reading, specifically in 10th grade where both state and targeted county FCAT scores were low in years preceding this research. Investigated were (a) the district formative assessment in reading as a predictor of FCAT Reading scores, (b) differences in strength of correlation and prediction among student subgroups and between high schools, and (c) any relationships between reading formative assessment scores and Mathematics FCAT scores. An additional focus of this study was to determine best leadership practices in schools where there were the highest correlations between the formative assessment and FCAT Reading scores. Research on best practices was reviewed, and principals were interviewed to determine trends and themes in practice. Tenth grade students in the seven Florida targeted district high schools were included in the study. The findings of the study supported the effective use of formative assessments both in instruction and as predictors of students' performance on the FCAT. 
The results of the study also showed a significant correlation between performance on the reading formative assessment and performance on FCAT Mathematics. The data indicated no significant differences in the strength of correlation between student subgroups or between the high schools included in the study. Additionally, the practices of effective principals in using formative assessment data to inform instruction, gathered through personal interviews, were documented and described.
Ed.D.
Department of Educational Research, Technology and Leadership
Education
Education EdD

17

Tseng, Jen Yu and 曾仁佑. "Subgroup Data Analysis Using Survival Tree". Thesis, 2016. http://ndltd.ncl.edu.tw/handle/24649969791940075527.

Abstract

Master's
National Tsing Hua University
Institute of Statistics
104
In this thesis, we apply subgroup analysis to right-censored data, following the method of Su et al. (2008). Two methods are considered: the Interaction Tree, and a random forest used to estimate the importance of each covariate for the subgroup analysis. We use simulation and real data analysis to assess their performance. In the real data analysis, we analyze data from patients with lung cancer and use their gene expression as covariates. With a large number of covariates, however, the computation speed of the Interaction Tree becomes a clear problem. We therefore sort the covariates in advance and retain only the top-ranked ones, those with the largest marginal effects, for analysis. As a result, subgroups with heterogeneous treatment effects can be identified with this method.
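
The pre-screening step described in this abstract (sort the covariates, keep those with the largest marginal effect) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which marginal statistic is used, so the absolute Pearson correlation with an uncensored numeric outcome stands in for it here, and the function names and toy data are hypothetical. For right-censored survival data, a censoring-aware statistic would be needed instead.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no marginal signal
    return sxy / (sxx * syy) ** 0.5


def screen_covariates(columns, outcome, keep):
    """Return the names of the `keep` covariates with the largest |r|.

    `columns` maps a covariate name to its list of values.
    The marginal statistic (|Pearson r|) is an assumption of this sketch.
    """
    scores = {name: abs(pearson(vals, outcome)) for name, vals in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]


# Hypothetical toy data: three "genes" measured on five patients.
genes = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],  # tracks the outcome closely
    "g2": [2.0, 2.0, 2.0, 2.0, 2.0],  # constant, no signal
    "g3": [5.0, 1.0, 4.0, 2.0, 3.0],  # weak relation to the outcome
}
outcome = [1.1, 1.9, 3.2, 3.9, 5.1]
selected = screen_covariates(genes, outcome, keep=2)
```

Only the covariates in `selected` would then be passed on to the slower tree-based subgroup search.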

18

Lemmerich, Florian. "Novel Techniques for Efficient and Effective Subgroup Discovery". Doctoral thesis, 2014. https://nbn-resolving.org/urn:nbn:de:bvb:20-opus-97812.

Abstract

Large volumes of data are collected today in many domains. Often, there is so much data available that it is difficult to identify the relevant pieces of information. Knowledge discovery seeks to obtain novel, interesting and useful information from large datasets. One key technique for that purpose is subgroup discovery. It aims at identifying descriptions for subsets of the data, which have an interesting distribution with respect to a predefined target concept. This work improves the efficiency and effectiveness of subgroup discovery in different directions. For efficient exhaustive subgroup discovery, algorithmic improvements are proposed for three important variations of the standard setting: First, novel optimistic estimate bounds are derived for subgroup discovery with numeric target concepts. These allow for skipping the evaluation of large parts of the search space without influencing the results. Additionally, necessary adaptations to data structures for this setting are discussed. Second, for exceptional model mining, that is, subgroup discovery with a model over multiple attributes as target concept, a generic extension of the well-known FP-tree data structure is introduced. The modified data structure stores intermediate condensed data representations, which depend on the chosen model class, in the nodes of the trees. This allows the approach to be applied to many popular model classes. Third, subgroup discovery with generalization-aware measures is investigated. These interestingness measures compare the target share or mean value in the subgroup with the respective maximum value in all its generalizations. For this setting, a novel method for deriving optimistic estimates is proposed. In contrast to previous approaches, the novel measures are not exclusively based on the anti-monotonicity of instance coverage, but also take the difference of coverage between the subgroup and its generalizations into account. 
In all three areas, the advances lead to runtime improvements of more than an order of magnitude. The second part of the contributions focuses on the effectiveness of subgroup discovery. These improvements aim to identify more interesting subgroups in practical applications. For that purpose, the concept of expectation-driven subgroup discovery is introduced as a new family of interestingness measures. It computes the score of a subgroup based on the difference between the actual target share and the target share that could be expected given the statistics for the separate influence factors that are combined to describe the subgroup. In doing so, previously undetected interesting subgroups are discovered, while other, partially redundant findings are suppressed. Furthermore, this work also approaches practical issues of subgroup discovery: In that direction, the VIKAMINE II tool is presented, which extends its predecessor with a rebuilt user interface, novel algorithms for automatic discovery, new interactive mining techniques, as well as novel options for result presentation and introspection. Finally, some real-world applications are described that utilized the presented techniques. These include the identification of influence factors on the success and satisfaction of university students and the description of locations using tagging data of geo-referenced images.
Neue Techniken für effiziente und effektive Subgruppenentdeckung
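
To illustrate how an optimistic estimate lets a search skip parts of the search space without changing the result, here is a minimal sketch. It does not reproduce the bounds derived in the thesis, which concern numeric targets and generalization-aware measures. Instead it uses the well-known bound for a binary target with quality n · (p − p0): no refinement of a subgroup containing tp positives can score above tp · (1 − p0). The function names and toy data are hypothetical.

```python
def quality(rows, target, p0):
    """Quality n * (p - p0) of the subgroup `rows`, written as tp - n * p0."""
    tp = sum(1 for r in rows if r[target])
    return tp - len(rows) * p0


def optimistic_estimate(rows, target, p0):
    """Upper bound for every refinement: the best case keeps only the positives."""
    tp = sum(1 for r in rows if r[target])
    return tp * (1 - p0)


def best_subgroup(data, selectors, target):
    """Depth-first search over conjunctions of (attribute, value) selectors."""
    p0 = sum(1 for r in data if r[target]) / len(data)
    best = (0.0, ())
    evaluated = 0

    def search(rows, description, start):
        nonlocal best, evaluated
        for i in range(start, len(selectors)):
            attr, value = selectors[i]
            covered = [r for r in rows if r[attr] == value]
            if not covered:
                continue
            evaluated += 1
            new_desc = description + ((attr, value),)
            q = quality(covered, target, p0)
            if q > best[0]:
                best = (q, new_desc)
            # Refine only if some refinement could still beat the current best.
            if optimistic_estimate(covered, target, p0) > best[0]:
                search(covered, new_desc, i + 1)

    search(data, (), 0)
    return best, evaluated


# Hypothetical toy data: 8 records, 3 of them positive (p0 = 0.375).
data = [
    {"smoker": "yes", "age": "old", "sick": True},
    {"smoker": "yes", "age": "old", "sick": True},
    {"smoker": "yes", "age": "young", "sick": True},
    {"smoker": "yes", "age": "young", "sick": False},
    {"smoker": "no", "age": "old", "sick": False},
    {"smoker": "no", "age": "old", "sick": False},
    {"smoker": "no", "age": "young", "sick": False},
    {"smoker": "no", "age": "young", "sick": False},
]
selectors = [("smoker", "yes"), ("smoker", "no"), ("age", "old"), ("age", "young")]
result, evaluated = best_subgroup(data, selectors, "sick")
```

In this toy run the branches below `smoker = no` and `age = old` are never expanded, because their optimistic estimates cannot exceed the quality already found for `smoker = yes`.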

19

Atzmüller, Martin. "Knowledge-Intensive Subgroup Mining - Techniques for Automatic and Interactive Discovery". Doctoral thesis, 2006. https://nbn-resolving.org/urn:nbn:de:bvb:20-opus-21004.

Abstract

Data mining has proved its significance in various domains and applications. As an important subfield of the general data mining task, subgroup mining can be used, e.g., for marketing purposes in business domains, or for quality profiling and analysis in medical domains. The goal is to efficiently discover novel, potentially useful and ultimately interesting knowledge. However, in real-world situations these requirements often cannot be fulfilled, e.g., if the applied methods do not scale for large data sets, if too many results are presented to the user, or if many of the discovered patterns are already known to the user. This thesis proposes a combination of several techniques in order to cope with the sketched problems: We discuss automatic methods, including heuristic and exhaustive approaches, and especially present the novel SD-Map algorithm for exhaustive subgroup discovery that is fast and effective. For an interactive approach we describe techniques for subgroup introspection and analysis, and we present advanced visualization methods, e.g., the zoomtable that directly shows the most important parameters of a subgroup and that can be used for optimization and exploration. We also describe various visualizations for subgroup comparison and evaluation in order to support the user during these essential steps. Furthermore, we propose to include possibly available background knowledge that is easy to formalize into the mining process. We can utilize the knowledge in many ways: To focus the search process, to restrict the search space, and ultimately to increase the efficiency of the discovery method. We especially present background knowledge to be applied for filtering the elements of the problem domain, for constructing abstractions, for aggregating values of attributes, and for the post-processing of the discovered set of patterns. 
Finally, the techniques are combined into a knowledge-intensive process supporting both automatic and interactive methods for subgroup mining. The practical significance of the proposed approach strongly depends on the available tools. We introduce the VIKAMINE system as a highly-integrated environment for knowledge-intensive active subgroup mining. Also, we present an evaluation consisting of two parts: With respect to objective evaluation criteria, i.e., comparing the efficiency and the effectiveness of the subgroup discovery methods, we provide an experimental evaluation using generated data. For that task we present a novel data generator that allows a simple and intuitive specification of the data characteristics. The results of the experimental evaluation indicate that the novel SD-Map method outperforms the other described algorithms using data sets similar to the intended application concerning the efficiency, and also with respect to precision and recall for the heuristic methods. Subjective evaluation criteria include the user acceptance, the benefit of the approach, and the interestingness of the results. We present five case studies utilizing the presented techniques: The approach has been successfully implemented in medical and technical applications using real-world data sets. The method was very well accepted by the users that were able to discover novel, useful, and interesting knowledge
Data mining is applied with great success in many domains. Subgroup discovery, as an important subfield of data mining, can be used to good effect, for example, in marketing, or for quality control and analysis in medical domains. The general goal is to discover potentially useful and ultimately interesting knowledge. However, these requirements often cannot be met in practice, for instance if the applied methods scale poorly to larger data sets, if too many results are presented to the user, or if the user already knows many of the discovered subgroup patterns. This thesis presents a combination of automatic and interactive techniques to cope better with these problems: automatic heuristic and exhaustive subgroup discovery methods are discussed, and in particular the novel SD-Map algorithm for exhaustive subgroup discovery is presented, which is both fast and effective. Regarding the interactive techniques, methods for subgroup introspection and analysis and advanced visualization techniques are presented, for example the zoomtable, which directly visualizes the parameters most important for subgroup discovery and can be used for optimization and exploration. In addition, various visualizations for comparing and evaluating subgroups are described in order to support the user during these essential steps. Furthermore, background knowledge that is easy to formalize is presented, which can be used in the subgroup discovery process in many ways: to focus the discovery process, to restrict the search space, and ultimately to increase the efficiency of the discovery method. 
In particular, background knowledge is introduced for filtering the elements of the application domain, for defining suitable abstractions, for aggregating values, and for post-processing the discovered subgroup patterns. Finally, these techniques are integrated into a knowledge-intensive process that includes both automatic and interactive methods for subgroup discovery. The practical significance of the presented approach depends strongly on the available tools. To this end, the VIKAMINE system is presented as a highly integrated environment for knowledge-intensive active subgroup discovery. The evaluation of the approach consists of two parts: regarding the efficiency and effectiveness of the methods, an experimental evaluation with synthetic data is presented. For this purpose, a novel data generator developed in the thesis is applied, which allows a simple and intuitive specification of the data characteristics. For the evaluation of the approach, data were generated with characteristics similar to those of the data in the intended application area. The results of the evaluation show that the novel SD-Map algorithm is superior to the other standard algorithms described in the thesis. SD-Map offers clear advantages both in efficiency and, relative to the heuristic algorithms, in precision and recall. Subjective evaluation criteria are given by user acceptance, the benefit of the approach, and the interestingness of the results. Five case studies on the use of the presented techniques are described: the approach was applied in medical and technical applications with real data. It was very well accepted by the users, and novel, useful, and interesting knowledge was discovered in practical use.

20

Costa, Afonso José Ourives Marques da. "Handling Data Difficulty Factors via a Meta-Learning Approach". Master's thesis, 2020. http://hdl.handle.net/10316/92560.

Abstract

Project work for the Integrated Master's in Biomedical Engineering, presented to the Faculty of Sciences and Technology
Machine learning applications are challenged by data complexity factors. These are responsible for degrading data quality, and dealing with them is an important task for avoiding a decline in classifier performance. Among the complexity factors, class imbalance, which is characteristic of many biomedical databases, is usually addressed with preprocessing algorithms, which are effective in improving the performance of classification tasks. Since the selection of the most suitable algorithm for handling class imbalance is often based on "brute-force" approaches, recommendation systems have been developed to provide the optimal strategy for a given problem, based on the meta-characteristics of the dataset. However, although several recommendation systems have been successful, they cannot provide interpretable knowledge, since only the input (dataset) and the output (recommended strategy) of these systems are known. To solve this problem, the aim of this dissertation is to study the relations between data meta-characteristics and preprocessing algorithms in the performance of classifiers. To achieve this aim, a meta-learning methodology was developed, based on "Exceptional Preferences Mining", which proved appropriate for providing interpretable conditions concerning the relations between data meta-characteristics and the ranking of preprocessing algorithms. 
In addition, a new metric is proposed to highlight the subgroups in which large variations are observed in the performance of several preprocessing algorithms. The experiments include 163 databases, preprocessed with 9 data-level strategies, from which meta-characteristics from 8 groups were extracted. The most relevant results show that using a strategy to handle class imbalance may not always be necessary and that there is no evident relation with the ratio of points between the majority and minority classes, but rather with the association of class imbalance with other complexity factors. Additionally, the domains of application of strategies for handling skewed class distributions are individually described, along with other results useful for developing new recommendation systems.
Machine learning applications are challenged by data difficulty factors, which are responsible for the degradation of data quality, and dealing with them is a demanding task. Among the difficulty factors, class imbalance, which is noticeable in many biomedical databases, is often tackled with preprocessing algorithms that effectively improve classification performance. Since the selection of an imbalance strategy for a problem often encompasses "brute-force" approaches, recommendation systems have been developed to provide optimal imbalance strategies for the problem at hand, based on the meta-characteristics of the dataset. However, despite the success of such systems, arguably these do not provide any insightful information, since only the inputs (datasets) and outputs (recommended imbalance strategies) of these systems are provided. Addressing this issue, the purpose of this dissertation is to provide a study of the relations between data meta-characteristics and imbalance strategies in the performance of classifiers. To this end, a meta-learning-based framework was developed, based on Exceptional Preferences Mining, which has proven to be suitable to deliver interpretable conditions concerning the relations between data meta-characteristics and the ranking of preprocessing algorithms. Additionally, a novel metric was proposed, which is suitable to highlight the subgroups where steep variations are observable in the performance of imbalance strategies. The experiments considered 163 datasets, where meta-features from 8 groups were extracted and preprocessed with 9 data-level imbalance strategies. The main findings include that employing an imbalance strategy may not always be required and that there is no evident relation with the imbalance ratio, but rather with the association of imbalance with other difficulty factors. 
Moreover, the domains of application of individual imbalance strategies are described, among other findings suitable for the design of novel recommendation systems.

21

Wang, Xiaojing. "Bayesian Modeling Using Latent Structures". Diss., 2012. http://hdl.handle.net/10161/5848.

Abstract

This dissertation is devoted to modeling complex data from the Bayesian perspective via constructing priors with latent structures. There are three major contexts in which this is done: strategies for the analysis of dynamic longitudinal data, estimating shape-constrained functions, and identifying subgroups. The methodology is illustrated in three different interdisciplinary contexts: (1) adaptive measurement testing in education; (2) emulation of computer models for vehicle crashworthiness; and (3) subgroup analyses based on biomarkers.

Chapter 1 presents an overview of the utilized latent structured priors and an overview of the remainder of the thesis. Chapter 2 is motivated by the problem of analyzing dichotomous longitudinal data observed at variable and irregular time points for adaptive measurement testing in education. One of its main contributions lies in developing a new class of Dynamic Item Response (DIR) models via specifying a novel dynamic structure on the prior of the latent trait. The Bayesian inference for DIR models is undertaken, which permits borrowing strength from different individuals, allows the retrospective analysis of an individual's changing ability, and allows for online prediction of one's ability changes. Proof of posterior propriety is presented, ensuring that the objective Bayesian analysis is rigorous.

Chapter 3 deals with nonparametric function estimation under shape constraints, such as monotonicity, convexity or concavity. A motivating illustration is to generate an emulator to approximate a computer model for vehicle crashworthiness. Although Gaussian processes are very flexible and widely used in function estimation, they are not naturally amenable to incorporation of such constraints. Gaussian processes with the squared exponential correlation function have the interesting property that their derivative processes are also Gaussian processes and are jointly Gaussian processes with the original Gaussian process. This allows one to impose shape constraints through the derivative process. Two alternative ways of incorporating derivative information into Gaussian process priors are proposed, with one focusing on scenarios (important in emulation of computer models) in which the function may have flat regions.

Chapter 4 introduces a Bayesian method to control for multiplicity in subgroup analyses through tree-based models that limit the subgroups under consideration to those that are a priori plausible. Once the prior modeling of the tree is accomplished, each tree will yield a statistical model; Bayesian model selection analyses then complete the statistical computation for any quantity of interest, resulting in multiplicity-controlled inferences. This research is motivated by a problem of biomarker and subgroup identification to develop tailored therapeutics. Chapter 5 presents conclusions and some directions for future research.
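
The derivative property mentioned for Chapter 3 can be written out for a one-dimensional input. The notation below (variance σ², length scale ℓ, zero-mean process f) is assumed here for illustration and is not taken from the dissertation. Because differentiation is a linear operation, f and f′ are jointly Gaussian, and their covariances follow by differentiating the kernel:

```latex
\begin{align}
k(x, x') &= \sigma^2 \exp\!\left( -\frac{(x - x')^2}{2\ell^2} \right), \\
\operatorname{Cov}\bigl(f'(x),\, f(x')\bigr)
  &= \frac{\partial k(x, x')}{\partial x}
   = -\frac{x - x'}{\ell^2}\, k(x, x'), \\
\operatorname{Cov}\bigl(f'(x),\, f'(x')\bigr)
  &= \frac{\partial^2 k(x, x')}{\partial x\, \partial x'}
   = \left( \frac{1}{\ell^2} - \frac{(x - x')^2}{\ell^4} \right) k(x, x').
\end{align}
```

Constraining the sign of f′ at selected inputs is one commonly used way to encourage monotonicity under such a prior; the two approaches proposed in the dissertation may differ in detail.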

Dissertation

22

Shen, Hua. "Statistical Methods for Life History Analysis Involving Latent Processes". Thesis, 2014. http://hdl.handle.net/10012/8496.

Abstract

Incomplete data often arise in the study of life history processes. Examples include missing responses, missing covariates, and unobservable latent processes in addition to right censoring. This thesis is on the development of statistical models and methods to address these problems as they arise in oncology and chronic disease. Methods of estimation and inference in parametric, weakly parametric and semiparametric settings are investigated. Studies of chronic diseases routinely sample individuals subject to conditions on an event time of interest. In epidemiology, for example, prevalent cohort studies aiming to evaluate risk factors for survival following onset of dementia require subjects to have survived to the point of screening. In clinical trials designed to assess the effect of experimental cancer treatments on survival, patients are required to survive from the time of cancer diagnosis to recruitment. Such conditions yield samples featuring left-truncated event time distributions. Incomplete covariate data often arise in such settings, but standard methods do not deal with the fact that the covariate distribution is also affected by left truncation. We develop a likelihood and an estimation algorithm for dealing with incomplete covariate data in such settings. An expectation-maximization algorithm deals with the left truncation by using the covariate distribution conditional on the selection criterion. An extension to deal with sub-group analyses in clinical trials is described for the case in which the stratification variable is incompletely observed. In studies of affective disorder, individuals are often observed to experience recurrent symptomatic exacerbations warranting hospitalization. Interest lies in modeling the occurrence of such exacerbations over time and identifying associated risk factors to better understand the disease process. 
In some patients, recurrent exacerbations are temporally clustered following disease onset, but cease to occur after a period of time. We develop a dynamic mover-stayer model in which a canonical binary variable associated with each event indicates whether the underlying disease has resolved. An individual whose disease process has not resolved will experience events following a standard point process model governed by a latent intensity. If and when the disease process resolves, the complete data intensity becomes zero and no further events will arise. An expectation-maximization algorithm is developed for parametric and semiparametric model fitting based on a discrete time dynamic mover-stayer model and a latent intensity-based model of the underlying point process. The method is applied to a motivating dataset from a cohort of individuals with affective disorder experiencing recurrent hospitalization for their mental health disorder. Interval-censored recurrent event data arise when the event of interest is not readily observed but the cumulative event count can be recorded at periodic assessment times. Extensions on model fitting techniques for the dynamic mover-stayer model are discussed and incorporate interval censoring. The likelihood and algorithm for estimation are developed for piecewise constant baseline rate functions and are shown to yield estimators with small empirical bias in simulation studies. Data on the cumulative number of damaged joints in patients with psoriatic arthritis are analysed to provide an illustrative application.

23

Lee, Hsi-Yen and 李錫諺. "Iterative clustering of gene expression data in search of subgroups of general population". Thesis, 2018. http://ndltd.ncl.edu.tw/handle/4sadzd.
